<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Toward a Thermodynamics of Meaning</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Jonathan</forename><forename type="middle">Scott</forename><surname>Enderle</surname></persName>
							<email>enderlej@upenn.edu</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Pennsylvania Libraries</orgName>
								<address>
									<addrLine>3420 Walnut St</addrLine>
									<postCode>19104-6206</postCode>
									<settlement>Philadelphia</settlement>
									<region>PA</region>
									<country>United States of America</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Toward a Thermodynamics of Meaning</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">3AA0CF6FA4E7CF43B12B4AF0CC10060E</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T22:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>language modeling</term>
					<term>natural language semantics</term>
					<term>artificial intelligence</term>
					<term>statistical mechanics</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>As language models such as GPT-3 become increasingly successful at generating realistic text, questions about what purely text-based modeling can learn about the world have become more urgent. Is text purely syntactic, as skeptics argue? Or does it in fact contain some semantic information that a sufficiently sophisticated language model could use to learn about the world without any additional inputs? This paper describes a new model that suggests some qualified answers to those questions. By theorizing the relationship between text and the world it describes as an equilibrium relationship between a thermodynamic system and a much larger reservoir, this paper argues that even very simple language models do learn structural facts about the world, while also proposing relatively precise limits on the nature and extent of those facts. This perspective promises not only to answer questions about what language models actually learn, but also to explain the consistent and surprising success of cooccurrence prediction as a meaning-making strategy in AI.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Since the introduction of the Transformer architecture in 2017 <ref type="bibr" target="#b28">[29]</ref>, neural language models have developed increasingly realistic text-generation abilities, and have demonstrated impressive performance on many downstream NLP tasks. Assessed optimistically, these successes suggest that language models, as they learn to generate realistic text, also infer meaningful information about the world outside of language.</p><p>Yet there are reasons to remain skeptical. Because they are so sophisticated, these models can exploit subtle flaws in the design of language comprehension tasks that have been overlooked in the past. This may make it difficult to realistically assess these models' capacity for true language comprehension. Moreover, there is a long tradition of debate among linguists, philosophers, and cognitive scientists about whether it is even possible to infer semantics from purely syntactic evidence <ref type="bibr" target="#b25">[26]</ref>.</p><p>This paper proposes a simple language model that directly addresses these questions by viewing language as a system that interacts with another, much larger system: a semantic domain that the model knows almost nothing about. Given a few assumptions about how these two systems relate to one another, this model implies that some properties of the linguistic system must be shared with its semantic domain, and that our measurements of those properties are valid for both systems, even though we have access only to one. But this conclusion holds only for some properties. 
The simplest version of this model closely resembles existing word embeddings based on low-rank matrix factorization methods, and performs competitively on a balanced analogy benchmark (BATS <ref type="bibr" target="#b8">[9]</ref>).</p><p>The assumptions and the mathematical formulation of this model are drawn from the statistical mechanical theory of equilibrium states. By adopting a materialist view that treats interpretations as physical phenomena, rather than as abstract mental phenomena, this model shows more precisely what we can and cannot infer about meaning from text alone. Additionally, the mathematical structure of this model suggests a close relationship between cooccurrence prediction and meaning, if we understand meaning as a mapping between fragments of language and possible interpretations. There is reason to believe that this line of reasoning will apply to any model that operates by predicting cooccurrence, however sophisticated. Although the model described here is a pale shadow of a hundred-billion-parameter model like GPT-3 <ref type="bibr" target="#b4">[5]</ref>, the fundamental principle of its operation, this paper argues, is the same.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Previous Work</head><p>Most recent work on language modeling builds on the word2vec word embedding model <ref type="bibr" target="#b19">[20]</ref> and its descendants such as GloVe <ref type="bibr" target="#b22">[23]</ref>. These models drew from a longer tradition of distributional semantics in linguistics <ref type="bibr" target="#b10">[11]</ref> [7] and early machine translation research <ref type="bibr">[30] [18]</ref> [17] <ref type="bibr" target="#b7">[8]</ref>. The promise of word embedding models for research in the humanities was quickly recognized, leading to historical studies of analogical language <ref type="bibr" target="#b11">[12]</ref> and diachronic lexical change <ref type="bibr" target="#b9">[10]</ref>, but questions remained about the utility of embeddings for close humanistic analysis. Word embedding models suffer from stability problems, yielding seemingly precise answers that change when training input is modified only slightly <ref type="bibr" target="#b0">[1]</ref>, and their internal geometric structure is poorly understood <ref type="bibr" target="#b20">[21]</ref>.</p><p>Attempts to build a better theoretical understanding of word embeddings have often focused on exploring the ways different models prove to be mathematically equivalent in some limit <ref type="bibr" target="#b14">[15]</ref> [2], or showing the importance of preprocessing and hyperparameter selection. In many cases, with optimal hyperparameter choices, factorizing word cooccurrence matrices using SVD and a log weighting is sufficient to produce results competitive with state-of-the-art models <ref type="bibr" target="#b15">[16]</ref>  <ref type="bibr" target="#b8">[9]</ref>. For these reasons, the claim that word embeddings are indeed representations of meaning, and not merely dense representations of word cooccurrence, still lacks strong theoretical support. 
On the other hand, even simple cooccurrence data seems intuitively to capture something about meaning in a way that remains mysterious <ref type="bibr" target="#b1">[2]</ref>.</p><p>More recent language modeling has focused on sequence prediction, either using recurrent neural networks <ref type="bibr" target="#b23">[24]</ref> or attention-based mechanisms <ref type="bibr" target="#b28">[29]</ref>. Large language models using the Transformer architecture apparently capture rich semantic information usable in a range of downstream applications <ref type="bibr" target="#b12">[13]</ref>. But as with word embeddings, there remain empirical and theoretical reasons to be skeptical that these models are capturing information about meaning, rather than performing an extremely sophisticated and accurate version of positionally-aware cooccurrence prediction. At least some attempts to use Transformer models to perform challenging natural language comprehension tasks have shown that existing problem datasets contain subtle linguistic cues that leak information about correct answers <ref type="bibr" target="#b21">[22]</ref> <ref type="bibr" target="#b18">[19]</ref>. These cues have been missed in the past, but with their linguistic sophistication, newer models recognize them, producing spurious state-of-the-art results without demonstrating true comprehension.</p><p>Recent work by Bender and Koller <ref type="bibr" target="#b3">[4]</ref> provides an even stronger theoretical case against the claim that language models infer meaning beyond simple cooccurrence. Synthesizing arguments and evidence from linguistics and philosophy, including Searle's famous Chinese Room argument <ref type="bibr" target="#b25">[26]</ref>, Bender and Koller argue that "the language modeling task, because it only uses form as training data, cannot in principle lead to learning of meaning." Or, in Searle's pithy formulation, the operations of a computer have "syntax but no semantics." 
Bender and Koller's reliance on Searle is notable, given that Searle's argument was not against language modeling, but against the very possibility of artificial intelligence. Anyone who takes his reasoning entirely seriously should forever abandon the notion that a computational process could truly comprehend meaning. Yet in their final analysis, Bender and Koller back away from Searle's strongest claims, acknowledging that "if form is augmented with grounding data of some kind, then meaning can conceivably be learned to the extent that the communicative intent is represented in that data," and that a sufficiently successful language model "has probably learned something about meaning."</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Meaning, Cooccurrence, and Thermodynamics</head><p>How can we synthesize these seemingly contradictory bodies of theory and evidence? It's plausible to claim that language models can never do more than predict the way elements of language cooccur in text, since they never see any other kind of evidence. And yet even the simplest kinds of cooccurrence prediction, such as basic matrix factorization, produce surprisingly good representations of something that looks intuitively like meaning. Suppose that rather than examining the details of particular language models to see how they differ, and which might be more or less correct, we focus on what they have in common. Is there some unrecognized connection between meaning and cooccurrence prediction in all its forms? This section proposes such a connection based on a model borrowed from statistical mechanics. Similar approaches have been applied to practical language modeling problems <ref type="bibr" target="#b27">[28]</ref> [27] and theoretical discussions of algorithmic and semantic information <ref type="bibr">[3] [14]</ref>. But to the author's knowledge, no prior work has used thermodynamic analogies to specifically investigate the relationship between language and its semantic domain.</p><p>This model begins by treating interpretations as possible configurations of an unknown physical system. It then constructs a statistical mechanical partition function that counts the number of interpretations applicable to each fragment of language in a corpus. It immediately follows that the Hessian of that function-the matrix of its mixed second partial derivatives-is a covariance matrix describing word cooccurrences. The Hessian matrix can be used, in turn, to approximate directional derivatives of the partition function, which describe the ways the partition function changes when the meanings of words are slightly modified. 
These directional derivatives are word vectors, with all the expected properties.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Model Assumptions</head><p>Setting up our model requires some odd assumptions about how language works. To begin with, it requires that we assume that meaning is quantifiable in the most naive way. It's not uncommon in colloquial speech to talk about the amount of meaning a phrase has, without specifying what the phrase means. Some phrases, we might say, are meaningless; others are full of meaning. To construct a statistical mechanical model of meaning, it is useful to assume that this is a perfectly correct way of quantifying meaning, and that, so quantified, meaning is a conserved value that plays the same role as energy in a typical thermodynamic ensemble.</p><p>As long as we are making extravagant assumptions, let's also assume that for a given linguistic system and an associated semantic domain, words have a stable average capacity for holding meaning, and that word counts are conserved values just like energy, so that a combined linguistic system and associated semantic domain contains an unknown but fixed number of copies of every possible word. Leaving aside the linguistic significance of these assumptions for a moment, we can skip ahead by recognizing them as formally equivalent to the assumptions made in the construction of the grand canonical ensemble.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">The Grand Canonical Ensemble</head><p>In classical thermodynamics, the grand canonical ensemble describes a system of particles-such as a container of gas-that is in thermodynamic and chemical equilibrium with a much larger system, a reservoir of energy and particles. Concretely, this means that the temperature of the gas in the container is the same as that of its surroundings (assumed to be homogeneous, and far larger than the container), and that the container can exchange particles with its surroundings, but at a steady state, so that it is as likely to lose a particle as to gain a particle at any given moment. Furthermore, both the amount of energy and the number of particles shared between the container and its surroundings are fixed-energy and particle number are conserved values.</p><p>To understand the behavior of this ensemble, we begin by imagining that we could track the exact position and momentum of every particle in the system (container), as well as the exact position and momentum of every particle in the reservoir (surroundings of the container). At a given instant in time, these values constitute a "microstate." Since we have both a system and a reservoir, we can divide a single microstate into parts, considering just the microstate of the system, or just the microstate of the reservoir. We can also determine that certain system microstates are incompatible with certain reservoir microstates, because the combination would violate a conservation law. 
In other words, for some pairs of system microstate and reservoir microstate to coexist, energy or particles would have to be created or destroyed, which would violate the rule that energy and particle number are conserved values.</p><p>If we rule out all system-reservoir microstate pairs that are not compatible (↮), and assume that all reservoir microstates are equally likely-an acceptable approximation when the reservoir is far larger than the system-then we can approximate the probability of a given system microstate s i by counting the number of reservoir microstates that are compatible (↔) with it. Using Iverson brackets ([i = j] = δ ij ) we can say</p><formula xml:id="formula_0">p i ∝ ∑ j [r j ↔ s i ]<label>(1)</label></formula><p>To recover the probability itself, we can divide by the sum over all s:</p><formula xml:id="formula_1">p i = ∑ j [r j ↔ s i ] / ∑ j,k [r j ↔ s k ]<label>(2)</label></formula><p>These sums are very large, and it's not clear how to calculate them. But it turns out we don't need to. Given a few standard assumptions from thermodynamics, our assumptions about conserved quantities, and a bit of calculus, it's possible to use them to derive the following function:</p><formula xml:id="formula_2">Z = ∑ i e β(µN i −E i )<label>(3)</label></formula><p>This is the grand canonical partition function. From it, we can then directly calculate the probability of system microstate i like so:</p><formula xml:id="formula_3">p i = e β(µN i −E i ) / Z<label>(4)</label></formula><p>This formula tells us, first, that at a given fixed temperature T determined by β = 1/(k B T ), system microstates with more energy (E i ) are less probable, because they are compatible with fewer reservoir microstates. It also tells us that for any given energy level, system microstates containing more particles (N i ) with a higher chemical potential (µ) are more probable. 
This is because given two systems with the same energy, the system with a higher overall potential has a higher energy capacity.</p><p>This partition function can be extended to systems that have multiple kinds ("species") of particles. In that case, each species has its own chemical potential and count. For a system with k different species</p><formula xml:id="formula_5">Z = ∑ i e β(µ 1 N 1,i +µ 2 N 2,i +...+µ k N k,i −E i )</formula><p>(5)</p><formula xml:id="formula_6">p i = e β(µ 1 N 1,i +µ 2 N 2,i +...+µ k N k,i −E i ) Z (6)</formula></div>
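The multi-species ensemble of equations 5 and 6 can be made concrete with a small numerical sketch. The microstates, energies, and chemical potentials below are invented purely for illustration; the point is only to show how the Boltzmann weights, the partition function Z, and the microstate probabilities fit together.

```python
import math

# Toy grand canonical ensemble with two particle species. All values
# here are invented for illustration.
beta = 1.0          # inverse temperature, beta = 1/(k_B T)
mu = [0.5, 1.2]     # chemical potential for each species

# Each microstate: (particle counts per species, energy E_i)
microstates = [
    ((0, 0), 0.0),
    ((1, 0), 1.0),
    ((0, 1), 1.5),
    ((1, 1), 2.0),
]

def weight(counts, energy):
    # Boltzmann factor e^{beta(mu_1 N_1 + mu_2 N_2 - E)}: the summand
    # in equation 5
    return math.exp(beta * (sum(m * n for m, n in zip(mu, counts)) - energy))

Z = sum(weight(c, e) for c, e in microstates)       # equation 5
probs = [weight(c, e) / Z for c, e in microstates]  # equation 6

# Higher-energy microstates are less probable; adding a particle with a
# positive chemical potential raises a microstate's probability at a
# given energy.
assert abs(sum(probs) - 1.0) < 1e-12
```

With these invented numbers, the empty microstate is not automatically the most probable one: the particle-bearing states are boosted by their chemical potential terms, exactly the trade-off the prose describes.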
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">From Compatibility to Interpretation</head><p>What does all this have to do with language? The first hint that the grand canonical partition function might have some usefulness as a model for language is that energy and meaning (in the naive quantitative sense described above) both impose similar compatibility constraints on the system and reservoir. Just as higher-energy states in the system correspond to fewer possible reservoir states, more meaningful sentences correspond to fewer possible interpretations. A statement with less meaning has less precision, while a statement with more meaning has more precision, eliminating a larger number of possible interpretations. The line of reasoning is similar for particle species and words. Just as a particle species with higher chemical potential has higher energy capacity, a word with a higher "semantic potential" has a higher capacity for meaning. Consider, for example, the sentence "It stinks." Then compare it to "On January 15, 2008, a rainfall of 110mm was recorded in the city of Dubai." The specific interpretations these sentences can be given will depend on context, and in some contexts, "It stinks" might be a meaningful and precise sentence. But on balance, we should expect "It stinks" to be less meaningful than "On January 15...," both because it contains fewer words, and because the words it contains are less precise than words like "rainfall" and "Dubai."</p><p>Although this is a simple way of thinking about meaning, it is not as simplistic as it may seem at first. Consider the sentence "Ask for me tomorrow, and you shall find me a grave man," as uttered by a dying Mercutio. One might think that by the logic above, this sentence would be made less meaningful by the presence of an ambiguous word, "grave," here meaning either "serious" or "a place of burial." But a more careful analysis leads to a different conclusion. 
If these two senses were available independently, and the sentence could be properly interpreted in two different ways, it would indeed be less meaningful because of this ambiguity. In this context, however, choosing just one of those senses to the exclusion of the other would yield a misreading of the sentence. It does not invite two different possible interpretations; it invites one interpretation that combines two distinct concepts both conveyed by the word "grave." By eliminating interpretations that do not combine these two senses together, this sentence uses ambiguity to achieve a higher degree of precision. Analyzed this way, literary language is often likely to be more precise and meaningful than everyday language, despite sometimes having greater surface ambiguity.</p><p>If we translate these ideas into a mathematical form, and start thinking about compatibility (↔) as a semantic relationship, then equation 2 says roughly that the probability of a given sentence (s i ) is equal to the number of interpretations (r j ) it has, divided by the number of interpretations that all possible grammatically correct sentences have. The refinement of that equation to equation 6 now says that sentences with more meaning are less probable, because they are compatible with fewer interpretations, and that for any given degree of meaningfulness, sentences with a higher semantic potential are more probable. (That is, precise sentences are harder to write, but it's easier to write a precise sentence with more words, and it's harder to pack all your meaning into just a few very precise words.)</p></div>
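The counting reading of equation 2 can be sketched in a few lines. The interpretation counts below are invented stand-ins, not measurements; the sketch only shows the direction of the relationship, that a more precise sentence, being compatible with fewer interpretations, comes out less probable.

```python
# Sketch of equation 2's counting view of sentence probability: each
# sentence's probability is proportional to the number of reservoir
# states (interpretations) compatible with it. The counts are invented
# for illustration.
compatible = {
    "it stinks": 6,                    # vague: many interpretations
    "110mm of rain fell in dubai": 1,  # precise: few interpretations
}

total = sum(compatible.values())
p = {s: n / total for s, n in compatible.items()}

# More meaningful (more precise) sentences are less probable.
assert p["110mm of rain fell in dubai"] < p["it stinks"]
```

Nothing here fixes what either sentence is about; the model only registers how many interpretations each one rules out, which is exactly the limit on semantic inference the paper proposes.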
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">From Ensembles to Vectors</head><p>Most word embedding models generate word vectors by using a supervised or semi-supervised model to predict cooccurrences, and the vectors themselves aren't significant outside that predictive frame. But the picture is quite different for statistical-mechanical models such as this one. One of the most elegant properties of partition functions is that a wide range of thermodynamic quantities can be expressed directly as partial derivatives of the partition function or its logarithm.</p><p>For example, suppose we would like to determine the number of particles of a particular kind present in all possible states of our system (N k ), and take the average. We can calculate that value by taking the partial derivative of the logarithm of equation 5 with respect to the chemical potential of that species, and dividing out β = 1/(k B T ).</p><formula xml:id="formula_7">⟨N k ⟩ = (1/β) ∂ ln Z(µ k )/∂µ k<label>(7)</label></formula><p>Since ∂ ln f (x)/∂x = [∂f (x)/∂x]/f (x), this simplifies to a probability-weighted sum of N k counts, with β left over only as a constant multiplier. Shifting β to the left hand side of the equation gives</p><formula xml:id="formula_8">β⟨N k ⟩ = [∂Z(µ k )/∂µ k ] / Z(µ k ) = ∑ i βN k,i e β(µ 1 N 1,i +µ 2 N 2,i +...+µ k N k,i −E i ) / Z = β ∑ i N k,i p i<label>(8)</label></formula><p>This line of reasoning can be extended to second partial derivatives. The variance of N k is given by</p><formula xml:id="formula_10">β 2 [ ⟨N 2 k ⟩ − ⟨N k ⟩ 2 ] = ∂ 2 ln Z(µ k )/∂µ 2 k<label>(9)</label></formula><p>Similarly, the covariance of N k and N j is a mixed partial derivative.</p><formula xml:id="formula_11">β 2 [⟨N k N j ⟩ − ⟨N k ⟩⟨N j ⟩] = ∂ 2 ln Z(µ k , µ j )/∂µ k ∂µ j<label>(10)</label></formula><p>These last two equations can be used to construct a matrix that has two simultaneous meanings. 
It is, first, a covariance matrix that describes the way particle counts are correlated with one another in the system. But it is also a Hessian matrix of second partial derivatives, meaning that it describes the way small modifications to the chemical potential terms change the overall partition function, shifting its energy balance across all possible system microstates. This means that even if we can't construct the partition function itself, we can in principle measure the covariance of particles empirically, and use the resulting matrix to reconstruct information about the partition function and the thermodynamic ensemble it describes.</p><p>If we translate this into linguistic terms, we find that by taking empirical measurements of word cooccurrence, we are also constructing the Hessian of a linguistic partition function that describes how changes to the meaning of one word affect the meaning of another. The columns of that matrix are word vectors. When two columns are similar, small modifications to the meanings of the associated words have similar effects on the language as a whole. That is what it means, in the context of this model, for two words to be similar. Line integrals through the Hessian field in a given neighborhood can also be approximated by adding and subtracting these vectors, giving a more precise interpretation to the formulas used to represent analogies. Analogies are valid when they correspond to two different line integrals through a conservative Hessian tensor field, beginning at the same point and ending close to the same point, and therefore having similar final values.</p></div>
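The identity behind the Hessian-as-covariance claim (equations 9 and 10) can be checked numerically on a toy ensemble. The microstates below are invented, and β is fixed at 1 so constant factors drop out; the check is simply that a finite-difference estimate of the mixed second partial derivative of ln Z matches a directly computed covariance of particle counts.

```python
import math

# Numerical check, with invented microstates, that the mixed second
# partial derivative of ln Z with respect to two chemical potentials
# recovers the covariance of the corresponding particle counts.
beta = 1.0
# Each microstate: (counts for species 0 and 1, energy E_i)
microstates = [((0, 0), 0.0), ((1, 0), 1.0), ((0, 1), 1.5), ((2, 1), 2.5)]

def boltzmann(mu, counts, energy):
    return math.exp(beta * (sum(m * n for m, n in zip(mu, counts)) - energy))

def log_Z(mu):
    return math.log(sum(boltzmann(mu, c, e) for c, e in microstates))

def covariance(mu):
    # Direct computation of <N_0 N_1> - <N_0><N_1>
    Z = math.exp(log_Z(mu))
    probs = [boltzmann(mu, c, e) / Z for c, e in microstates]
    n0 = sum(p * c[0] for p, (c, _) in zip(probs, microstates))
    n1 = sum(p * c[1] for p, (c, _) in zip(probs, microstates))
    n01 = sum(p * c[0] * c[1] for p, (c, _) in zip(probs, microstates))
    return n01 - n0 * n1

def mixed_partial(mu, h=1e-4):
    # Central finite difference for d^2 ln Z / (d mu_0 d mu_1)
    return (log_Z([mu[0] + h, mu[1] + h]) - log_Z([mu[0] + h, mu[1] - h])
            - log_Z([mu[0] - h, mu[1] + h]) + log_Z([mu[0] - h, mu[1] - h])) / (4 * h * h)

mu = [0.5, 1.2]
assert abs(mixed_partial(mu) - covariance(mu)) < 1e-6
```

In the linguistic reading proposed here, the right-hand side is what a corpus gives us for free (cooccurrence covariance), and the left-hand side is what we are really after: how nudging one word's potential reshapes the partition function.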
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Implementation</head><p>Constructing a practical implementation of this model requires that we determine the values for two sets of parameters: the potential for each word in the vocabulary, and the energy level for each sentence. The simplest approach to this problem is to set all potential terms to zero, and all energy terms to one. The covariance matrix that results from these choices is identical to the one given by directly counting word cooccurrences. Alternative schemes will change the weights given to each of the sentences, yielding a modified covariance matrix that is likely to give better meaning representations. For performance reasons, some form of dimension reduction is also necessary, but has no theoretical significance at all. In practice, random projection (as in <ref type="bibr" target="#b24">[25]</ref>) works well, especially after implementing some of the preprocessing and hyperparameter selection recommendations in <ref type="bibr" target="#b15">[16]</ref>, which may be compensating for the deficiencies that result from setting the energy and potential terms to constants.</p><p>The problem of selecting energy and potential terms in a more principled way is left to other work. But it is worth considering briefly, since it illustrates some interesting properties of the model. First, in this model, the same sequence of words could appear twice with different energy levels, and therefore different probabilities, depending on context. Second, there may be a way to make predictions that link the semantic potential of given terms to known lexicographical properties of those terms, such as the approximate number of senses the word has. And finally, the partition function described here is not the only possible partition function that might be applied to language. 
Partition functions based on word pairs, sequences, or even attention mechanisms could be used to model language within this framework, all broadly interpretable in the same way.</p></div>
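The simplest implementation described above (all potential terms zero, all energy terms one, so every sentence is weighted equally) can be sketched end to end: count sentence-level cooccurrences, apply a log weighting, and reduce dimension. The tiny corpus is invented for illustration, and SVD stands in here for the random projection mentioned above, since both serve only as dimension reduction.

```python
import numpy as np

# Minimal sketch of the constant-energy, zero-potential implementation:
# cooccurrence counting, log weighting, then dimension reduction.
# Corpus and window scheme are invented for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "rainfall was recorded in dubai",
]
vocab = sorted({w for line in corpus for w in line.split()})
index = {w: i for i, w in enumerate(vocab)}

# Sentence-level cooccurrence counts; weighting every sentence equally
# corresponds to setting all energy terms to a constant.
C = np.zeros((len(vocab), len(vocab)))
for line in corpus:
    words = line.split()
    for a in words:
        for b in words:
            if a != b:
                C[index[a], index[b]] += 1

# Log weighting, then truncated SVD for dimension reduction.
M = np.log1p(C)
U, S, _ = np.linalg.svd(M)
dim = 6
vectors = U[:, :dim] * S[:dim]

def similarity(w1, w2):
    v1, v2 = vectors[index[w1]], vectors[index[w2]]
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Words that occur in similar contexts end up with similar vectors.
assert similarity("cat", "dog") > similarity("cat", "dubai")
```

Even at this scale, the columns of the weighted cooccurrence matrix behave as the model predicts: "cat" and "dog", whose contexts overlap, land closer together than "cat" and "dubai", whose contexts are disjoint.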
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Discussion</head><p>Few of the ideas presented here are new. The fact that word vectors contain distributional information that allows them to measure word similarity has been known for decades. Ideas from statistical mechanics have been applied to language modeling, machine learning, and information retrieval problems for decades. And for the last few years, there has been a steady stream of work demonstrating that language model X reduces to language model Y in some limit. But none of this has shown how these models could capture information about interpretation or meaning. Trained on linguistic form alone, these models have no evidence showing how linguistic forms map to mental models, concepts, narratives, or any other representations of things outside of language.</p><p>The claim that statistical models can infer things about meaning from linguistic form alone thus faces a high burden of proof. And while there has been a proliferation of models that do appear to support that claim, they all work on slightly different principles, and produce slightly different results. This undercuts attempts at meta-induction; many small bodies of evidence based on different principles of operation do not add up to one large body of evidence. And so justified skepticism remains.</p><p>What is new about the model proposed here is that it is general enough to explain the success of many of these models without reference to the details of their operation. Fundamentally, any model that is able to predict linguistic cooccurrences can be reinterpreted as an implicit partition function along the lines proposed here. So reinterpreted, we can argue that distributional information about language is linked by a precise mathematical structure to specific facts about how words signify. Those facts are limited; they do not include any information about what words, sentences, or longer fragments of language talk about. 
But they do include information about how many interpretations might be applied to those units of language, and how those interpretations correlate with one another at a macroscopic level.</p><p>What unites all of these models, under this theory, is that they effectively assume that meaning, quantified appropriately, is conserved, and that units of language-be they letters, words, n-grams, or longer phrases-are also conserved. It's not yet clear what these assumptions might mean in linguistic terms, but they are crucial to the derivation of a partition function that can relate the statistics of linguistic form to an unknown reservoir of meaning.</p><p>These models must also make a third assumption: language exists in a state of equilibrium with its reservoir of meaning. That assumption is unlikely to hold in general. If this way of thinking about language modeling is sound, then an important project will be to understand when the assumption of equilibrium is justified, and when it is not. It's likely that during periods of rapid linguistic change, for example, the equilibrium assumption will not be valid. In that case, methods that can model far-from-equilibrium systems will be required. Since non-equilibrium thermodynamics is a field still in its infancy <ref type="bibr" target="#b5">[6]</ref>, there will be much work to be done, and many tasks that remain impossible without domain expertise. Nonetheless, a deeper understanding of the meaning of these assumptions promises to clarify when and how language models can infer meaning from linguistic form alone.</p></div>		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Evaluating the Stability of Embedding-based Word Similarities</title>
		<author>
			<persName><forename type="first">M</forename><surname>Antoniak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mimno</surname></persName>
		</author>
		<ptr target="https://transacl.org/ojs/index.php/tacl/article/view/1202" />
	</analytic>
	<monogr>
		<title level="j">TACL</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="107" to="119" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A Latent Variable Model Approach to PMI-based Word Embeddings</title>
		<author>
			<persName><forename type="first">S</forename><surname>Arora</surname></persName>
		</author>
		<idno type="DOI">10.1162/tacl_a_00106</idno>
		<ptr target="https://www.aclweb.org/anthology/Q16-1028" />
	</analytic>
	<monogr>
		<title level="j">TACL</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="385" to="399" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Algorithmic thermodynamics</title>
		<author>
			<persName><forename type="first">J</forename><surname>Baez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Stay</surname></persName>
		</author>
		<idno type="DOI">10.1017/S0960129511000521</idno>
	</analytic>
	<monogr>
		<title level="j">Mathematical Structures in Computer Science</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="771" to="787" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Bender</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Koller</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/2020.acl-main.463" />
	</analytic>
	<monogr>
		<title level="m">Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2020-07">July 2020</date>
			<biblScope unit="page" from="5185" to="5198" />
		</imprint>
	</monogr>
	<note>ACL 2020</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Language Models are Few-Shot Learners</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<idno>CoRR abs/2005.14165</idno>
		<ptr target="https://arxiv.org/abs/2005.14165" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Dissipative adaptation in driven self-assembly</title>
		<author>
			<persName><forename type="first">J</forename><surname>England</surname></persName>
		</author>
		<idno type="DOI">10.1038/nnano.2015.250</idno>
	</analytic>
	<monogr>
		<title level="j">Nature Nanotechnology</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="919" to="923" />
			<date type="published" when="2015-11">Nov. 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A synopsis of linguistic theory 1930-55</title>
		<author>
			<persName><forename type="first">J</forename><surname>Firth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Studies in linguistic analysis</title>
		<meeting><address><addrLine>Oxford</addrLine></address></meeting>
		<imprint>
			<publisher>The Philological Society</publisher>
			<date type="published" when="1957">1957</date>
			<biblScope unit="page" from="1" to="32" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Vector Semantics, William Empson, and the Study of Ambiguity</title>
		<author>
			<persName><forename type="first">M</forename><surname>Gavin</surname></persName>
		</author>
		<idno type="DOI">10.1086/698174</idno>
	</analytic>
	<monogr>
		<title level="j">Critical Inquiry</title>
		<imprint>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="641" to="673" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn&apos;t</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gladkova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Drozd</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Matsuoka</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/n16-2002</idno>
		<ptr target="https://doi.org/10.18653/v1/n16-2002" />
	</analytic>
	<monogr>
		<title level="m">SRW@HLT-NAACL 2016</title>
		<meeting><address><addrLine>San Diego, California, USA</addrLine></address></meeting>
		<imprint>
			<publisher>The Association for Computational Linguistics</publisher>
			<date type="published" when="2016">June 12-17, 2016</date>
			<biblScope unit="page" from="8" to="15" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">L</forename><surname>Hamilton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Leskovec</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/P16-1141</idno>
		<ptr target="https://www.aclweb.org/anthology/P16-1141" />
	</analytic>
	<monogr>
		<title level="m">Association for Computational Linguistics</title>
				<meeting><address><addrLine>Berlin, Germany</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016-08">Aug. 2016</date>
			<biblScope unit="page" from="1489" to="1501" />
		</imprint>
	</monogr>
	<note>ACL 2016</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Distributional structure</title>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">S</forename><surname>Harris</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Word</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">2-3</biblScope>
			<biblScope unit="page" from="146" to="162" />
			<date type="published" when="1954">1954</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Word Vectors in the Eighteenth Century</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Heuser</surname></persName>
		</author>
		<ptr target="https://dh2017.adho.org/abstracts/582/582.pdf" />
	</analytic>
	<monogr>
		<title level="m">Digital Humanities 2017: Conference Abstracts</title>
		<editor>
			<persName><forename type="first">R</forename><surname>Lewis</surname></persName>
		</editor>
		<meeting><address><addrLine>Montréal, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>Alliance of Digital Humanities Organizations (ADHO)</publisher>
			<date type="published" when="2017">August 8-11, 2017</date>
			<biblScope unit="page" from="256" to="259" />
		</imprint>
	</monogr>
	<note>DH 2017</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">What Does BERT Learn about the Structure of Language?</title>
		<author>
			<persName><forename type="first">G</forename><surname>Jawahar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Sagot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Seddah</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/P19-1356</idno>
		<ptr target="https://www.aclweb.org/anthology/P19-1356" />
	</analytic>
	<monogr>
		<title level="m">Association for Computational Linguistics</title>
				<meeting><address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019-07">July 2019</date>
			<biblScope unit="page" from="3651" to="3657" />
		</imprint>
	</monogr>
	<note>ACL 2019</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Semantic information, autonomous agency and nonequilibrium statistical physics</title>
		<author>
			<persName><forename type="first">A</forename><surname>Kolchinsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">H</forename><surname>Wolpert</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Interface Focus</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page">20180041</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Neural Word Embedding as Implicit Matrix Factorization</title>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Goldberg</surname></persName>
		</author>
		<ptr target="http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization" />
	</analytic>
	<monogr>
		<title level="m">NIPS 2014</title>
				<meeting><address><addrLine>Montreal, Quebec, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">December 8-13, 2014</date>
			<biblScope unit="page" from="2177" to="2185" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Improving Distributional Similarity with Lessons Learned from Word Embeddings</title>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Goldberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Dagan</surname></persName>
		</author>
		<idno type="DOI">10.1162/tacl_a_00134</idno>
		<ptr target="https://www.aclweb.org/anthology/Q15-1016" />
	</analytic>
	<monogr>
		<title level="j">TACL</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="211" to="225" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Semantic algorithms</title>
		<author>
			<persName><forename type="first">M</forename><surname>Masterman</surname></persName>
		</author>
		<idno type="DOI">10.1017/CBO9780511486609.012</idno>
	</analytic>
	<monogr>
		<title level="m">Language, Cohesion and Form</title>
		<title level="s">Studies in Natural Language Processing</title>
				<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="253" to="280" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">XI.-Words</title>
		<author>
			<persName><forename type="first">M</forename><surname>Masterman</surname></persName>
		</author>
		<author>
			<persName><surname>Braithwaite</surname></persName>
		</author>
		<idno type="DOI">10.1093/aristotelian/54.1.209</idno>
		<ptr target="https://academic.oup.com/aristotelian/article-pdf/54/1/209/5256573/aristotelian54-0209.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Aristotelian Society</title>
				<meeting>the Aristotelian Society</meeting>
		<imprint>
			<date type="published" when="1954">1954</date>
			<biblScope unit="volume">54</biblScope>
			<biblScope unit="page" from="209" to="232" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mccoy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Pavlick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Linzen</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/P19-1334</idno>
		<ptr target="https://www.aclweb.org/anthology/P19-1334" />
	</analytic>
	<monogr>
		<title level="m">Association for Computational Linguistics</title>
				<meeting><address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019-07">July 2019</date>
			<biblScope unit="page" from="3428" to="3448" />
		</imprint>
	</monogr>
	<note>ACL 2019</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Efficient Estimation of Word Representations in Vector Space</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<ptr target="http://arxiv.org/abs/1301.3781" />
	</analytic>
	<monogr>
		<title level="m">Workshop Track Proceedings</title>
				<editor>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Lecun</surname></persName>
		</editor>
		<meeting><address><addrLine>Scottsdale, Arizona, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">May 2-4, 2013</date>
		</imprint>
	</monogr>
	<note>ICLR 2013</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">The strange geometry of skip-gram with negative sampling</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Mimno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Thompson</surname></persName>
		</author>
		<ptr target="https://aclanthology.info/papers/D17-1308/d17-1308" />
	</analytic>
	<monogr>
		<title level="m">EMNLP 2017</title>
				<meeting><address><addrLine>Copenhagen, Denmark</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">September 9-11, 2017</date>
			<biblScope unit="page" from="2873" to="2878" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Probing Neural Network Comprehension of Natural Language Arguments</title>
		<author>
			<persName><forename type="first">T</forename><surname>Niven</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-Y</forename><surname>Kao</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/P19-1459</idno>
		<ptr target="https://www.aclweb.org/anthology/P19-1459" />
	</analytic>
	<monogr>
		<title level="m">Association for Computational Linguistics</title>
				<meeting><address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019-07">July 2019</date>
			<biblScope unit="page" from="4658" to="4664" />
		</imprint>
	</monogr>
	<note>ACL 2019</note>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Glove: Global Vectors for Word Representation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<ptr target="http://aclweb.org/anthology/D/D14/D14-1162.pdf" />
	</analytic>
	<monogr>
		<title level="m">A meeting of SIGDAT, a Special Interest Group of the ACL</title>
				<meeting><address><addrLine>Doha, Qatar</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">October 25-29, 2014</date>
			<biblScope unit="page" from="1532" to="1543" />
		</imprint>
	</monogr>
	<note>EMNLP 2014</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Deep Contextualized Word Representations</title>
		<author>
			<persName><forename type="first">M</forename><surname>Peters</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N18-1202</idno>
		<ptr target="https://www.aclweb.org/anthology/N18-1202" />
	</analytic>
	<monogr>
		<title level="m">Association for Computational Linguistics</title>
				<meeting><address><addrLine>New Orleans, Louisiana</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-06">June 2018</date>
			<biblScope unit="page" from="2227" to="2237" />
		</imprint>
	</monogr>
	<note>HLT-NAACL 2018</note>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Stable Random Projection: Lightweight, General-Purpose Dimensionality Reduction for Digitized Libraries</title>
		<author>
			<persName><forename type="first">B</forename><surname>Schmidt</surname></persName>
		</author>
		<idno type="DOI">10.22148/16.025</idno>
		<ptr target="https://culturalanalytics.org/article/11033" />
	</analytic>
	<monogr>
		<title level="j">Journal of Cultural Analytics</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Minds, Brains, and Programs</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Searle</surname></persName>
		</author>
		<idno type="DOI">10.1017/s0140525x00005756</idno>
	</analytic>
	<monogr>
		<title level="j">Behavioral and Brain Sciences</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="417" to="457" />
			<date type="published" when="1980">1980</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Modeling Documents with Deep Boltzmann Machines</title>
		<author>
			<persName><forename type="first">N</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Hinton</surname></persName>
		</author>
		<ptr target="https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&amp;smnu=2&amp;article_id=2423&amp;proceeding_id=29" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI 2013</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Nicholson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Smyth</surname></persName>
		</editor>
		<meeting>the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI 2013<address><addrLine>Bellevue, WA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>AUAI Press</publisher>
			<date type="published" when="2013">August 11-15, 2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Statistical mechanics of letters in words</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">J</forename><surname>Stephens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Bialek</surname></persName>
		</author>
		<idno type="DOI">10.1103/physreve.81.066119</idno>
		<ptr target="http://dx.doi.org/10.1103/PhysRevE.81.066119" />
	</analytic>
	<monogr>
		<title level="j">Physical Review E</title>
		<idno type="ISSN">1550-2376</idno>
		<imprint>
			<biblScope unit="volume">81</biblScope>
			<biblScope unit="issue">6</biblScope>
			<date type="published" when="2010-06">June 2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<title level="m" type="main">Attention Is All You Need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<idno>CoRR abs/1706.03762</idno>
		<ptr target="http://arxiv.org/abs/1706.03762" />
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Translation</title>
		<author>
			<persName><forename type="first">W</forename><surname>Weaver</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine translation of languages: fourteen essays</title>
		<imprint>
			<publisher>MIT and Wiley</publisher>
			<date type="published" when="1955">1955</date>
			<biblScope unit="page" from="15" to="23" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
