Hyperdimensional Utterance Spaces
A more transparent representation for situated language understanding

Jussi Karlgren, Gavagai & KTH Royal Institute of Technology, Stockholm, Sweden, jussi@kth.se
Pentti Kanerva, Redwood Center for Theoretical Neuroscience, UC Berkeley, pkanerva@csli.stanford.edu

ABSTRACT
Human language has a large and varying number of features, both lexical items and constructions, which interact to represent various aspects of communicative information. High-dimensional semantic spaces have proven useful and effective for aggregating and processing lexical information for many language processing tasks. This paper describes a hyperdimensional processing model for language data, a straightforward extension of models previously used for words to the handling of utterance- or text-level information. A hyperdimensional model is able to represent a broad range of linguistic and extra-linguistic features in a common integral framework which is suitable as a bridge between symbolic and continuous representations, as an encoding scheme for symbolic information, and as a basis for feature space exploration. This paper provides an overview of the framework and an example of how it is used in a pilot experiment.

CCS CONCEPTS
• Information systems → Content analysis and feature selection; • Computing methodologies → Knowledge representation and reasoning; Natural language processing;

KEYWORDS
hyperdimensional computing, utterance-level semantics, constructional grammar, knowledge representation and reasoning

1 REPRESENTING LANGUAGE
Language is a general-purpose representation of human knowledge, and models to process it vary in the degree to which they are bound to some task or specific usage. Currently, the trend is to learn regularities and representations with as little explicit knowledge-based linguistic processing as possible, and recent advances in such general models for end-to-end learning to address linguistic tasks have been quite successful. Most of those approaches make little use of information beyond the occurrence or co-occurrence of words in the linguistic signal and take the single word to be the atomic unit.

The model proposed here for representing linguistic information will allow the handy and transparent representation of linguistic items beyond the single word, and of extra-linguistic data to enrich the linguistic signal, in a computationally convenient representation similar to what is currently the standard processing model for word-level information.

1.1 Requirements for a representation
There are some basic qualities we want a representation to hold to. A representation should have descriptive and explanatory power, be practical and convenient for further application, be reasonably true to human performance, provide defaults to smooth over situations where a language processing component lacks knowledge or data, and provide constraints where the decision space is too broad.

Neurophysiological plausibility: We want the model to be non-compiling, i.e. not to need a separate step to accommodate a new batch of data. We want it to exhibit bounded growth, not to grow too rapidly with new data (but not necessarily to be built to accommodate implausibly large amounts of data).

Behavioural adequacy: We want the model to be incremental, i.e. to improve its performance (however we choose to measure and evaluate performance) progressively with incoming data. While we want our model to rely on the surface form of the input, we do not accept the necessity of limiting the input analysis to white-space-based tokenisation: a more sophisticated model based on the identification of patterns or constructions in the input is as plausible as a naive one. We want our representation to allow for the explicit inclusion of analysis results beyond the word-by-word sequences typically used as input to today's models.
Computational habitability: We want the model to be evaluable and transparent, and computationally manageable in the face of the large and growing amounts of input data it is exposed to. We do not want it to make assumptions of a finite inventory of lexical items or expressions.

Explicit representation of features: We want the model to allow exploration by the explicit inclusion of features of potential interest, without requiring expensive recalculation and reconfiguration of the model.

Context and anchoring: We want the model to allow the inclusion of extra-linguistic data and annotations. Linguistic data is now available in new configurations, collected from situations which allow the explicit capture of location, time, participants, and other sensory data such as biometric data, meteorological data, and the social context of the author or speaker. These data are potentially of great interest, e.g. to resolve ambiguities or to understand anaphoric and deictic reference, and should not be represented separately from the linguistic signal.

[Footnote: Jussi Karlgren's work was done as a visiting scholar at the Department of Linguistics at Stanford University, supported by a generous Vinnmer Marie Curie grant from Vinnova, the Swedish Governmental Agency for Innovation Systems.]
[© 2018 Copyright held by the author(s). DESIRES 2018, August 2018, Bertinoro, Italy.]

1.2 Theoretical basis
We base our work on a vector space model of distributional semantics for linguistic data, incorporating constructional linguistic items together with, and similarly to the way we incorporate, lexical elements.

Distributional semantics: Distributional semantics is based on well-established philosophical and linguistic principles, most clearly formulated by Zellig Harris [3]. Distributional semantic models aggregate observations of items in linguistic data and infer semantic similarity between linguistic items based on the similarity of their observed distributions. The idea is that if linguistic items — such as the words herring and cheese — tend to occur in the same contexts — say, in the vicinity of the word smörgåsbord — then we can assume that they have related meanings. This is known as the distributional hypothesis [13]. Distributional methods have gained tremendous interest in the past decades, due to the proliferation of large text streams and new data-oriented computational learning paradigms which are able to process large amounts of data. So far distributional methods have mostly been used for lexical tasks, and include fairly little more sophisticated processing as input. This is, to a great extent, a consequence of the simple and attractively transparent representations used. This paper proposes a model to accommodate both simple and more complex linguistic items with the same representation.
Vector space models for meaning: Vector space models are frequently used in information access, both for research experiments and as a building block for systems in practical use, at least since the early 1970s [15]. Vector space models have attractive qualities: processing vector spaces is a manageable implementational framework, they are mathematically well-defined and understood, and they are intuitively appealing, conforming to everyday metaphors such as "near in meaning" [17]. The vector space model for meaning is the basis for nearly all information retrieval experimentation and implementation and most machine learning experiments, and is now the standard approach in most categorisation schemes, topic models, deep learning models, and other similar approaches, including the present model.

Construction Grammar: The construction grammar framework is characterised by the central claim that linguistic information is encoded similarly or even identically by lexical items—the words—and their configurations—the syntax, both being linguistic items with equal salience and presence in the linguistic signal. The parsimonious character of construction grammar in its most radical formulations [1, e.g.] is attractive as a framework for integrating a dynamic and learning view of language use with a formal expression of language structure: it allows the representation of words together with constructions in a common framework. For our purposes construction grammar gives a theoretical foundation to a consolidated representation of both individual items in utterances and their configuration.

2 HYPERDIMENSIONAL COMPUTING
We present here the general framework for hyperdimensional computing, which has been used i.a. for modelling lexical meaning in language, and outline an extension of it to handle more constructional linguistic items.

The style of computing discussed here was first described by Plate [10, 11] and called Holographic Reduced Representation. The idea is to compute with high-dimensional vectors, or hypervectors [5], using operations that do not modify vector dimensionality during the course of operation and use. We use 2,000-dimensional vectors in these demonstrations and experiments.

Information encoded into a hypervector is distributed over all vector elements, hence "holographic." Computing begins by assigning random seed vectors to basic objects. In working with text, for example, each word in the vocabulary can be represented by a seed vector, also called the word's index vector or random label. These seed vectors remain unchanged throughout computations. We may use two kinds of seed vectors consisting of 0s, 1s, and −1s: sparse and dense. The elements of sparse vectors are mostly 0s; the dense vectors have no 0s. In both sparse and dense vectors, 1s and −1s are equally probable (our sparse seed vectors have 10 of each), and thus the vectors have mean 0.

Representations of more complex objects are computed from the seed vectors with three operations. Two correspond to addition and multiplication of scalar numbers. Addition is ordinary vector addition, possibly weighted and normalized. Multiplication is performed elementwise, also known as the Hadamard product. The third basic operation is permutation, which reorders (scrambles) vector coordinates. The number of possible permutations is enormous.

One further operation on vectors measures their similarity. We use the cosine, with values between −1 and 1. A vector is maximally similar to itself and yields a cosine of 1. A cosine of 0 means that the two vectors are orthogonal and appear to have no information in common. A system for computing with hypervectors will also need to include a memory for such vectors.
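For illustration only — this is a minimal NumPy sketch under our own naming, not the implementation behind the experiments reported here — the primitives above can be exercised directly at the 2,000-dimensional setting used in this paper:

    import numpy as np

    D = 2000                     # dimensionality used in the demonstrations in this paper
    rng = np.random.default_rng(0)

    def seed_vector(d=D):
        """Dense seed hypervector: equally probable +1s and -1s, mean 0."""
        return rng.choice([-1, 1], size=d)

    def cosine(u, v):
        """Similarity measure: cosine between two hypervectors."""
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Addition produces a vector that is similar to its inputs.
    A, B, C = seed_vector(), seed_vector(), seed_vector()
    S = A + B + C
    print(cosine(S, A))          # clearly above 0 (roughly 0.58 in expectation)

    # Elementwise multiplication (Hadamard product) produces a dissimilar vector ...
    print(cosine(A * B, A))      # close to 0
    # ... but preserves similarity: Q*A and Q*B are as far apart as A and B.
    Q = seed_vector()
    print(cosine(Q * A, Q * B), cosine(A, B))      # the two values agree

    # Permutation (a fixed random reordering of coordinates) also yields a
    # dissimilar vector and is invertible.
    perm = rng.permutation(D)
    inv_perm = np.argsort(perm)
    print(cosine(A[perm], A))                      # close to 0
    print(np.array_equal(A[perm][inv_perm], A))    # True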
2.1 Computing with Hypervectors
The following is a somewhat more formal overview of the properties of hypervectors used in this paper. They are readily seen in dense (seed) vectors A, B, C, ... of equally probable 1s and −1s.

Distribution (dot product, cosine): Two vectors taken at random are dissimilar; they are approximately orthogonal—quasiorthogonal, with cosine close to 0. The number of quasiorthogonal vectors grows exponentially with dimensionality.

Addition (+) of vectors produces a vector that is similar to the inputs, e.g.,

    A + B + C ∼ A

A sum of vectors becomes increasingly similar to the vectors it is a sum of if they are repeatedly added into it, and less similar with the number of dissimilar or unrelated vectors added into it. This provides a convenient way of collecting observational data for e.g. distributional semantics.

Multiplication (*) of vectors produces a vector that is dissimilar to the inputs, e.g.,

    A * B ≁ A

However, multiplication preserves similarity: the distance between Q * A and Q * B equals the distance between A and B. A vector of ±1s multiplied by itself produces a vector of 1s,

    A * A = 1

which means that the vector is its own inverse. That makes multiplication convenient for variable binding: variable x bound to value a—i.e., {x = a}—can be encoded by X * A, and the value can be recovered from the bound pair by multiplication:

    X * (X * A) = (X * X) * A = 1 * A = A

Multiplication distributes over addition, just as in ordinary arithmetic, because vectors are added and multiplied elementwise:

    Q * (A + B + C + ...) = (Q * A) + (Q * B) + (Q * C) + ...

As a consequence, the value of x can be recovered approximately from the sum of several variable–value pairs, such as {x = a, y = b, z = c}:

    X * ((X * A) + (Y * B) + (Z * C))
      = (X * (X * A)) + (X * (Y * B)) + (X * (Z * C))
      = A + (X * Y * B) + (X * Z * C)
      = A + noise
      ∼ A

Elementwise multiplication is useful with dense vectors but not with sparse vectors, because the product of sparse vectors is usually a vector of 0s, and because a sparse vector has no inverse.

Random permutations (Π_i) resemble multiplication: they produce a vector that is dissimilar to the input, they preserve similarity, are invertible, and distribute over addition; they also distribute over multiplication. Permutations provide a means to represent sequences and nested structure. For example, the sequence (a, b, c) can be encoded as a sum

    S3 = Π(Π(A)) + Π(B) + C = Π²(A) + Π(B) + C

or as a product

    P3 = Π²(A) * Π(B) * C

and extended to include d by S4 = Π(S3) + D or by P4 = Π(P3) * D. The inverse permutation Π⁻¹ can then be used to find out, for example, the second vector in S3 or what comes after A and before C in P3. If the pair (a, b) is encoded with two unrelated permutations Π1 and Π2 as Π1(A) + Π2(B), then the nested structure ((a, b), (c, d)) can be represented by

    Π1(Π1(A) + Π2(B)) + Π2(Π1(C) + Π2(D)) = Π11(A) + Π12(B) + Π21(C) + Π22(D)

where Πij is the permutation ΠiΠj.

The power of computing with numbers follows from the fact that addition and multiplication form an algebraic structure called a field. We can expect computing with (dense) hypervectors to be equally powerful because addition and multiplication approximate a field and are complemented by permutations that interact in useful ways with addition and multiplication.
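The binding, unbinding, and sequence-encoding identities above can be checked in the same style of sketch (again an illustration under our own naming, not production code):

    import numpy as np

    D = 2000
    rng = np.random.default_rng(1)
    hv = lambda: rng.choice([-1, 1], size=D)     # dense ±1 seed vector
    cos = lambda u, v: float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Variables x, y, z and values a, b, c as random hypervectors.
    X, Y, Z = hv(), hv(), hv()
    A, B, C = hv(), hv(), hv()

    # Binding by elementwise multiplication; the record {x=a, y=b, z=c} as a sum of bound pairs.
    record = X * A + Y * B + Z * C

    # Unbinding: multiply by X again (X is its own inverse) and probe with candidate values.
    probe = X * record                            # = A + noise
    print(cos(probe, A), cos(probe, B))           # high vs. near zero

    # Sequence (a, b, c) encoded with powers of one permutation, S3 = Π²(A) + Π(B) + C.
    perm = rng.permutation(D)
    P  = lambda v: v[perm]                        # Π
    Pi = lambda v: v[np.argsort(perm)]            # Π⁻¹
    S3 = P(P(A)) + P(B) + C
    # The second element is recovered by undoing one permutation and probing.
    print(cos(Pi(S3), B))                         # clearly above the noise level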
3 HYPERDIMENSIONAL COMPUTING APPLIED TO LINGUISTIC DATA: RANDOM INDEXING
The operations presented above have been used in the Random Indexing approach to implement distributional semantics, for language identification and to represent the meaning of words in a word space model.

3.1 Language Identification
In a recent experiment, high-dimensional vectors were used to represent properties of an entire text in a single text vector in order to identify the language it was written in [4]. The text vector of an unknown text sample was compared for similarity to precomputed language vectors assembled from processing known language samples. Character counts are known to be a fairly good indicator of language [9]. This model used frequencies of character sequences of length n—n-grams—observed in the text: as an example, the text "a book" gives rise to the trigrams "a b", " bo", "boo", and "ook". For an arbitrary alphabet of L letters, there would be (L + 1)^n n-grams to keep track of; in the case of English, which uses an alphabet of 26 letters (plus space), this means keeping track of 27³ = 19,683 different trigram frequencies. These numbers grow quickly as the window size n increases.

In this experiment, each character was given a randomly generated index vector, and each window was represented by a (componentwise) product of its letter vectors permuted according to their place in the window. For example, the trigram "boo" (from "book") was encoded into a window vector as V_w = Π²(B) * Π(O) * O. The text vector is then formed by adding together the vectors for all the windows in the text:

    text vector = Σ_{w ∈ text} V_w

The text vector for a text is then compared to previously and similarly computed language vectors, using the cosine as a similarity measure. The language whose language vector shows the highest cosine to the text vector is assumed to be the language the text sample is written in. In the experiment, tri- and tetragrams yielded a language identification accuracy (using the EUROPARL corpus) of more than 97%.
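For illustration, a toy version of this scheme can be written directly from the operations above. This is our own simplification, not the code or data of [4]; the sample texts are invented and far too short for reliable identification, so the result should be read as a demonstration of the encoding only.

    import numpy as np
    from collections import defaultdict

    D = 2000
    rng = np.random.default_rng(2)
    perm = rng.permutation(D)

    char_vec = defaultdict(lambda: rng.choice([-1, 1], size=D))  # one index vector per character

    def rotate(v, k):
        """Apply the position permutation k times (Π^k)."""
        for _ in range(k):
            v = v[perm]
        return v

    def text_vector(text, n=3):
        """Sum of window vectors; each window is the product of its permuted letter
        vectors, e.g. the trigram 'boo' becomes Π²(B) * Π(O) * O."""
        tv = np.zeros(D)
        for i in range(len(text) - n + 1):
            window = text[i:i + n]
            wv = np.ones(D, dtype=int)
            for pos, ch in enumerate(window):
                wv = wv * rotate(char_vec[ch], n - 1 - pos)
            tv += wv
        return tv

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    # Toy language vectors from known samples; an unknown sample is assigned the
    # language whose vector has the highest cosine to its text vector.
    languages = {"en": text_vector("the quick brown fox jumps over the lazy dog"),
                 "sv": text_vector("flygande bäckasiner söka hwila på mjuka tuvor")}
    unknown = text_vector("the dog chews a bone")
    print(max(languages, key=lambda L: cosine(languages[L], unknown)))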
3.2 Lexical Semantic Space
To build a word space model to represent distributional data of words, a text sample is scanned one word at a time. Each word in turn is the focus word of the scan, with a context window of k preceding and succeeding words. When a word appears for the first time, it is assigned a randomly generated sparse index vector and an initially empty context vector of equal dimensionality. The index vectors of the words in the context window are added into the focus word's context vector. The resulting context vectors capture "bag-of-words" semantics of the focus word, reflecting similarity of use. If the text sample is large enough, the similarity measure captures meaningful intuitions of term similarity.

The most important parameters for these models are word weighting, to assess how notable an observation of some word is, and the size of the context, which ranges from entire "documents", as in Latent Semantic Analysis [6, 7], to local contexts of the 2 to 3 closest words. A broader context captures topical or associative similarity of words, whereas a more local context captures paradigmatic or replaceability relations such as synonymy or antonymy and syntagmatic combinability relations such as attributive or predicative relations. A window size of 2 or 3 words before and after a focus word has been found to yield good results as measured by synonym tests and other similar lexical semantic tasks [12], and various methods for weighting the observed occurrences have been tested to achieve on-line learning without becoming too sensitive to infrequent anomalous occurrences.

A step toward capturing structure has been taken by treating the words before the focus word differently from the words after. This is done by permuting their labels differently before adding them into the context vector. This use of permutation includes information about the preceding context and the succeeding context separately, but allows for both to be used as an aggregate similarity measure for a word [14]. Several current neural network models use similar insights to represent sequential information.
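A minimal random-indexing sketch in this spirit is given below. It is our own illustration, not the word-space implementation referred to above: the sparse ternary index vectors follow the description in Section 2, and a simple one-step rotation stands in for the order-encoding permutations of [14].

    import numpy as np
    from collections import defaultdict

    D, NONZERO, K = 2000, 20, 2            # dimensionality, nonzeros per index vector, window size
    rng = np.random.default_rng(3)

    def sparse_index_vector():
        """Sparse ternary seed vector: 10 ones and 10 minus-ones, the rest zeros."""
        v = np.zeros(D, dtype=int)
        pos = rng.choice(D, size=NONZERO, replace=False)
        v[pos[:NONZERO // 2]] = 1
        v[pos[NONZERO // 2:]] = -1
        return v

    index = defaultdict(sparse_index_vector)     # index vector assigned on first occurrence
    context = defaultdict(lambda: np.zeros(D))   # accumulated context vectors

    def train(tokens):
        """Add the (directionally permuted) index vectors of neighbours into each focus word."""
        for i, focus in enumerate(tokens):
            for j in range(max(0, i - K), min(len(tokens), i + K + 1)):
                if j == i:
                    continue
                shift = -1 if j < i else 1       # preceding vs. succeeding context
                context[focus] += np.roll(index[tokens[j]], shift)

    def similarity(w1, w2):
        u, v = context[w1], context[w2]
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    train("dogs chew bones and cats chew mice".split())
    print(similarity("dogs", "cats"))            # similar neighbourhoods give similar vectors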
3.3 Structure in linguistic data
The context vectors computed in random indexing reflect semantic similarity of words but do not easily allow for the inclusion of structural characteristics of an utterance, beyond what is observable from lexical statistics.

The operators described in Section 2 can be understood as a Vector Symbolic Architecture [2], and can be used to conveniently represent the many levels of abstraction in linguistic data. Suggestions to combine data in a tensor model [16, e.g.] are to some extent similar with respect to representational adequacy and power, but are much more laborious computationally. A feature of interest can be included in the representation by overlaying it using addition, and separated from others through permutation or multiplication, depending on how retrievable it is intended to be. A feature of interest can be included as a combination of other features, or independently of others.

Consider an utterance such as the one given in Example (1).

    (1) Dogs chew bones.

This can be represented—as in typical search engine applications—as a combination of the vectors v̄ representing the lexical items in play. Those representations can be random index vectors or context vectors or word embeddings obtained from previous analyses. The representation of the utterance can then most simply be achieved by adding the vectors for each participating lexical item together, as in Equation (1).

    v̄_dog + v̄_chew + v̄_bone    (1)

Of course, in most practical applications, some informationally relevant scalar weighting of the items might be motivated.

    w_dog · v̄_dog + w_chew · v̄_chew + w_bone · v̄_bone    (2)

This simple lexical representation can easily be extended to accommodate more elaborate information. Observing that the utterance in question is in the present tense, we can simply add that information in, by adding a randomly generated vector that represents that observable feature of the utterance. To collect tense information and keep it separate and separately retrievable from the lexical information, it can be permuted with a designated, also randomly generated, permutation for tense-related information. This information can be added in and later retrieved by similarity matching of vectors, using the Π_tense permutation as a filter.

    w_dog · v̄_dog + w_chew · v̄_chew + w_bone · v̄_bone + Π_tense v̄_present    (3)

If it interests us to keep track of qualities of referents, we might want to add information about the morphological state of the word in question or the character of some of those referents. In (4) we could e.g. note that the referent of the agent of the clause ("dogs") is animate and that this referent is in plural number. This information can be added in and later retrieved by similarity matching of vectors, using a dedicated permutation such as Π_agent and vectors such as v̄_animate and v̄_plural together with v̄_dog, the representation of dog. Extra-linguistic observations can be overlaid in the same way, such as the location where the utterance was recorded, permuted with a designated Π_location.

    w_dog · v̄_dog + w_chew · v̄_chew + w_bone · v̄_bone + Π_tense v̄_present
      + Π_agent (v̄_animate + v̄_plural + v̄_dog) + Π_location v̄_(41°24′5″N 2°9′23″E)    (4)

This principle enables the representation to harbour many types of information, regardless of how it has been generated or extracted from an utterance, all aggregated in the same hyperdimensional space to be used for downstream processing in ways that the representation does not need to take into account at perception time.
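Equations (3) and (4) can be realised directly with the operations of Section 2. The sketch below is ours, with unit weights and invented feature names; it overlays tense and agent-role information on the lexical sum and then uses the inverse of Π_tense as the filter mentioned above.

    import numpy as np

    D = 2000
    rng = np.random.default_rng(4)
    hv = lambda: rng.choice([-1, 1], size=D)
    cos = lambda u, v: float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Lexical vectors (random index vectors here; could be context vectors or embeddings).
    v = {w: hv() for w in ["dog", "chew", "bone", "present", "past", "animate", "plural"]}

    # Designated random permutations for tense and for the agent role.
    p_tense = rng.permutation(D)
    p_agent = rng.permutation(D)
    tense   = lambda x: x[p_tense]
    agent   = lambda x: x[p_agent]
    untense = lambda x: x[np.argsort(p_tense)]

    # "Dogs chew bones." with unit weights, present tense, and an animate plural agent.
    utterance = (v["dog"] + v["chew"] + v["bone"]
                 + tense(v["present"])
                 + agent(v["animate"] + v["plural"] + v["dog"]))

    # Retrieval: undo the tense permutation and compare with candidate tense vectors.
    print(cos(untense(utterance), v["present"]))   # clearly above noise
    print(cos(untense(utterance), v["past"]))      # near zero
    # The lexical content is still retrievable from the same vector.
    print(cos(utterance, v["chew"]))               # clearly above noise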
4 QUANTITATIVE CHARACTERISTICS
The general task for a knowledge representation is to provide features with which to represent observed characteristics of interest of some situation in the world, to aggregate such observable features of some situation of interest into a represented state, to allow further processes to verify whether an observation has been recorded in that state, and to decompose a state representation into the separate features which have composed it.

The advantage of using a holographic rather than a localist model is that, for a d-dimensional representation, where a one-hot model allows for d features with lossless aggregation and retrieval from a state, the variation space of the hyperdimensional approach affords, by virtue of the random patterns, permutations, and multiplications, a vastly larger feature palette. This makes it possible, as shown above in Section 3, for an entire vocabulary and its cooccurrence statistics to be handily accommodated in a 2,000-dimensional space.

The aggregation of features into a state is done simply by vector addition, occasionally using a permutation to separate aspects of a feature. Verifying whether a feature is present in a state vector is done most simply with a dot product or a cosine similarity measure. The choice of d, the dimensionality of the representation, determines the capacity of the space. As can be expected, a larger dimensionality allows greater capacity: a 100-dimensional space can store less information, i.e. fewer distinct features, for each state than a 2,000-dimensional one does. If we wish to aggregate N (near-)orthogonal features by addition into a state vector, their relative cosine to the resulting state vector will be √(1/N). The expected size of N determines how large d must be chosen to be to ensure that this cosine is at a safe margin from the noise threshold occasioned by the randomisation procedure. If a state vector is expected to hold on the order of 100 unweighted feature vectors, the resulting relative cosine between each feature vector and the state vector will be 0.1 on average. In a 1,000-dimensional space, this is about three to four times the noise threshold; in a 2,000-dimensional space, about five to six times the noise threshold. The graphs in Figure 1 show how the noise threshold compares across some typical dimensionality settings.

    [Figure 1: Cosine of feature vectors to state vector compared to random unrelated vectors in 100, 500, 1,000, and 2,000 dimensions]

The size of the representation does not grow with the number of feature vectors aggregated into the state vector, except in the density of the state vector. Neither does the number of potential features — the size of the lexicon and the combined size of all potentially interesting constructions — occasion more than linear growth for the system in its entirety.
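The √(1/N) relation and its margin over the noise level can be checked empirically with a few lines. This is a sanity check of ours, not the computation behind Figure 1.

    import numpy as np

    def feature_and_noise_cosine(d, n, trials=20, seed=5):
        """Average cosine of (i) a stored feature vector and (ii) an unrelated random
        vector to a state vector that is the unweighted sum of n random ±1 features."""
        rng = np.random.default_rng(seed)
        stored, noise = [], []
        for _ in range(trials):
            feats = rng.choice([-1, 1], size=(n, d))
            state = feats.sum(axis=0)
            other = rng.choice([-1, 1], size=d)
            norm = np.linalg.norm(state)
            stored.append(feats[0] @ state / (np.linalg.norm(feats[0]) * norm))
            noise.append(abs(other @ state) / (np.linalg.norm(other) * norm))
        return float(np.mean(stored)), float(np.mean(noise))

    print("expected stored-feature cosine for N=100:", (1 / 100) ** 0.5)
    for d in (100, 500, 1000, 2000):
        s, z = feature_and_noise_cosine(d, 100)
        print(f"d={d:5d}  stored≈{s:.3f}  unrelated≈{z:.3f}")
    # The stored-feature cosine stays near sqrt(1/N) = 0.1 regardless of d, while the
    # cosine of an unrelated vector shrinks roughly as 1/sqrt(d), so a larger d gives
    # a wider margin between signal and noise.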
5 EMPIRICAL VALIDATION
We intend to validate our approach by applying it to tasks which require understanding of linguistic content. Typical tasks we expect to address are more advanced language understanding tasks such as authorship and genre identification, author profiling, attitude and sentiment analysis, viewpoint analysis, and topic detection and tracking, as well as some more theory-internal tasks such as semantic role labeling. The tasks we are interested in, and where holographic representations are most obviously useful, are those where a broad range of features is necessary to perform well, and where performance is constrained by the challenge of incorporating information on several levels of abstraction simultaneously in an integrated processing model.

Current experimentation with this model is focussed on authorship profiling, which relies on features beyond the lexical, and on tracking social media posts on meteorological events, which is highly temporally determined and locational in nature.

A simple sentence-level task demonstrated here is that of question classification: assigning a category to a question which will impose a semantic constraint on the answer. Example categories are "Human, individual", "Number, date", "Location, city" [8]. This task forms a basis for many question answering systems, and a fair amount of effort has been put into optimising performance on it. The most important features for this task, as for many language tasks, are lexical: the words "How long" in the question "How long is the Coney Island boardwalk?" indicate that the expected answer is a number, such as a distance or a time period. Even fairly naive word frequency methods achieve around 50% accuracy, and with some multi-word terms and dependencies added, optimised systems yield at least 80% accuracy.

We do not attempt here to push the envelope for this data set — this task is today considered to be solved and superseded by more elaborate tasks in question answering. We use it to demonstrate how our representation allows the addition of more information in a fixed dimensionality without perturbing simpler models in the same representation.

We used a set of 5,500 previously labelled questions to train a model which incorporates all words in the questions together with some semantic roles those words enter into. We selected a small set of roles which we expect to be of interest for the purpose of inferring the target entity of the question: the question word itself (Who, How, What, etc.), the tense of the question's main verb, the subject of the main verb, any other clause adverbial, and the binary feature of negation or not. The labels were given on two levels of granularity: coarse-grained ("Human", "Location", "Number", "Description", "Entity", "Abbreviation") and fine-grained (47 subcategories of the coarse-grained categories).

In one condition, each label was given a state vector built from the words of every question it had been applied to; in another, semantic roles were added together with the words that participated in them, in each case permuted for the role in question; in a third condition both were added together. A test set of 500 questions was then used to evaluate the models, by building similar vectors for each question and selecting the label with the greatest cosine similarity to it as a candidate label.

This was done as follows. The state vector of a question is composed additively of the bag of words of surface tokens and, for each role of interest, the lemma form of the word permuted by a role-specific permutation. Each category is represented by a sum of all state vectors for questions in the training set labeled with that category. Each test item is then used in three conditions: a state vector formed from only the bag of words, one with only semantic roles, and one with both. These are then matched against the category vectors, and the category whose vector shows the highest cosine similarity is used for evaluation. The lexical-items-only condition was tested both on a semantic space built using lexical items only and on one built using both lexical items and semantic roles.

Table 1 shows the results. Most notably, the disturbance from the more elaborate model does not appreciably change the results of the simpler model: in this example, only a slight reranking of three questions in the output of the lexical model caused it to lose 0.6 percentage points of precision, while allowing the more sophisticated representations to be held holographically, concurrently with it, in the same space.

    Table 1: Question classification results; average rank of the correct answer and percentage of items with the correct answer at rank 1.

                                  coarse-grained (6 categories)    fine-grained (47 categories)
                                  av rank of      accuracy %       av rank of      accuracy %
                                  correct label                    correct label
    Words-only space
      words only                  1.91            60.2             4.84            57.4
    Combined semantic space
      words only                  1.92            60.2             5.88            56.8
      semantic roles              1.91            58.0             6.78            60.2
      roles + words               1.85            66.2             4.64            62.4
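The classification procedure described above can be sketched as follows. This is our simplified reconstruction with an invented toy training set, not the code or data of the experiment: category vectors are sums of question state vectors, a state vector is the bag of word vectors plus role-permuted lemma vectors, and a test question is assigned the category whose vector it is closest to by cosine.

    import numpy as np
    from collections import defaultdict

    D = 2000
    rng = np.random.default_rng(6)
    word_vec = defaultdict(lambda: rng.choice([-1, 1], size=D))
    role_perm = defaultdict(lambda: rng.permutation(D))

    def state_vector(tokens, roles=()):
        """Bag of surface-token vectors plus, for each (role, lemma) pair of interest,
        the lemma vector permuted by a role-specific permutation."""
        s = np.zeros(D)
        for t in tokens:
            s += word_vec[t]
        for role, lemma in roles:
            s += word_vec[lemma][role_perm[role]]
        return s

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    # Tiny invented training set: (tokens, roles, label).
    training = [
        ("how long is the boardwalk".split(), [("qword", "how"), ("subject", "boardwalk")], "Number"),
        ("who wrote hamlet".split(),          [("qword", "who"), ("subject", "hamlet")],    "Human"),
        ("where is bertinoro".split(),        [("qword", "where"), ("subject", "bertinoro")], "Location"),
    ]
    category = defaultdict(lambda: np.zeros(D))
    for tokens, roles, label in training:
        category[label] += state_vector(tokens, roles)

    # Classify a test question by the closest category vector.
    test = state_vector("who painted guernica".split(), [("qword", "who"), ("subject", "guernica")])
    print(max(category, key=lambda c: cosine(category[c], test)))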
6 CONCLUSIONS AND RELATION TO OTHER APPROACHES
Human language has a large and varying number of features, both lexical items and constructions, which interact to represent referential and relational content, communicative intentions of the speaker, situational references, discourse structure, and many other things. A hyperdimensional representation is eminently suitable for representing language: it can aggregate and handle the wealth of linguistic data and its range of individually weak features. It also seamlessly accommodates information from outside the linguistic signal in the same representation.

A hyperdimensional representation can work in conjunction with other representations: it can act as a bridge between symbolic and continuous models by accepting symbolic data, and can thus serve as an encoder for e.g. neural models, embeddings-based models, or other approximative classification schemes. It allows the seamless and explicit addition of arbitrarily complex and abstract features to a representation, and their recovery from it. A model with an accessible symbolic representation together with continuous data allows for choice and preference to be represented simultaneously, and allows objective functions for learning to be represented on levels of abstraction salient and relevant to the task at hand.

In addition, the memory footprint of a hyperdimensional model based on random indexing is habitable: an overly large amount of data leads to saturation of the model and a graceful degradation of performance, not to memory overflow. We expect to see how models based on this approach can be used not only for experimentation but, since their design is principled and based on generality and accessibility, for application purposes.

REFERENCES
[1] William Croft. 2005. Radical and typological arguments for radical construction grammar. In Construction Grammars: Cognitive grounding and theoretical extensions, Jan-Ola Östman and Mirjam Fried (Eds.). John Benjamins, Amsterdam.
[2] Ross W. Gayler. 2004. Vector symbolic architectures answer Jackendoff's challenges for cognitive neuroscience. arXiv preprint cs/0412059 (2004).
[3] Zellig Harris. 1968. Mathematical structures of language. Interscience Publishers.
[4] Aditya Joshi, Johan T. Halseth, and Pentti Kanerva. 2016. Language geometry using random indexing. In International Symposium on Quantum Interaction. Springer, 265–274.
[5] Pentti Kanerva. 2009. Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation 1, 2 (2009), 139–159.
[6] Pentti Kanerva, Jan Kristoferson, and Anders Holst. 2000. Random indexing of text samples for latent semantic analysis. In Proceedings of the Cognitive Science Society, Vol. 1.
[7] Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review 104, 2 (1997).
[8] Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics (COLING). International Committee for Computational Linguistics.
[9] Seppo Mustonen. 1965. Multiple discriminant analysis in linguistic problems. Statistical Methods in Linguistics 4 (1965), 37–44.
[10] Tony A. Plate. 1995. Holographic reduced representations. IEEE Transactions on Neural Networks 6, 3 (1995), 623–641.
[11] Tony A. Plate. 2003. Holographic Reduced Representation: Distributed representation for cognitive structures. Number 150 in CSLI Lecture Notes. CSLI Publications.
[12] Magnus Sahlgren. 2006. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. PhD dissertation. Department of Linguistics, Stockholm University.
[13] Magnus Sahlgren. 2008. The distributional hypothesis. Rivista di Linguistica (Italian Journal of Linguistics) 20, 1 (2008), 33–53.
[14] Magnus Sahlgren, Anders Holst, and Pentti Kanerva. 2008. Permutations as a means to encode order in word space. In The 30th Annual Meeting of the Cognitive Science Society (CogSci'08).
[15] Gerard Salton, A. Wong, and C. S. Yang. 1975. A vector space model for automatic indexing. Commun. ACM 18, 11 (1975), 613–620. DOI:http://dx.doi.org/10.1145/361219.361220
[16] Fredrik Sandin, Blerim Emruli, and Magnus Sahlgren. 2017. Random indexing of multidimensional data. Knowledge and Information Systems 52, 1 (2017), 267–290.
[17] Hinrich Schütze. 1993. Word Space. In Proceedings of the 1993 Conference on Advances in Neural Information Processing Systems (NIPS'93). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 895–902.