Hyperdimensional Utterance Spaces
A more transparent representation for situated language understanding

Jussi Karlgren, Gavagai & KTH Royal Institute of Technology, Stockholm, Sweden (jussi@kth.se)
Pentti Kanerva, Redwood Center for Theoretical Neuroscience, UC Berkeley (pkanerva@csli.stanford.edu)
Jussi Karlgren's work was done as a visiting scholar at the Department of Linguistics at Stanford University, supported by a generous Vinnmer Marie Curie grant from Vinnova, the Swedish Governmental Agency for Innovation Systems.

© 2018 Copyright held by the author(s). DESIRES 2018, August 2018, Bertinoro, Italy

ABSTRACT
Human language has a large and varying number of features, both lexical items and constructions, which interact to represent various aspects of communicative information. High-dimensional semantic spaces have proven useful and effective for aggregating and processing lexical information for many language processing tasks. This paper describes a hyperdimensional processing model for language data, a straightforward extension of models previously used for words to the handling of utterance- or text-level information. A hyperdimensional model is able to represent a broad range of linguistic and extra-linguistic features in a common integral framework which is suitable as a bridge between symbolic and continuous representations, as an encoding scheme for symbolic information, and as a basis for feature space exploration. This paper provides an overview of the framework and an example of how it is used in a pilot experiment.

CCS CONCEPTS
• Information systems → Content analysis and feature selection; • Computing methodologies → Knowledge representation and reasoning; Natural language processing.

KEYWORDS
hyperdimensional computing, utterance-level semantics, constructional grammar, knowledge representation and reasoning

1 REPRESENTING LANGUAGE
Language is a general-purpose representation of human knowledge, and models to process it vary in the degree to which they are bound to some task or some specific usage. Currently, the trend is to learn regularities and representations with as little explicit knowledge-based linguistic processing as possible, and recent advances in such general models for end-to-end learning on linguistic tasks have been quite successful. Most of those approaches make little use of information beyond the occurrence or co-occurrence of words in the linguistic signal, and take the single word to be the atomic unit.

The model proposed here for representing linguistic information allows the handy and transparent representation of linguistic items beyond the single word, and of extra-linguistic data that enrich the linguistic signal, in a computationally convenient representation similar to what is currently the standard processing model for word-level information.

1.1 Requirements for a representation
There are some basic qualities we want a representation to hold to. A representation should have descriptive and explanatory power, be practical and convenient for further application, be reasonably true to human performance, provide defaults to smooth over situations where a language processing component lacks knowledge or data, and provide constraints where the decision space is too broad.

Neurophysiological plausibility: We want the model to be non-compiling, i.e. not to need a separate step to accommodate a new batch of data. We want it to exhibit bounded growth, not growing too rapidly with new data (but not necessarily being built to accommodate implausibly large amounts of data).

Behavioural adequacy: We want the model to be incremental, i.e. to improve its performance (however we choose to measure and evaluate performance) progressively with incoming data. While we want our model to rely on the surface form of the input, we do not accept that input analysis must be limited to white-space-based tokenisation: a more sophisticated model based on the identification of patterns or constructions in the input is as plausible as a naive one. We want our representation to allow for the explicit inclusion of analysis results beyond the word-by-word sequences typically used as input to today's models.

Computational habitability: We want the model to be evaluable, transparent, and computationally manageable in the face of the large and growing amounts of input data it is exposed to. We do not want it to make assumptions of a finite inventory of lexical items or expressions.

Explicit representation of features: We want the model to allow exploration by the explicit inclusion of features of potential interest, without requiring expensive recalculation and reconfiguration of the model.
Context and anchoring: We want the model to allow the inclusion of extra-linguistic data and annotations. Linguistic data is now available in new configurations, collected from situations which allow the explicit capture of location, time, participants, and other sensory data such as biometric data, meteorological data, and the social context of the author or speaker. These data are potentially of great interest, e.g. to resolve ambiguities or to understand anaphoric and deictic reference, and should not be represented separately from the linguistic signal.

1.2 Theoretical basis
We base our work on linguistic data on a vector space model of distributional semantics, incorporating constructional linguistic items together with lexical elements, and in the same way as we incorporate them.

Distributional semantics: Distributional semantics is based on well-established philosophical and linguistic principles, most clearly formulated by Zellig Harris [3]. Distributional semantic models aggregate observations of items in linguistic data and infer semantic similarity between linguistic items based on the similarity of their observed distributions. The idea is that if linguistic items — such as the words herring and cheese — tend to occur in the same contexts — say, in the vicinity of the word smörgåsbord — then we can assume that they have related meanings. This is known as the distributional hypothesis [13]. Distributional methods have gained tremendous interest in the past decades, due to the proliferation of large text streams and new data-oriented computational learning paradigms which are able to process large amounts of data. So far distributional methods have mostly been used for lexical tasks, and take fairly little sophisticated processing as input. This is, to a great extent, a consequence of the simple and attractively transparent representations used. This paper proposes a model to accommodate both simple and more complex linguistic items in the same representation.

Vector space models for meaning: Vector space models have been used frequently in information access, both for research experiments and as a building block for systems in practical use, at least since the early 1970s [15]. Vector space models have attractive qualities: processing vector spaces is a manageable implementational framework, they are mathematically well-defined and understood, and they are intuitively appealing, conforming to everyday metaphors such as "near in meaning" [17]. The vector space model for meaning is the basis for most information retrieval experimentation and implementation and for most machine learning experiments, and it is now the standard approach in most categorisation schemes, topic models, deep learning models, and other similar approaches, including the present model.

Construction Grammar: The construction grammar framework is characterised by the central claim that linguistic information is encoded similarly or even identically by lexical items — the words — and by their configurations — the syntax — both being linguistic items with equal salience and presence in the linguistic signal. The parsimonious character of construction grammar in its most radical formulations [e.g. 1] is attractive as a framework for integrating a dynamic and learning view of language use with a formal expression of language structure: it allows the representation of words together with constructions in a common framework. For our purposes construction grammar gives a theoretical foundation to a consolidated representation of both individual items in utterances and their configuration.

2 HYPERDIMENSIONAL COMPUTING
We present here the general framework for hyperdimensional computing, which has been used i.a. for modelling lexical meaning in language, and outline an extension of it to handle more constructional linguistic items.

The style of computing discussed here was first described by Plate [10, 11] and called Holographic Reduced Representation. The idea is to compute with high-dimensional vectors, or hypervectors [5], using operations that do not modify vector dimensionality during the course of operation and use. We use 2,000-dimensional vectors in these demonstrations and experiments.

Information encoded into a hypervector is distributed over all vector elements, hence "holographic." Computing begins by assigning random seed vectors to basic objects. In working with text, for example, each word in the vocabulary can be represented by a seed vector, also called the word's index vector or random label. These seed vectors remain unchanged throughout computations. We use two kinds of seed vectors consisting of 0s, 1s and −1s: sparse and dense. The elements of sparse vectors are mostly 0s; dense vectors have no 0s. In both sparse and dense vectors, 1s and −1s are equally probable (our sparse seed vectors have 10 of each), and thus the vectors have mean 0.

Representations of more complex objects are computed from the seed vectors with three operations. Two correspond to addition and multiplication of scalar numbers. Addition is ordinary vector addition, possibly weighted and normalized. Multiplication is performed elementwise and is also known as the Hadamard product. The third basic operation is permutation, which reorders (scrambles) the vector coordinates. The number of possible permutations is enormous.

One further operation on vectors measures their similarity. We use the cosine, with values between −1 and 1. A vector is maximally similar to itself and yields a cosine of 1. A cosine of 0 means that the two vectors are orthogonal and appear to have no information in common. A system for computing with hypervectors will also need to include a memory for such vectors.
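As a concrete illustration of these ingredients, the following minimal Python/NumPy sketch (our own; only the 2,000-dimensional setting and the ten +1s and ten −1s of the sparse seed vectors are taken from the text, the function names and random seed are assumptions) generates dense and sparse seed vectors and checks the cosine behaviour of random hypervectors.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2000  # dimensionality used in the demonstrations described above

def dense_seed():
    """Dense seed vector: equally probable +1s and -1s, mean 0."""
    return rng.choice([-1, 1], size=D)

def sparse_seed(k=10):
    """Sparse seed vector: k +1s and k -1s, all other elements 0."""
    v = np.zeros(D)
    nonzero = rng.choice(D, size=2 * k, replace=False)
    v[nonzero[:k]] = 1
    v[nonzero[k:]] = -1
    return v

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

A, B = dense_seed(), dense_seed()
print(cosine(A, A))   # 1.0: a vector is maximally similar to itself
print(cosine(A, B))   # near 0: two random hypervectors are quasiorthogonal
```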
2.1 Computing with Hypervectors
The following is a somewhat more formal overview of the properties of hypervectors used in this paper. They are readily seen in dense (seed) vectors A, B, C, . . . of equally probable 1s and −1s.

Distribution (dot product, cosine): Two vectors taken at random are dissimilar; they are approximately orthogonal — quasiorthogonal — with cosine close to 0. The number of quasiorthogonal vectors grows exponentially with dimensionality.

Addition (+) of vectors produces a vector that is similar to the inputs, e.g.,

    A + B + C ∼ A

A sum of vectors becomes increasingly similar to a vector that is repeatedly added into it, and its similarity to any one input decreases as more dissimilar or unrelated vectors are added into it. This provides a convenient way of collecting observational data for e.g. distributional semantics.

Multiplication (*) of vectors produces a vector that is dissimilar to the inputs, e.g.,

    A * B ≁ A

However, multiplication preserves similarity: the distance between Q * A and Q * B equals the distance between A and B.

A vector of ±1s multiplied by itself produces a vector of 1s,

    A * A = 1

which means that the vector is its own inverse. That makes multiplication convenient for variable binding: variable x bound to value a — i.e., {x = a} — can be encoded by X * A, and the value can be recovered from the bound pair by multiplication:

    X * (X * A) = (X * X) * A = 1 * A = A

Multiplication distributes over addition, just as in ordinary arithmetic, because vectors are added and multiplied elementwise:

    Q * (A + B + C + . . .) = (Q * A) + (Q * B) + (Q * C) + . . .

As a consequence, the value of x can be recovered approximately from the sum of several variable–value pairs, such as {x = a, y = b, z = c}:

    X * ((X * A) + (Y * B) + (Z * C))
    = (X * (X * A)) + (X * (Y * B)) + (X * (Z * C))
    = A + (X * Y * B) + (X * Z * C)
    = A + noise
    ∼ A

Elementwise multiplication is useful with dense vectors but not with sparse vectors, because the product of sparse vectors is usually a vector of 0s, and because a sparse vector has no inverse.

Random permutations (Πi) resemble multiplication: they produce a vector that is dissimilar to the input, they preserve similarity, they are invertible, and they distribute over addition; they also distribute over multiplication. Permutations provide a means to represent sequences and nested structure. For example, the sequence (a, b, c) can be encoded as a sum

    S3 = Π(Π(A)) + Π(B) + C = Π²(A) + Π(B) + C

or as a product

    P3 = Π²(A) * Π(B) * C

and extended to include d by S4 = Π(S3) + D or by P4 = Π(P3) * D. The inverse permutation Π⁻¹ can then be used to find out, for example, the second vector in S3, or what comes after A and before C in P3. If the pair (a, b) is encoded with two unrelated permutations Π1 and Π2 as Π1(A) + Π2(B), then the nested structure ((a, b), (c, d)) can be represented by

    Π1(Π1(A) + Π2(B)) + Π2(Π1(C) + Π2(D)) = Π11(A) + Π12(B) + Π21(C) + Π22(D)

where Πij is the permutation Πi Πj.

The power of computing with numbers follows from the fact that addition and multiplication form an algebraic structure called a field. We can expect computing with (dense) hypervectors to be equally powerful, because addition and multiplication approximate a field and are complemented by permutations that interact in useful ways with addition and multiplication.
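The algebra above is easy to exercise in code. The following is a minimal sketch in Python with NumPy (our own illustration, not code from the paper; seed(), Pi() and the variable names are assumptions): it recovers a bound value from a sum of variable–value pairs, and recovers the middle element of a permutation-encoded sequence.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 2000

def seed():
    return rng.choice([-1, 1], size=D)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Variable-value binding with elementwise multiplication (A * A = 1).
X, Y, Z = seed(), seed(), seed()
A, B, C = seed(), seed(), seed()
state = X * A + Y * B + Z * C              # encodes {x = a, y = b, z = c}
probe = X * state                          # = A + noise
print(cosine(probe, A), cosine(probe, B))  # clearly positive vs. near 0

# Sequence encoding with a fixed random permutation Pi.
perm = rng.permutation(D)
def Pi(v, times=1):
    for _ in range(times):
        v = v[perm]
    return v

S3 = Pi(A, 2) + Pi(B, 1) + C    # encodes the sequence (a, b, c)
inv = np.argsort(perm)          # index array for the inverse permutation
second = S3[inv]                # Π⁻¹(S3) = Π(A) + B + Π⁻¹(C)
print(cosine(second, B))        # well above the noise level
```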
3 HYPERDIMENSIONAL COMPUTING APPLIED TO LINGUISTIC DATA: RANDOM INDEXING
The operations presented above have been used in the Random Indexing approach to implement distributional semantics: for language identification, and to represent the meaning of words in a word space model.

3.1 Language Identification
In a recent experiment, high-dimensional vectors were used to represent properties of an entire text in a single text vector in order to identify the language it was written in [4]. The text vector of an unknown text sample was compared for similarity to precomputed language vectors assembled from processing known language samples. Character counts are known to be a fairly good indicator of language [9]. This model used frequencies of character sequences of length n — n-grams — observed in the text: as an example, the text "a book" gives rise to the trigrams "a b", " bo", "boo", and "ook". For an arbitrary alphabet of L letters there would be (L + 1)^n n-grams to keep track of; in the case of English, which uses an alphabet of 26 letters (plus space), this means keeping track of 27³ = 19,683 different trigram frequencies. These numbers grow quickly as the window size n increases.

In this experiment, each character was given a randomly generated index vector, and each window was represented by a (componentwise) product of its letter vectors, permuted according to their place in the window. For example, the trigram "boo" (from "book") was encoded into a window vector as V_w = Π²(B) * Π(O) * O. The text vector is then formed by adding together the vectors for all the windows in the text:

    text vector = Σ_{w ∈ text} V_w

The text vector for a text is then compared to previously and similarly computed language vectors, using the cosine as a distance measure. The language whose language vector shows the highest cosine to the text vector is assumed to be the language the text sample is written in. In the experiment, tri- and tetragrams yielded a language identification accuracy (on the EUROPARL corpus) of more than 97%.
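A minimal sketch of this encoding in Python with NumPy (our own illustration: the toy "language samples" below are stand-ins, where the reported experiment used EUROPARL data; the window encoding V_w = Π²(c1) * Π(c2) * c3 follows the text):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(2)
D, N = 2000, 3  # dimensionality and n-gram length (trigrams)

letter_vec = defaultdict(lambda: rng.choice([-1, 1], size=D))  # index vector per character
perm = rng.permutation(D)

def rotate(v, times):
    """Apply the fixed random permutation Π `times` times."""
    for _ in range(times):
        v = v[perm]
    return v

def text_vector(text):
    """Sum of window vectors V_w = Π²(c1) * Π(c2) * c3 over all trigrams in the text."""
    tv = np.zeros(D)
    for i in range(len(text) - N + 1):
        vw = np.ones(D)
        for pos, ch in enumerate(text[i:i + N]):
            vw = vw * rotate(letter_vec[ch], N - 1 - pos)
        tv += vw
    return tv

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Hypothetical toy samples; with such short texts the comparison is only indicative.
lang_vectors = {"en": text_vector("this is a book about dogs and bones"),
                "sv": text_vector("det här är en bok om hundar och ben")}
query = text_vector("a book about bones")
print(max(lang_vectors, key=lambda l: cosine(lang_vectors[l], query)))
```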
3.2 Lexical Semantic Space
To build a word space model representing distributional data for words, a text sample is scanned one word at a time. Each word in turn is the focus word of the scan, with a context window of k preceding and k succeeding words. When a word appears for the first time, it is assigned a randomly generated sparse index vector and an initially empty context vector of equal dimensionality. The index vectors of the words in the context window are added into the focus word's context vector. The resulting context vectors capture "bag-of-words" semantics of the focus words, reflecting their similarity of use. If the text sample is large enough, the similarity measure captures meaningful intuitions of term similarity.

The most important parameters for these models are word weighting, which assesses how notable an observation of some word is, and the size of the context, which ranges from entire "documents", as in Latent Semantic Analysis [6, 7], to local contexts of the 2 to 3 closest words. A broader context captures topical or associative similarity of words, whereas a more local context captures paradigmatic or replaceability relations such as synonymy or antonymy, and syntagmatic combinability relations such as attributive or predicative relations. A window size of 2 or 3 words before and after the focus word has been found to yield good results as measured by synonym tests and other similar lexical semantic tasks [12], and various methods for weighting the observed occurrences have been tested to achieve on-line learning without becoming too sensitive to infrequent anomalous occurrences.

A step toward capturing structure has been taken by treating the words before the focus word differently from the words after it. This is done by permuting their labels differently before adding them into the context vector. This use of permutation includes information about the preceding and the succeeding context separately, but allows both to be used in an aggregate similarity measure for a word [14]. Several current neural network models use similar insights to represent sequential information.
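The scanning procedure can be sketched as follows (our own illustration, assuming a window of two words on each side and two direction-marking permutations; weighting and normalisation are left out):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)
D, K = 2000, 2  # dimensionality and context window (2 words on each side)

def sparse_index(k=10):
    """Sparse index vector with k +1s and k -1s (a word's random label)."""
    v = np.zeros(D)
    idx = rng.choice(D, size=2 * k, replace=False)
    v[idx[:k]], v[idx[k:]] = 1, -1
    return v

index = defaultdict(sparse_index)              # index vector, created on first sight
context = defaultdict(lambda: np.zeros(D))     # accumulated context vector per word
left, right = rng.permutation(D), rng.permutation(D)  # direction-marking permutations

def train(tokens):
    for i, focus in enumerate(tokens):
        for j in range(max(0, i - K), min(len(tokens), i + K + 1)):
            if j == i:
                continue
            perm = left if j < i else right    # permute labels differently by direction
            context[focus] += index[tokens[j]][perm]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

train("dogs chew bones and cats chew bones too".split())
print(cosine(context["dogs"], context["cats"]))  # shared neighbours give similar vectors
```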
3.3 Structure in linguistic data
The context vectors computed in random indexing reflect semantic similarity of words, but they do not easily allow for the inclusion of structural characteristics of an utterance beyond what is observable from lexical statistics.

The operators described in Section 2 can be understood as a Vector Symbolic Architecture [2], and can be used to conveniently represent the many levels of abstraction in linguistic data. Suggestions to combine data in a tensor model [e.g. 16] are to some extent similar with respect to representational adequacy and power, but are much more laborious computationally. A feature of interest can be included in the representation by overlaying it using addition, and separated from others through permutation or multiplication, depending on how retrievable it is intended to be. A feature of interest can be included as a combination of other features, or independently of others.

Consider an utterance such as the one given in Example (1).

(1) Dogs chew bones.

This can be represented, as in typical search engine applications, as a combination of the vectors representing the lexical items in play. Those representations can be random index vectors, context vectors, or word embeddings obtained from previous analyses. The representation of the utterance can then most simply be achieved by adding together the vectors for each participating lexical item, as in Equation (1):

    v_dog + v_chew + v_bone    (1)

Of course, in most practical applications some informationally relevant scalar weighting of the items may be motivated:

    w_dog · v_dog + w_chew · v_chew + w_bone · v_bone    (2)
This simple lexical representation can easily be extended to accommodate more elaborate information. Observing that the utterance in question is in the present tense, we can add that information in, simply by adding a vector which is randomly generated to represent that observable feature of the utterance. To collect tense information and keep it separate, and separately retrievable, from the lexical information, it can be permuted with a designated, also randomly generated, permutation for tense-related information. This information can be added in and later retrieved by similarity matching of vectors, using the Π_tense permutation as a filter:

    w_dog · v_dog + w_chew · v_chew + w_bone · v_bone + Π_tense(v_present)    (3)

If it interests us to keep track of qualities of referents, we might want to add information about the morphological state of the word in question or about the character of some of those referents. In Equation (4) we note, e.g., that the referent of the agent of the clause ("dogs") is animate and that this referent is in the plural. This information can be added in and later retrieved by similarity matching of vectors, using a dedicated permutation such as Π_agent and vectors such as v_animate and v_plural together with v_dog, the representation of dog. The last term of Equation (4) anchors the utterance to extra-linguistic context in the same way, here a geographic position:

    w_dog · v_dog + w_chew · v_chew + w_bone · v_bone + Π_tense(v_present)
        + Π_agent(v_animate + v_plural + v_dog) + Π_location(v_41°24′5″N 2°9′23″E)    (4)

This principle enables the representation to harbour many types of information, regardless of how it has been generated or extracted from an utterance, all aggregated in the same hyperdimensional space, to be used for downstream processing in ways that the representation does not need to take into account at perception time.
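A sketch of how a state vector of the kind in Equations (1) to (4) can be assembled and queried (our own illustration: the lexical vectors are fresh random index vectors rather than learned context vectors, all weights are set to 1, and only the tense and agent roles of the running example are included):

```python
import numpy as np

rng = np.random.default_rng(5)
D = 2000
vec = {w: rng.choice([-1, 1], size=D)
       for w in ["dog", "chew", "bone", "present", "animate", "plural"]}

# One designated random permutation per feature role (tense, agent, ...).
role_perm = {r: rng.permutation(D) for r in ["tense", "agent"]}
def with_role(role, v):
    return v[role_perm[role]]

# Equation (4) without weights: lexical items plus role-filtered features.
utterance = (vec["dog"] + vec["chew"] + vec["bone"]
             + with_role("tense", vec["present"])
             + with_role("agent", vec["animate"] + vec["plural"] + vec["dog"]))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Query tense: undo the tense permutation and compare against candidate vectors.
probe = utterance[np.argsort(role_perm["tense"])]
print(cosine(probe, vec["present"]))  # clearly above the noise level
print(cosine(probe, vec["plural"]))   # at the noise level
```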
4 QUANTITATIVE CHARACTERISTICS
The general task for a knowledge representation is to provide features with which to represent observed characteristics of interest of some situation in the world, to aggregate such observable features of some situation of interest into a represented state, to allow further processes to verify whether an observation has been recorded in that state, and to decompose a state representation into the separate features which have composed it.

The advantage of using a holographic rather than a localist model is that, where a one-hot model allows a d-dimensional representation to hold d features with lossless aggregation and retrieval from a state, the variation space of the hyperdimensional approach affords, by virtue of the random patterns, permutations, and multiplications, a vastly larger feature palette. This makes it possible, as shown above in Section 3, for an entire vocabulary and its cooccurrence statistics to be handily accommodated in a 2,000-dimensional space.

The aggregation of features into a state is done simply by vector addition, occasionally using a permutation to separate aspects of a feature. Verifying whether a feature is present in a state vector is done most simply by a dot product or a cosine similarity measure. The choice of d, the dimensionality of the representation, determines the capacity of the space. As can be expected, a larger dimensionality allows greater capacity: a 100-dimensional space can store less information, i.e. fewer distinct features per state, than a 2,000-dimensional one. If we aggregate N (near-)orthogonal features by addition into a state vector, the cosine of each feature vector to the resulting state vector will be √(1/N). The expected size of N determines how large d must be chosen to be, to ensure that this cosine stays at a safe margin from the noise threshold occasioned by the randomisation procedure. If a state vector is expected to hold on the order of 100 unweighted feature vectors, the resulting cosine between each feature vector and the state vector will be 0.1 on average. In a 1,000-dimensional space, this is about three to four times the noise threshold; in a 2,000-dimensional space, about five to six times the noise threshold. The graphs in Figure 1 show how the noise threshold compares across some typical dimensionality settings.

The size of the representation does not grow with the number of feature vectors aggregated into the state vector, except in the density of the state vector. Nor does the number of potential features — the size of the lexicon and the combined size of all potentially interesting constructions — occasion more than linear growth for the system in its entirety.
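The margin between member features and noise can be checked with a small simulation (our own, in the spirit of, but not reproducing, the comparison in Figure 1): bundle 100 random feature vectors into a state vector and compare the cosine of one of its members with the cosine of an unrelated random vector at a few dimensionalities.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 100  # number of unweighted feature vectors bundled into one state vector

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

for d in (100, 500, 1000, 2000):
    feats = rng.choice([-1, 1], size=(N, d))
    state = feats.sum(axis=0)
    member = cosine(feats[0], state)                    # expected around sqrt(1/N) = 0.1
    noise = cosine(rng.choice([-1, 1], size=d), state)  # expected near 0, spread ~ 1/sqrt(d)
    print(d, round(member, 3), round(noise, 3))
```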
[Figure 1: Cosine of feature vectors to state vector compared to random unrelated vectors in 100, 500, 1,000, and 2,000 dimensions.]


5 EMPIRICAL VALIDATION
We intend to validate our approach by applying selected technologies to tasks which require understanding of linguistic content. Typical tasks we expect to address are more advanced language understanding tasks such as authorship and genre identification, author profiling, attitude and sentiment analysis, viewpoint analysis, and topic detection and tracking, as well as some more theory-internal tasks such as semantic role labeling. The tasks we are interested in, and where holographic representations are most obviously useful, are those where a broad range of features is necessary to perform well, and where performance is constrained by the challenge of incorporating information on several levels of abstraction simultaneously in an integrated processing model.

Current experimentation with this model is focussed on authorship profiling, which relies on features beyond the lexical, and on tracking social media posts on meteorological events, which is highly temporally determined and locational in nature.

A simple sentence-level task demonstrated here is that of question classification: assigning a category to a question which will impose a semantic constraint on the answer. Example categories are "Human, individual", "Number, date", "Location, city" [8]. This task forms a basis for many question answering systems, and a fair amount of effort has been put into optimising performance on it. The most important features for this task, as for many language tasks, are lexical: the words "How long" in the question "How long is the Coney Island boardwalk?" indicate that the expected answer is a number, such as a distance or a time period. Even fairly naive word frequency methods achieve around 50% accuracy, and with some multi-word terms and dependencies added, optimised systems yield at least 80% accuracy.

We do not attempt here to push the envelope for this data set — the task is today considered to be solved and superseded by more elaborate tasks in question answering. We use it to demonstrate how our representation allows the addition of more information in a fixed dimensionality without perturbing simpler models in the same representation.

We used a set of 5,500 previously labelled questions to train a model which incorporates all words in the model together with some semantic roles those words enter into. We selected a small set of roles which we expect to be of interest for the purpose of inferring the target entity of the question: the question word itself (Who, How, What, etc.), the tense of the question's main verb, the subject of the main verb, any other clause adverbial, and the binary feature of negation or not. The labels were given on two levels of granularity: coarse-grained ("Human", "Location", "Number", "Description", "Entity", "Abbreviation") and fine-grained (47 subcategories of the coarse-grained categories).

The state vector of a question is composed additively of the bag of surface tokens and, for each role of interest, the lemma form of the word permuted by a role-specific permutation. Each category is represented by a sum of all state vectors for questions in the training set labeled with that category. In one condition, each label was thus given a state vector built from the words of every question it has been applied to; in another, semantic roles were added together with the words that participated in them, in each case permuted for the role in question; in a third condition both were added together.

A test set of 500 questions was then used to evaluate the models. Each test item was represented in three conditions: a state vector formed from only the bag of words, one with only roles, and one with both. These were then matched to the category vectors, and the category whose vector shows the closest cosine distance was used for evaluation. The lexical-items-only condition was tested both on a semantic space built using lexical items only and on one using both lexical items and semantic roles. Table 1 shows the results, most notably that the disturbance from the more elaborate model does not appreciably change the results of the simpler model: in this example, only a slight reranking of three questions in the output for the lexical model caused it to lose 0.6 percentage points of precision, while allowing the more sophisticated representations to be carried holographically concurrently with it.

                              coarse-grained (6 categories)   fine-grained (47 categories)
                              avg. rank of     accuracy       avg. rank of     accuracy
                              correct label    at rank 1 (%)  correct label    at rank 1 (%)
  Words-only space
    words only                1.91             60.2           4.84             57.4
  Combined semantic space
    words only                1.92             60.2           5.88             56.8
    semantic roles            1.91             58.0           6.78             60.2
    roles + words             1.85             66.2           4.64             62.4

Table 1: Question classification results: average rank of the correct answer and percentage of items with the correct answer at rank 1.
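The classification procedure can be sketched as follows (our own toy illustration, not the authors' pipeline: the training questions, the tokenisation, and the single question-word role are stand-ins for the 5,500-question training set and the richer role inventory described above):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(7)
D = 2000
word_vec = defaultdict(lambda: rng.choice([-1, 1], size=D))
role_perm = {"qword": rng.permutation(D)}  # one role-specific permutation

def question_vector(tokens, roles):
    """Bag of surface tokens plus role-permuted lemmas, added into one state vector."""
    v = sum(word_vec[t] for t in tokens)
    for role, lemma in roles.items():
        v = v + word_vec[lemma][role_perm[role]]
    return v

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy training data: (tokens, roles, label).
train = [("how long is the boardwalk".split(), {"qword": "how"}, "NUMBER"),
         ("how far is the moon".split(), {"qword": "how"}, "NUMBER"),
         ("who wrote this book".split(), {"qword": "who"}, "HUMAN"),
         ("who chews bones".split(), {"qword": "who"}, "HUMAN")]

category = defaultdict(lambda: np.zeros(D))
for tokens, roles, label in train:
    category[label] += question_vector(tokens, roles)  # category = sum of its questions

test = question_vector("how long is a year".split(), {"qword": "how"})
print(max(category, key=lambda c: cosine(category[c], test)))  # expected: NUMBER
```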
                                                                   large amount of data leads to saturation of the model and a
6    CONCLUSIONS AND RELATION TO                                   graceful degradation of performance, not to memory overflow.
                                                                   We expect to see how models based on this approach can be
     OTHER APPROACHES
                                                                   used not only for experimentation but, since their designed
Human language has a large and varying number of fea-              is principled and based on generality and accessibility, for
tures, both lexical items and constructions, which interact to     application purposes.
represent referential and relational content, communicative
intentions of the speaker, situational references, discourse
                                                                   REFERENCES
structure, and many others. A hyperdimensional represen-
                                                                    [1] William Croft. 2005. Radical and typological arguments for radi-
tation is eminently suitable for representing language, to              cal construction grammar. In Construction Grammars: Cognitive
aggregate and handle the wealth of linguistic data and its              grounding and theoretical extensions, Jan-Ola Östman and Mir-
range of each in themselves weak features. It also seamlessly           jam Fried (Eds.). John Benjamins, Amsterdam.
                                                                    [2] Ross W Gayler. 2004. Vector symbolic architectures answer Jack-
accommodates information from outside the linguistic signal             endoff’s challenges for cognitive neuroscience. arXiv preprint
in the same representation.                                             cs/0412059 (2004).
 [3] Zellig Harris. 1968. Mathematical structures of language. Inter-
     science Publishers.
 [4] Aditya Joshi, Johan T Halseth, and Pentti Kanerva. 2016. Lan-
     guage geometry using random indexing. In International Sympo-
     sium on Quantum Interaction. Springer, 265–274.
 [5] Pentti Kanerva. 2009. Hyperdimensional computing: An intro-
     duction to computing in distributed representation with high-
     dimensional random vectors. Cognitive Computation 1, 2 (2009),
     139–159.
 [6] Pentti Kanerva, Jan Kristoferson, and Anders Holst. 2000. Ran-
     dom indexing of text samples for latent semantic analysis. In
     Proceedings of the Cognitive Science Society, Vol. 1.
 [7] Thomas K Landauer and Susan T Dumais. 1997. A solution to
     Plato’s problem: The latent semantic analysis theory of acquisition,
     induction, and representation of knowledge. Psychological review
     104, 2 (1997).
 [8] Xin Li and Dan Roth. 2002. Learning question classifiers. In Pro-
     ceedings of the 19th international conference on Computational
     linguistics (COLING). International Committee for Computa-
     tional Linguistics.
 [9] Seppo Mustonen. 1965. Multiple discriminant analysis in linguistic
     problems. Statistical Methods in Linguistics 4 (1965), 37–44.
[10] Tony A Plate. 1995. Holographic reduced representations. IEEE
     Transactions on Neural Networks 6, 3 (1995), 623–641.
[11] Tony A Plate. 2003. Holographic Reduced Representation: Dis-
     tributed representation for cognitive structures. Number 150 in
     CSLI Lecture notes. CSLI Publications.
[12] Magnus Sahlgren. 2006. The Word-Space Model: Using distri-
     butional analysis to represent syntagmatic and paradigmatic
     relations between words in high-dimensional vector spaces. PhD
     Dissertation. Department of Linguistics, Stockholm University.
[13] Magnus Sahlgren. 2008. The distributional hypothesis. Rivista
     di Linguistica (Italian Journal of Linguistics) 20 (2008), 33–53.
     Issue 1.
[14] Magnus Sahlgren, Anders Holst, and Pentti Kanerva. 2008. Per-
     mutations as a means to encode order in word space. In The 30th
     Annual Meeting of the Cognitive Science Society (CogSci’08).
[15] Gerard Salton, A. Wong, and C. S. Yang. 1975. A vector space
     model for automatic indexing. Commun. ACM 18, 11 (1975),
     613–620. DOI:http://dx.doi.org/10.1145/361219.361220
[16] Fredrik Sandin, Blerim Emruli, and Magnus Sahlgren. 2017. Ran-
     dom indexing of multidimensional data. Knowledge and Infor-
     mation Systems 52, 1 (2017), 267–290.
[17] Hinrich Schütze. 1993. Word Space. In Proceedings of the 1993
     Conference on Advances in Neural Information Processing Sys-
     tems, NIPS’93. Morgan Kaufmann Publishers Inc., San Francisco,
     CA, USA, 895–902.