<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hyperdimensional Utterance Spaces</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jussi Karlgren</string-name>
          <email>jussi@kth.se</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pentti Kanerva</string-name>
          <email>pkanerva@csli.stanford.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Gavagai &amp; KTH Royal Institute of Technology</institution>
          ,
          <addr-line>Stockholm</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Redwood Center for Theoretical Neuroscience</institution>
          ,
          <addr-line>UC, Berkeley</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
<p>Human language has a large and varying number of features, both lexical items and constructions, which interact to represent various aspects of communicative information. High-dimensional semantic spaces have proven useful and effective for aggregating and processing lexical information for many language processing tasks. This paper describes a hyperdimensional processing model for language data, a straightforward extension of models previously used for words to handling utterance or text level information. A hyperdimensional model is able to represent a broad range of linguistic and extra-linguistic features in a common integral framework which is suitable as a bridge between symbolic and continuous representations, as an encoding scheme for symbolic information and as a basis for feature space exploration. This paper provides an overview of the framework and an example of how it is used in a pilot experiment.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Content analysis and
feature selection; • Computing methodologies →
Knowledge representation and reasoning; Natural language
processing;</p>
    </sec>
    <sec id="sec-2">
      <title>REPRESENTING LANGUAGE</title>
      <p>Language is a general-purpose representation of human
knowledge, and models to process it vary in the degree to which
they are bound to some task or some specific usage. The current
trend is to learn regularities and representations with as little
explicit knowledge-based linguistic processing as possible,
and recent advances in such general models for end-to-end
learning to address linguistic tasks have been quite
successful. Most of those approaches make little use of information
beyond the occurrence or co-occurrence of words in the
linguistic signal and take the single word to be the atomic unit.
(Jussi Karlgren’s work was done as a visiting scholar at the Department
of Linguistics at Stanford University, supported by a generous Vinnmer
Marie Curie grant from Vinnova, the Swedish Governmental Agency
for Innovation Systems.)</p>
    </sec>
    <sec id="sec-3">
      <title>Requirements for a representation</title>
      <p>There are some basic qualities we want a representation
to satisfy. A representation should have descriptive and
explanatory power, be practical and convenient for further
application, be reasonably true to human performance,
provide defaults to smooth over situations where a language
processing component lacks knowledge or data, and provide
constraints where the decision space is too broad.</p>
      <p>Neurophysiological plausibility: We want the model
to be non-compiling, i.e. not to need a separate step to
accommodate a new batch of data. We want it to exhibit
bounded growth, not growing too rapidly with new
data (but not necessarily built to accommodate
implausibly large amounts of data).</p>
      <p>Behavioural adequacy: We want the model to be
incremental, i.e. to improve its performance (however we
choose to measure and evaluate performance)
progressively with incoming data. While we want our model
to rely on the surface form of the input, we do not
accept the necessity of limiting the input analysis
to white-space based tokenisation: a more
sophisticated model based on the identification of patterns
or constructions in the input is as plausible as a naive
one. We want our representation to allow for explicit
inclusion of analysis results beyond the word-by-word
sequences typically used as input to today’s models.</p>
      <p>Computational habitability: We want the model to
be evaluable and transparent, and manageable
computationally in the face of the large and growing amounts of
input data it is exposed to. We do not want it to make
assumptions of a finite inventory of lexical items or
expressions.</p>
      <p>Explicit representation of features: We want the model
to allow exploration by the explicit inclusion of
features of potential interest, without requiring expensive
recalculation and reconfiguration of the model.</p>
      <p>Context and anchoring: We want the model to allow the
inclusion of extra-linguistic data and annotations.
Linguistic data is now available in new configurations,
collected from situations which allow the explicit
capture of location, time, participants, and other sensory
data such as biometric data, meteorological data, and
the social context of the author or speaker. These data are
potentially of great interest, e.g. to resolve ambiguities
or to understand anaphora and deictic reference, and
should not be represented separately from the linguistic
signal.</p>
    </sec>
    <sec id="sec-4">
      <title>Theoretical basis</title>
      <p>We base our work on linguistic data on a vector space model
of distributional semantics, incorporating constructional
linguistic items together with and similarly to the way we
incorporate lexical elements.</p>
      <p>
        Distributional semantics: Distributional semantics is
based on well-established philosophical and linguistic
principles, most clearly formulated by Zellig Harris
([1968]). Distributional semantic models aggregate
observations of items in linguistic data and infer semantic
similarity between linguistic items based on the
similarity of their observed distributions. The idea is that
if linguistic items — such as e.g. the words herring
and cheese — tend to occur in the same contexts —
say, in the vicinity of the word smörgåsbord — then
we can assume that they have related meanings. This
is known as the distributional hypothesis. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
Distributional methods have gained tremendous interest in
the past decades, due to the proliferation of large text
streams and new data-oriented computational learning
paradigms which are able to process large amounts of
data. So far distributional methods have mostly been
used for lexical tasks, and include fairly little
sophisticated processing as input. This is, to a great
extent, a consequence of the simple and attractively
transparent representations used. This paper proposes
a model to accommodate both simple and more
complex linguistic items within the same representation.
Vector space models for meaning: Vector space
models have been frequently used in information access, both for
research experiments and as a building block for
systems in practical use, at least since the early 1970s. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]
Vector space models have attractive qualities:
processing vector spaces is a manageable implementational
framework, they are mathematically well-defined and
well understood, and they are intuitively appealing,
conforming to everyday metaphors such as “near in
meaning.” [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] The vector space model for meaning is the
basis for almost all information retrieval
experimentation and implementation, for most machine learning
experiments, and is now the standard approach in most
categorisation schemes, topic models, deep learning
models, and other similar approaches, including the
present model.
      </p>
      <p>Construction grammar: The construction grammar
framework is characterised by the central claim that
linguistic information is encoded similarly or even
identically for lexical items—the words—and their
configurations—the syntax—both being linguistic items
with equal salience and presence in the linguistic signal.
The parsimonious character of construction grammar
in its most radical formulations [1, e.g.] is attractive as
a framework for integrating a dynamic and learning
view of language use with a formal expression of
language structure: it allows the representation of words
together with constructions in a common framework.
For our purposes construction grammar gives a
theoretical foundation to a consolidated representation of both
individual items in utterances and their configuration.</p>
    </sec>
    <sec id="sec-5">
      <title>HYPERDIMENSIONAL COMPUTING</title>
      <p>We present here the general framework for hyperdimensional
computing, which has been used, among other things, for modelling
lexical meaning in language, and outline an extension of it to handle
more constructional linguistic items.</p>
      <p>
        The style of computing discussed here was first described by
Plate [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ] and called Holographic Reduced Representation.
The idea is to compute with high-dimensional vectors or
hypervectors [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] using operations that do not modify vector
dimensionality during the course of operation and use. We
use 2,000-dimensional vectors in these demonstrations and
experiments.
      </p>
      <p>Information encoded into a hypervector is distributed over
all vector elements, hence “holographic.” Computing begins
by assigning random seed vectors to basic objects. In working
with text, for example, each word in the vocabulary can be
represented by a seed vector, also called the word’s index
vector or random label. These seed vectors remain unchanged
throughout computations. We use two kinds of seed
vectors consisting of 0s, 1s and −1s: sparse and dense. The
elements of sparse vectors are mostly 0s; dense vectors
have no 0s. In both sparse and dense vectors, 1s and −1s are
equally probable (our sparse seed vectors have 10 of each),
and thus the vectors have mean = 0.</p>
      <p>Representations of more complex objects are computed
from the seed vectors with three operations. Two correspond
to addition and multiplication of scalar numbers. Addition is
ordinary vector addition, possibly weighted and normalized.
Multiplication is performed elementwise, also known as the
Hadamard product. The third basic operation is permutation,
which reorders (scrambles) vector coordinates. The number
of possible permutations is enormous.</p>
      <p>One further operation on vectors measures their similarity.
We use the cosine, with values between −1 and 1. A
vector is maximally similar to itself and yields a cosine of
1. A cosine of 0 means that the two vectors are orthogonal
and appear to have no information in common. A system
for computing with hypervectors will also need to include a
memory for such vectors.</p>
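As a concrete illustration of the seed vectors and the cosine measure, the following NumPy sketch (the dimensionality, random seed, and all names are our own illustrative choices, not part of any published implementation) generates dense and sparse seed vectors and shows that two random hypervectors are quasiorthogonal while a sum stays similar to its inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2_000  # the dimensionality used in the paper's demonstrations

def dense_seed():
    # Dense seed vector: equally probable +1s and -1s, no 0s.
    return rng.choice([-1, 1], size=D)

def sparse_seed(k=10):
    # Sparse seed vector: k +1s, k -1s, the rest 0s (mean 0).
    v = np.zeros(D)
    idx = rng.choice(D, size=2 * k, replace=False)
    v[idx[:k]] = 1
    v[idx[k:]] = -1
    return v

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a, b, c = dense_seed(), dense_seed(), dense_seed()

# Two random hypervectors are quasiorthogonal: cosine close to 0.
print(round(cosine(a, b), 2))
# A sum remains similar to each of its inputs: cosine well above 0.
print(round(cosine(a + b + c, a), 2))
```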
    </sec>
    <sec id="sec-7">
      <title>Computing with Hypervectors</title>
      <p>The following is a somewhat more formal overview of the
properties of hypervectors used in this paper. They are readily
seen in dense (seed) vectors A, B, C, . . . of equally probable
1s and −1s.</p>
      <p>Distribution (dot product, cosine): Two vectors taken at
random are dissimilar; they are approximately orthogonal—
quasiorthogonal, cosine close to 0. The number of
quasiorthogonal vectors grows exponentially with dimensionality.</p>
      <p>Addition (+) of vectors produces a vector that is similar
to the inputs, e.g.,
A + B + C ∼ A
A sum of vectors becomes increasingly similar to the vectors it is
a sum of if they are repeatedly added into it, and the similarity decreases
with the number of dissimilar or unrelated vectors added into
it. This provides a convenient way of collecting observational
data for e.g. distributional semantics.</p>
      <p>Multiplication (*) of vectors produces a vector that is
dissimilar to the inputs, e.g.,
A * B ≁ A
However, multiplication preserves similarity: the distance
between A * B and A * C equals the distance between B and
C.</p>
      <p>A vector of ±1s multiplied by itself produces a vector of 1s,
which means that the vector is its own inverse. That makes
multiplication convenient for variable binding: variable X
bound to value A—i.e., {X = A}—can be encoded by X * A,
and the value can be recovered from the bound pair by
multiplication:</p>
      <p>X * (X * A) = (X * X) * A = 1 * A = A</p>
      <p>Multiplication distributes over addition, just as in ordinary
arithmetic, because vectors are added and multiplied
elementwise:
X * (A + B + C + . . .) = (X * A) + (X * B) + (X * C) + . . .
As a consequence, the value of X can be recovered
approximately from the sum of several variable–value pairs, such as
{X = A, Y = B, Z = C}:
X * ((X * A) + (Y * B) + (Z * C))
= (X * (X * A)) + (X * (Y * B)) + (X * (Z * C))
= A + (X * Y * B) + (X * Z * C)
= A + noise
∼ A</p>
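The binding-and-recovery identities above can be checked numerically. The sketch below (variable names and the random seed are illustrative assumptions) encodes three variable–value pairs in one superposed record and recovers one value up to noise:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 2_000
seed = lambda: rng.choice([-1, 1], size=D)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

x, y, z = seed(), seed(), seed()  # variables
a, b, c = seed(), seed(), seed()  # values

# Superposed record of the bound pairs {x = a, y = b, z = c}.
record = x * a + y * b + z * c

# A vector of +/-1s is its own inverse, so multiplying the record
# by x unbinds a; the remaining terms become noise.
probe = x * record

print(round(cosine(probe, a), 2))  # clearly similar to a
print(round(cosine(probe, b), 2))  # close to 0
```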
      <p>Elementwise multiplication is useful with dense vectors
but not with sparse vectors because the product of sparse
vectors is usually a vector of 0s, and because a sparse vector
has no inverse.</p>
      <p>Random permutations (Π) resemble multiplication: they
produce a vector that is dissimilar to the input, they
preserve similarity, are invertible, and distribute over addition;
they also distribute over multiplication. Permutations provide
a means to represent sequences and nested structure. For
example, the sequence (A, B, C) can be encoded as a sum
or as a product
S3 = Π(Π(A)) + Π(B) + C = Π²(A) + Π(B) + C
P3 = Π²(A) * Π(B) * C
and extended to include D by S4 = Π(S3) + D or by P4 =
Π(P3) * D. The inverse permutation Π⁻¹ can then be used to
find out, for example, the second vector in S3 or what comes
after A and before C in P3. If the pair (A, B) is encoded with
two unrelated permutations Π1 and Π2 as Π1(A) + Π2(B)
then the nested structure ((A, B), (C, D)) can be represented
by
Π1(Π1(A) + Π2(B)) + Π2(Π1(C) + Π2(D))
= Π11(A) + Π12(B) + Π21(C) + Π22(D)
where Πij is the permutation ΠiΠj.</p>
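A minimal sketch of sequence encoding with permutations, using coordinate rotation (NumPy's `roll`) as a convenient stand-in for a random permutation Π, as is common in random indexing implementations; rotating back plays the role of the inverse permutation:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 2_000
seed = lambda: rng.choice([-1, 1], size=D)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a, b, c = seed(), seed(), seed()

# Sequence (a, b, c) as a sum: s3 = Pi^2(a) + Pi(b) + c,
# with coordinate rotation standing in for the permutation Pi.
s3 = np.roll(a, 2) + np.roll(b, 1) + c

# Applying the inverse permutation once asks: what is in
# the second position of the sequence?
second = np.roll(s3, -1)

print(round(cosine(second, b), 2))  # similar to b
print(round(cosine(second, a), 2))  # close to 0
```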
      <p>The power of computing with numbers follows from the
fact that addition and multiplication form an algebraic
structure called a field. We can expect computing with (dense)
hypervectors to be equally powerful because addition and
multiplication approximate a field and are complemented by
permutations that interact in useful ways with addition and
multiplication.
</p>
    </sec>
    <sec id="sec-8">
      <title>HYPERDIMENSIONAL COMPUTING APPLIED TO LINGUISTIC DATA: RANDOM INDEXING</title>
      <p>The operations presented above have been used in the
Random Indexing approach to implement distributional semantics
for language identification and to represent the meaning of
words in a word space model.
</p>
    </sec>
    <sec id="sec-12">
      <title>Language Identification</title>
      <p>
        In a recent experiment, high-dimensional vectors were used
to represent properties of an entire text in a single text vector
in order to identify the language it was written in. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] The
text vector of an unknown text sample was compared for
similarity to precomputed language vectors assembled from
processing known language samples. Character counts are
known to be a fairly good indicator of language. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] This
model used frequencies of character sequences of length n—
n-grams—observed in the text: as an example, the text “a
book” gives rise to the trigrams “a b”, “ bo”, “boo”, and
“ook”. For an arbitrary alphabet of m letters, there would be
(m + 1)^n n-grams to keep track of; in the case of English,
which uses an alphabet of 26 letters (plus Space) this means
keeping track of 27^3 = 19,683 different trigram frequencies.
These numbers grow quickly as the window size n increases.
      </p>
      <p>In this experiment, each character was given a randomly
generated index vector, and each window was represented
by a (componentwise) product of its letter vectors permuted
according to their place in the window. For example, the
trigram “boo” (from “book”) was encoded into a window
vector as w = Π²(b) * Π(o) * o, where each letter stands for
its index vector. The text vector is then formed by adding
together the vectors for all the windows in the text:
text vector = ∑ w, summing over all windows w in the text.</p>
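The windowing scheme can be sketched as follows. The two miniature "language samples" are invented stand-ins for real training corpora such as EUROPARL, and rotation again stands in for the permutation Π:

```python
import numpy as np

rng = np.random.default_rng(3)
D = 2_000
letters = {}  # one random index vector per character, built lazily

def index_vector(ch):
    if ch not in letters:
        letters[ch] = rng.choice([-1, 1], size=D)
    return letters[ch]

def text_vector(text, n=3):
    # Sum of window vectors; each n-gram is the elementwise product
    # of its letter vectors, rotated by position in the window.
    tv = np.zeros(D)
    for i in range(len(text) - n + 1):
        w = np.ones(D)
        for j, ch in enumerate(text[i:i + n]):
            w = w * np.roll(index_vector(ch), n - 1 - j)
        tv = tv + w
    return tv

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented miniature "language vectors"; a real system would train
# on large known-language samples.
sv = text_vector("hej jag heter en bok om en bok")
en = text_vector("hello i am a book about a book")
unknown = text_vector("a book is a book")

# The unknown sample shares far more trigram windows with the
# English sample than with the Swedish one.
print(cosine(unknown, en) > cosine(unknown, sv))
```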
      <p>The text vector for a text is then compared to previously
similarly computed language vectors, using the cosine as a
similarity measure. The language whose language vector shows
the highest cosine to the text vector is assumed to be the
language the text sample is written in. In the experiment,
tri- and tetragrams yielded language identification accuracy
(using the EUROPARL corpus) of more than 97%.</p>
    </sec>
    <sec id="sec-13">
      <title>Lexical Semantic Space</title>
      <p>To build a word space model to represent distributional data
of words, a text sample is scanned one word at a time. Each
word in turn is the focus word of the scan, with a context
window of n preceding and succeeding words. When a word
appears for the first time, it is assigned a randomly generated
sparse index vector and an initially empty context vector
of equal dimensionality. The index vectors of the words in
the context window are added into the focus word’s context
vector. The resulting context vectors capture “bag-of-words”
semantics of the focus words, reflecting their similarity of use.
If the text sample is large enough, the similarity measure
captures meaningful intuitions of term similarity.</p>
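A minimal random-indexing sketch of this scanning procedure; the three-sentence toy corpus and all parameter choices are our own illustrations:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(4)
D = 2_000

def sparse_index():
    # Sparse random label: ten +1s and ten -1s, the rest 0s.
    v = np.zeros(D)
    idx = rng.choice(D, size=20, replace=False)
    v[idx[:10]] = 1
    v[idx[10:]] = -1
    return v

index = defaultdict(sparse_index)            # random label per word
context = defaultdict(lambda: np.zeros(D))   # accumulated context vectors

def scan(tokens, k=2):
    # Add the index vectors of the k words before and after each
    # focus word into the focus word's context vector.
    for i, focus in enumerate(tokens):
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j != i:
                context[focus] = context[focus] + index[tokens[j]]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy corpus: "herring" and "cheese" occur in the same contexts.
for sentence in ["we ate herring at the smorgasbord",
                 "we ate cheese at the smorgasbord",
                 "we saw clouds in the sky"]:
    scan(sentence.split())

print(round(cosine(context["herring"], context["cheese"]), 2))
print(round(cosine(context["herring"], context["sky"]), 2))
```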
      <p>
        The most important parameters for these models are word
weighting, to assess how notable an observation of some
word is, and the size of the context, which ranges from entire
“documents”, as in Latent Semantic Analysis [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], to
local contexts of the 2 to 3 closest words. A broader context
captures topical or associative similarity of words, whereas a
more local context captures paradigmatic or replaceability
relations such as synonymy or antonymy and syntagmatic
combinability relations such as attributive or predicative
relations. A window size of 2 or 3 words before and after a
focus word has been found to yield good results as measured
by synonym tests and other similar lexical semantic tasks
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and various methods for weighting the observed
occurrences have been tested to achieve on-line learning without
becoming too sensitive to infrequent anomalous occurrences.
      </p>
      <p>
        A step toward capturing structure has been taken by
treating the words before the focus word differently from the
words after. This is done by permuting their labels differently
before adding them into the context vector. This use of
permutation encodes information about the preceding context
and the succeeding context separately, but allows both to
be used in an aggregate similarity measure for a word. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
Several current neural network models use similar insights
to represent sequential information.
      </p>
    </sec>
    <sec id="sec-14">
      <title>Structure in linguistic data</title>
      <p>The context vectors computed in random indexing reflect
semantic similarity of words but do not easily allow for the
inclusion of structural characteristics of an utterance, beyond
what is observable from lexical statistics.</p>
      <p>
        The operators described in Section 2 can be understood
as a Vector Symbolic Architecture [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and can be used to
conveniently represent the many levels of abstraction in
linguistic data. Suggestions to combine data in a tensor model
[16, e.g.] are to some extent similar with respect to
representational adequacy and power but are computationally
much more laborious. A feature of interest can be included in
the representation by overlaying it using addition and
separated from others through permutation or multiplication,
depending on how retrievable it is intended to be. A feature
of interest can be included as a combination of other features,
or independently of others.
      </p>
      <p>Posit an utterance such as the one given in Example (1).
(1)</p>
      <p>Dogs chew bones.</p>
      <p>This can be represented—as in typical search engine applications—
as a combination of vectors ¯x representing the lexical
items in play. Those representations can be random index
vectors or context vectors or word embeddings obtained from
previous analyses. The representation of the utterance can
then most simply be achieved by adding the
vectors for each participating lexical item together, as in
Equation 1.</p>
      <p>¯dogs + ¯chew + ¯bones
(1)</p>
      <p>Of course, in most practical applications, some
informationally relevant scalar weighting of the items might be motivated.
w_dogs · ¯dogs + w_chew · ¯chew + w_bones · ¯bones
(2)</p>
      <p>This simple lexical representation can easily be extended
to accommodate more elaborate information. Observing that
the utterance in question is in the present tense, we can simply
add that information in, by adding a vector which is
randomly generated to represent that observable feature of the
utterance. To collect tense information and keep it separate
and separately retrievable from the lexical information, it
can be permuted with a designated, also randomly generated,
permutation for tense-related information. This information
can be added in and later retrieved by similarity matching
of vectors, using the Π_tense permutation as a filter.
w_dogs · ¯dogs + w_chew · ¯chew + w_bones · ¯bones + Π_tense ¯present (3)</p>
      <p>If it interests us to keep track of qualities of referents, we
might want to add information about the morphological state
of the word in question or the character of some of those
referents. In Equation 4 we could e.g. note that the referent
of the agent of the clause (“dogs”) is animate and that this
referent is in the plural number. This information can be added
in and later retrieved by similarity matching of vectors, using
a dedicated permutation such as Π_agent and vectors such as
¯animate and ¯plural together with ¯dogs, the representation
of dogs.
w_dogs · ¯dogs + w_chew · ¯chew + w_bones · ¯bones +
Π_tense ¯present + Π_agent(¯animate + ¯plural + ¯dogs)
(4)</p>
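The overlay-and-filter scheme of Equations (3) and (4) can be sketched as follows, with designated rotations standing in for the tense and agent permutations (all vector and permutation names are illustrative assumptions, and the weights are left out):

```python
import numpy as np

rng = np.random.default_rng(5)
D = 2_000
seed = lambda: rng.choice([-1, 1], size=D)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Random vectors for lexical items and for the extra features.
dogs, chew, bones = seed(), seed(), seed()
present, animate, plural = seed(), seed(), seed()

# Designated rotations stand in for the permutations of Eq. (3)-(4).
tense = lambda v: np.roll(v, 7)
agent = lambda v: np.roll(v, 31)

# Unweighted version of Equation (4).
u = dogs + chew + bones + tense(present) + agent(animate + plural + dogs)

# Lexical content is still directly retrievable from the overlay,
# and tense is retrieved through the inverse rotation as a filter.
print(round(cosine(u, dogs), 2))
print(round(cosine(np.roll(u, -7), present), 2))
```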
      <p>This principle enables the representation to harbour many
types of information regardless of how it has been generated
or extracted from an utterance, but all aggregated in the
same hyperdimensional space to be used for downstream
processing in ways that the representation does not need to
take into account at perception time.
</p>
    </sec>
    <sec id="sec-15">
      <title>QUANTITATIVE CHARACTERISTICS</title>
      <p>The general task for a knowledge representation is to provide
features with which to represent observed characteristics of
interest of some situation in the world, to aggregate such
observable features of some situation of interest into a
represented state, to allow further processes to verify whether an
observation has been recorded in that state, and to decompose
a state representation into the separate features which
compose it.</p>
      <p>The advantage of using a holographic rather than a localist
model is that for a d-dimensional representation, where a
one-hot model allows for d features with lossless
aggregation and retrieval from a state, the variation space of the
hyperdimensional approach affords, by virtue of the random
patterns, permutations, and multiplications, a vastly larger
feature palette. This makes it possible, as shown above in
Section 3, for an entire vocabulary and its
cooccurrence statistics to be handily accommodated in a
2,000-dimensional space.</p>
      <p>The aggregation of features into a state is done simply by
vector addition, occasionally using a permutation to separate
aspects of a feature. Verifying whether a feature is present in a state
vector is done most simply by a dot product or a cosine
similarity measure. The choice of d, the dimensionality of the
representation, determines the capacity of the space. As can
be expected, a larger dimensionality allows greater capacity:
a 100-dimensional space can store less information, i.e. fewer
distinct features, for each state than a 2,000-dimensional one
does. If we wish to aggregate n (near-)orthogonal features
by addition into a state vector, their relative cosine
to the resulting state vector will be √(1/n). The expected
size of n determines how large d must be chosen to be to
ensure that the cosine is at a safe margin from the noise
threshold occasioned by the randomisation procedure. If a
state vector is expected to hold on the order of 100 unweighted
feature vectors, the resulting relative cosine between each
feature vector and the state vector will be 0.1 on average. In
a 1,000-dimensional space, this is about three to four times
the noise threshold; in a 2,000-dimensional space, about five
to six times the noise threshold. The graphs in Figure 1
show how the noise threshold compares across some typical
dimensionality settings.</p>
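The √(1/n) relation is easy to verify empirically. The sketch below (trial count and seeding are our own choices) estimates the average cosine between one feature vector and a state vector holding n = 100 superposed random features:

```python
import numpy as np

rng = np.random.default_rng(6)

def mean_cosine(d, n, trials=20):
    # Average cosine between one feature vector and a state vector
    # holding n superposed random +/-1 features, in d dimensions.
    total = 0.0
    for _ in range(trials):
        feats = rng.choice([-1, 1], size=(n, d))
        state = feats.sum(axis=0)
        total += float(feats[0] @ state /
                       (np.linalg.norm(feats[0]) * np.linalg.norm(state)))
    return total / trials

# With n = 100 features the cosine should come out near
# sqrt(1/100) = 0.1, largely independent of the dimensionality d.
print(round(mean_cosine(2_000, 100), 2))
print(round(mean_cosine(1_000, 100), 2))
```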
      <p>The size of the representation does not grow with the
number of feature vectors aggregated into the state vector, except
in the density of the state vector. Neither does the number of
potential features — the size of the lexicon and the combined
size of all potentially interesting constructions — occasion
more than linear growth for the system in its entirety.</p>
    </sec>
    <sec id="sec-17">
      <title>EMPIRICAL VALIDATION</title>
      <p>We intend to validate our approaches by applying selected
technologies to tasks which require understanding of
linguistic content. Typical tasks we expect to address are more
advanced language understanding tasks such as authorship
and genre identification, author profiling, attitude and
sentiment analysis, viewpoint analysis, and topic detection and
tracking, as well as some more theory-internal tasks such as
semantic role labeling. The tasks we are interested in, and where
holographic representations are most obviously useful, are
those where a broad range of features is necessary to perform
well, and where performance is constrained by the
challenge of incorporating information on several levels of
abstraction simultaneously in an integrated processing model.</p>
      <p>Experimentation with this model is currently
focussed on authorship profiling, which relies on features
beyond the lexical, and on tracking social media posts on
meteorological events, which are highly temporally determined
and locational in nature.</p>
      <p>
        A simple sentence-level task demonstrated here is that
of question classification: assigning a category to a question
which will impose a semantic constraint on the answer.
Example categories are “Human, individual”, “Number, date”,
“Location, city” [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This task forms a basis for many
question answering systems, and a fair amount of effort has been
put into optimising performance on it. The most important
features for this task, as for many language tasks, are lexical:
the words “How long” in the question “How long is the Coney
Island boardwalk?” indicate that the expected answer is a
number, such as a distance or a time period. Even fairly
naive word frequency methods achieve around 50% accuracy,
and with some multi-word terms and dependencies added,
optimised systems yield at least 80% accuracy.
      </p>
      <p>We do not attempt here to push the envelope for this data
set — this task is today considered to be solved and
superseded by more elaborate tasks in question answering. We use
this to demonstrate how our representation allows the
addition of more information in a fixed dimensionality without
perturbing simpler models in the same representation.</p>
      <p>We used a set of 5,500 previously labelled questions to
train a model which incorporates all words in the model
together with some semantic roles those words enter into. We
selected a small set of roles which we expect to be of interest
for the purpose of inferring the target entity of the question:
the question word itself (Who, How, What, etc.), the tense
of the question main verb, the subject of the main verb, any
other clause adverbial, and the binary feature of negation or
not. The labels were given on two levels of granularity,
coarse-grained (“Human”, “Location”, “Number”, “Description”,
“Entity”, “Abbreviation”) and fine-grained (47 subcategories
of the coarse-grained categories). In one condition, each label
was given a state vector with each of the words from every
question it has been applied to; in another, semantic roles
were added together with the words that participated in
them, in each case permuted for the role in question; in a
third condition both were added together. A test set of 500
questions was then used to evaluate the models, by building
similar vectors for each question and selecting the label with
the greatest cosine similarity to it as a candidate label. This
was done as follows.</p>
      <p>The state vector of a question is composed additively
of the bag of words of surface tokens, and, for each role of
interest, the lemma form of the word permuted by a
role-specific permutation. Each category is represented by the sum
of all state vectors for questions in the training set labeled
with that category. Each test item is then used in three
conditions: a state vector formed from only bags of words,
one with only roles, and one with both. These are then matched
to the category vectors, and the category whose vector shows
the closest cosine distance is used for evaluation. This was
done for the three conditions: lexical items only, semantic
roles only, and the combination of the two. The lexical-items-only
condition was tested both on a semantic space
built using lexical items only and on one using both lexical items
and semantic roles. Table 1 shows the results, most notably
that the disturbance from the more elaborate model does not
appreciably change the results from the simpler model: in
this example only a slight reranking of three questions in the
output for the lexical model caused it to lose 0.6 percentage
points of precision, while allowing the more sophisticated
representations to coexist holographically with it.</p>
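A drastically simplified sketch of this classification procedure, with a handful of invented toy questions standing in for the 5,500 labelled training questions, two of the roles, and rotations standing in for the role-specific permutations:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(7)
D = 2_000
index = defaultdict(lambda: rng.choice([-1, 1], size=D))
role_shift = {"qword": 11, "subject": 23}  # role-specific rotations

def question_vector(tokens, roles):
    # Bag of surface tokens plus role-permuted lemmas.
    v = np.sum([index[t] for t in tokens], axis=0)
    for role, lemma in roles.items():
        v = v + np.roll(index[lemma], role_shift[role])
    return v

# Invented toy items standing in for the labelled training questions.
train = [
    (("who", "wrote", "hamlet"), {"qword": "who", "subject": "hamlet"}, "Human"),
    (("who", "painted", "guernica"), {"qword": "who", "subject": "guernica"}, "Human"),
    (("how", "long", "is", "the", "nile"), {"qword": "how", "subject": "nile"}, "Number"),
    (("how", "far", "is", "the", "moon"), {"qword": "how", "subject": "moon"}, "Number"),
]

# Each category vector is the sum of its questions' state vectors.
category = defaultdict(lambda: np.zeros(D))
for tokens, roles, label in train:
    category[label] = category[label] + question_vector(tokens, roles)

def classify(tokens, roles):
    # Pick the category vector with the highest cosine to the question.
    q = question_vector(tokens, roles)
    return max(category, key=lambda c: float(q @ category[c]) /
               (np.linalg.norm(q) * np.linalg.norm(category[c])))

print(classify(("how", "tall", "is", "everest"),
               {"qword": "how", "subject": "everest"}))
```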
    </sec>
    <sec id="sec-18">
      <title>CONCLUSIONS AND RELATION TO OTHER APPROACHES</title>
      <p>Human language has a large and varying number of
features, both lexical items and constructions, which interact to
represent referential and relational content, the communicative
intentions of the speaker, situational references, discourse
structure, and much else. A hyperdimensional
representation is eminently suitable for representing language: it can
aggregate and handle the wealth of linguistic data and its
range of individually weak features. It also seamlessly
accommodates information from outside the linguistic signal
in the same representation.
</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Question classification results: average rank of the correct answer and percentage of items with the correct answer at rank 1.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th />
              <th colspan="2">coarse-grained (6 categories)</th>
              <th colspan="2">fine-grained (47 categories)</th>
            </tr>
            <tr>
              <th />
              <th>av rank of correct label</th>
              <th>accuracy %</th>
              <th>av rank of correct label</th>
              <th>accuracy %</th>
            </tr>
          </thead>
          <tbody>
            <tr><td colspan="5"><italic>Words-only space</italic></td></tr>
            <tr><td>words only</td><td>1.91</td><td>60.2</td><td>4.84</td><td>57.4</td></tr>
            <tr><td colspan="5"><italic>Combined semantic space</italic></td></tr>
            <tr><td>words only</td><td>1.92</td><td>60.2</td><td>5.88</td><td>56.8</td></tr>
            <tr><td>semantic roles</td><td>1.91</td><td>58.0</td><td>6.78</td><td>60.2</td></tr>
            <tr><td>roles + words</td><td>1.85</td><td>66.2</td><td>4.64</td><td>62.4</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>A hyperdimensional representation can work in
conjunction with other representations: it can act as a bridge between
symbolic and continuous models by accepting symbolic data,
and thus serve as an encoder for e.g. neural models,
embedding-based models, or other approximative
classification schemes. It allows the seamless and explicit addition
to, and recovery from, a representation of arbitrarily complex
and abstract features. A model with an accessible symbolic
representation together with continuous data allows choice
and preference to be represented simultaneously, and allows
objective functions for learning to be expressed at levels
of abstraction salient and relevant to the task at hand.</p>
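      <p>The addition of a complex feature to a representation, and its later recovery, can be made concrete with a small sketch. The lexicon, the single "agent" role, and the sentence are assumptions for illustration only: a role-bound filler is superposed onto a state vector and recovered by inverting the role permutation and probing the lexicon.</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 10_000
# hypothetical mini-lexicon of bipolar random index vectors
words = {w: rng.choice([-1, 1], size=DIM) for w in ["dog", "cat", "bites", "man"]}

agent_perm = rng.permutation(DIM)   # binds a filler to the agent role
inverse = np.argsort(agent_perm)    # the permutation that undoes it

# add an abstract feature to the representation: "dog" as agent,
# superposed with the plain bag of words for "dog bites man"
state = words["dog"] + words["bites"] + words["man"] + words["dog"][agent_perm]

# recover it again: invert the role permutation, then take the
# lexicon entry that best matches the resulting probe vector
probe = state[inverse]
recovered = max(words, key=lambda w: int(words[w] @ probe))
```
      </preformat>
      <p>Only the role-bound copy of "dog" survives the inverse permutation intact; the unpermuted bag of words is scrambled into near-orthogonal noise, so the probe matches "dog" by a wide margin.</p>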
      <p>In addition, the memory footprint of a hyperdimensional
model based on random indexing is manageable: an overly
large amount of data leads to saturation of the model and a
graceful degradation of performance, not to memory overflow.
We expect models based on this approach to be used not only
for experimentation but, since their design is principled and
based on generality and accessibility, also for application
purposes.</p>
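      <p>This graceful degradation can be sketched under assumed parameters (10,000 dimensions, bipolar random vectors): superposing ever more items into a single trace leaves the memory footprint constant, while the similarity of any one stored item to the trace decays smoothly, roughly as 1/sqrt(n), rather than failing abruptly.</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(2)
DIM = 10_000  # the trace is always DIM numbers, independent of load

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def stored_item_similarity(n):
    """Superpose n random bipolar item vectors into one trace and
    return the cosine similarity of the first item to the trace."""
    items = [rng.choice([-1, 1], size=DIM) for _ in range(n)]
    trace = np.sum(items, axis=0)  # fixed-size memory, whatever n is
    return cosine(items[0], trace)

# similarity shrinks smoothly as the model saturates: ~1/sqrt(n)
sim_10 = stored_item_similarity(10)
sim_100 = stored_item_similarity(100)
sim_1000 = stored_item_similarity(1000)
```
      </preformat>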
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>William</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Radical and typological arguments for radical construction grammar</article-title>
          . In Construction Grammars: Cognitive grounding and theoretical extensions, Jan-Ola Östman
          and Mirjam Fried (Eds.).
          <source>John Benjamins</source>
          , Amsterdam.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Ross W.</given-names>
            <surname>Gayler</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Vector symbolic architectures answer Jackendoff's challenges for cognitive neuroscience</article-title>
          .
          <source>arXiv preprint cs/0412059</source>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Zellig</given-names>
            <surname>Harris</surname>
          </string-name>
          .
          <year>1968</year>
          .
          <article-title>Mathematical structures of language</article-title>
          . Interscience Publishers.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Aditya</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Johan T.</given-names>
            <surname>Halseth</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Pentti</given-names>
            <surname>Kanerva</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Language geometry using random indexing</article-title>
          .
          <source>In International Symposium on Quantum Interaction</source>
          . Springer,
          <fpage>265</fpage>
          -
          <lpage>274</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Pentti</given-names>
            <surname>Kanerva</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Hyperdimensional computing: An introduction to computing in distributed representation with highdimensional random vectors</article-title>
          .
          <source>Cognitive Computation 1</source>
          ,
          <issue>2</issue>
          (
          <year>2009</year>
          ),
          <fpage>139</fpage>
          -
          <lpage>159</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Pentti</given-names>
            <surname>Kanerva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jan</given-names>
            <surname>Kristoferson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Anders</given-names>
            <surname>Holst</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Random indexing of text samples for latent semantic analysis</article-title>
          .
          <source>In Proceedings of the Cognitive Science Society</source>
          , Vol.
          <volume>1</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Thomas K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Susan T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge</article-title>
          .
          <source>Psychological review 104</source>
          ,
          <issue>2</issue>
          (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Xin</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Roth</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Learning question classifiers</article-title>
          .
          <source>In Proceedings of the 19th international conference on Computational linguistics (COLING)</source>
          .
          <source>International Committee for Computational Linguistics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Seppo</given-names>
            <surname>Mustonen</surname>
          </string-name>
          .
          <year>1965</year>
          .
          <article-title>Multiple discriminant analysis in linguistic problems</article-title>
          .
          <source>Statistical Methods in Linguistics 4</source>
          (
          <year>1965</year>
          ),
          <fpage>37</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Tony A.</given-names>
            <surname>Plate</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Holographic reduced representations</article-title>
          .
          <source>IEEE Transactions on Neural networks 6</source>
          ,
          <issue>3</issue>
          (
          <year>1995</year>
          ),
          <fpage>623</fpage>
          -
          <lpage>641</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Tony A.</given-names>
            <surname>Plate</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Holographic Reduced Representation: Distributed representation for cognitive structures</article-title>
          .
          <source>Number 150 in CSLI Lecture notes. CSLI Publications.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Magnus</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces</article-title>
          .
          <source>PhD Dissertation</source>
          . Department of Linguistics, Stockholm University.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Magnus</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>The distributional hypothesis</article-title>
          .
          <source>Rivista di Linguistica (Italian Journal of Linguistics)</source>
          <volume>20</volume>
          (
          <year>2008</year>
          ),
          <fpage>33</fpage>
          -
          <lpage>53</lpage>
          . Issue 1.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Magnus</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Anders</given-names>
            <surname>Holst</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Pentti</given-names>
            <surname>Kanerva</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Permutations as a means to encode order in word space</article-title>
          .
          <source>In The 30th Annual Meeting of the Cognitive Science Society (CogSci'08).</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Gerard</given-names>
            <surname>Salton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <year>1975</year>
          .
          <article-title>A vector space model for automatic indexing</article-title>
          .
          <source>Commun. ACM</source>
          <volume>18</volume>
          ,
          <issue>11</issue>
          (
          <year>1975</year>
          ),
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          . DOI:http://dx.doi.org/10.1145/361219.361220
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Fredrik</given-names>
            <surname>Sandin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Blerim</given-names>
            <surname>Emruli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Magnus</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Random indexing of multidimensional data</article-title>
          .
          <source>Knowledge and Information Systems</source>
          <volume>52</volume>
          ,
          <issue>1</issue>
          (
          <year>2017</year>
          ),
          <fpage>267</fpage>
          -
          <lpage>290</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Hinrich</given-names>
            <surname>Schütze</surname>
          </string-name>
          .
          <year>1993</year>
          .
          <article-title>Word space</article-title>
          .
          <source>In Proceedings of the 1993 Conference on Advances in Neural Information Processing Systems, NIPS'93</source>
          . Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
          <fpage>895</fpage>
          -
          <lpage>902</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>