<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Mapping Images to Psychological Similarity Spaces Using Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Institute of Cognitive Science</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Osnabruck University</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Osnabruck</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Germany lucas.bechberger@uni-osnabrueck.de</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computational Cognition Lab, Open University of Cyprus</institution>
          ,
          <addr-line>Nicosia</addr-line>
          ,
          <country country="CY">Cyprus</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The cognitive framework of conceptual spaces bridges the gap between symbolic and subsymbolic AI by proposing an intermediate conceptual layer where knowledge is represented geometrically. There are two main approaches for obtaining the dimensions of this conceptual similarity space: using similarity ratings from psychological experiments and using machine learning techniques. In this paper, we propose a combination of both approaches by using psychologically derived similarity ratings to constrain the machine learning process. This way, a mapping from stimuli to conceptual spaces can be learned that is both supported by psychological data and allows generalization to unseen stimuli. The results of a first feasibility study support our proposed approach.</p>
      </abstract>
      <kwd-group>
        <kwd>Conceptual Spaces</kwd>
        <kwd>Multidimensional Scaling</kwd>
        <kwd>Neural Networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The cognitive framework of conceptual spaces [
        <xref ref-type="bibr" rid="ref12 ref13">12,13</xref>
        ] attempts to bridge the gap
between symbolic and subsymbolic AI by proposing an intermediate conceptual
layer based on geometric representations. A conceptual space is a similarity
space spanned by a number of quality dimensions representing interpretable
features. Convex sets in this space correspond to concepts. Abstract symbols
can be grounded in perception by linking them to concepts in a conceptual
space whose dimensions are based on subsymbolic representations.
      </p>
      <p>
        The framework of conceptual spaces has been highly influential in the last
15 years within cognitive science [
        <xref ref-type="bibr" rid="ref10 ref11 ref21">10,11,21</xref>
        ]. It has also sparked considerable
research in various subfields of artificial intelligence, ranging from robotics and
computer vision [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to plausible reasoning [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>The dimensions of a conceptual similarity space are usually obtained based
on either psychological experiments or machine learning methods. The first
approach has a clear psychological grounding, but is only applicable to items for
which human similarity ratings are available. The second approach, in contrast,
can work on large amounts of unlabeled data, but lacks psychological validity.
In this paper, we propose a combination of both approaches that uses
psychologically derived similarity spaces as a target for machine learning algorithms.
This way, a mapping from stimuli to conceptual spaces can be found that is both
supported by psychological data and able to generalize to unseen stimuli. The
results of a first feasibility study support our proposed approach.</p>
      <p>The remainder of this paper is structured as follows: Section 2 introduces
the framework of conceptual spaces. Section 3 summarizes techniques for
deriving similarity spaces from psychological experiments and Section 4 describes
two recently proposed types of neural networks from the area of representation
learning. In Section 5, we formulate our proposal of combining artificial neural
networks with psychological data. Section 6 presents a first feasibility study for
our approach and Section 7 provides an outlook on future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Conceptual Spaces</title>
      <p>
        A conceptual space [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a similarity space spanned by so-called "quality
dimensions". Each of these dimensions represents an interpretable and cognitively
meaningful way in which two stimuli can be judged to be similar or different.
Examples of quality dimensions include temperature, weight, time, pitch, and
hue. A domain is a set of dimensions that inherently belong together. Different
perceptual modalities (like color, shape, or taste) are represented by different
domains. The color domain for instance can be represented by the three dimensions
hue, saturation, and brightness. Distance within a domain is measured by the
Euclidean metric. The overall conceptual space is defined as the product space of
all dimensions. Distance within the overall conceptual space is measured by the
Manhattan metric of the intra-domain distances. The similarity of two points in
a conceptual space is inversely related to their distance: the closer two instances
are in the conceptual space, the more similar they are considered to be.
      </p>
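      <p>The combination of the two metrics can be made concrete with a minimal Python sketch. It assumes that a point is given as a mapping from dimension names to values and that each domain is a list of dimension names; the domain layout, the names, and the coordinates are hypothetical, and any weighting of domains or dimensions is left out.</p>
      <preformat>
```python
import math

def domain_distance(x, y, dims):
    """Euclidean distance between x and y restricted to one domain's dimensions."""
    return math.sqrt(sum((x[d] - y[d]) ** 2 for d in dims))

def conceptual_distance(x, y, domains):
    """Manhattan combination (plain sum) of the intra-domain Euclidean distances."""
    return sum(domain_distance(x, y, dims) for dims in domains.values())

# Hypothetical layout: a three-dimensional color domain and a one-dimensional
# temperature domain. Names and coordinates are made up for illustration.
domains = {
    "color": ["hue", "saturation", "brightness"],
    "temperature": ["temperature"],
}
a = {"hue": 0.0, "saturation": 0.0, "brightness": 0.0, "temperature": 1.0}
b = {"hue": 3.0, "saturation": 4.0, "brightness": 0.0, "temperature": 3.0}

# 5.0 within the color domain plus 2.0 within the temperature domain.
distance = conceptual_distance(a, b, domains)
```
      </preformat>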
      <p>The framework distinguishes properties like "red", "round", and "sweet"
from full-fledged concepts like "apple" or "dog": Properties are represented as
convex sets within individual domains (e.g., color, shape, taste), whereas
full-fledged concepts span multiple domains. Reasoning within a conceptual space
can be done based on geometric relationships (e.g., betweenness and similarity)
and geometric operations (e.g., intersection and projection).</p>
      <p>In his book [12, Chapter 1.7, Chapter 6.5], Gärdenfors describes three ways
for identifying the dimensions spanning a conceptual space: Handcrafting,
multidimensional scaling, and machine learning.</p>
      <p>
        Handcrafting is only applicable to domains with a well-known structure such
as the temperature domain, which can be described by a single dimension and
measured with a single sensor. However, handcrafting is not possible for more
complex domains (e.g., shapes) that are based on complex sensors (e.g.,
cameras). We therefore exclude it from our further considerations.
      </p>
      <p>
        If handcrafting is not applicable, one can conduct a psychological experiment
to elicit human similarity ratings for pairs of stimuli and then apply
"multidimensional scaling" (MDS) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which is a well-known statistical technique used in
various psychological domains [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. MDS provides a geometric representation of
the stimulus set, where geometric distances between pairs of stimuli reflect their
psychological dissimilarity. We will further describe this approach for deriving a
conceptual space in Section 3.
      </p>
      <p>The third approach for obtaining the dimensions of a conceptual space is to
use machine learning techniques for dimensionality reduction. Gärdenfors argues
that raw perceptual input is too rich and too unstructured for direct processing.
It is thus necessary to lift the input to a more economic form of representation,
which typically involves a drastic reduction in the number of dimensions. There
are a variety of dimensionality reduction algorithms in the machine learning
field, among which artificial neural networks (ANNs) are a promising candidate
for implementing a multi-layered and non-linear dimensionality reduction. In
Section 4 we will introduce two recently proposed network architectures.</p>
    </sec>
    <sec id="sec-3">
      <title>Deriving Conceptual Spaces from Similarity Ratings</title>
      <p>Several data collection methods have been proposed for collecting similarity
judgments from human participants. The traditional technique is called
"dissimilarity ratings". In this approach, all possible pairs from a set of stimuli are
presented to participants (one pair at a time), and participants rate the
dissimilarity of each pair on a continuous or categorical scale. This method is however
quite inefficient, because the number of judgments required from each participant
grows quadratically with the number of stimuli.</p>
      <p>
        A faster alternative is the "sorting method", where participants are given the
whole set of stimuli and are asked to sort them into groups of similar stimuli.
There are two versions of the sorting method: In "constrained sorting" the
number of groups is predefined by the experimenters, whereas in "free sorting" the
number of groups is at the discretion of each participant. The resulting pairwise
ratings derived from the sorting method are binary: either the two stimuli
belong to the same group or not. Sorting methods, although faster, lack accuracy
compared to dissimilarity ratings [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. In a comparative study performed on a
set of 40 sound stimuli, dissimilarity ratings were found to be more reliable
and accurate than sorting methods, but less efficient [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
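      <p>To illustrate how binary sorting results can be turned into graded pairwise ratings, the following Python sketch aggregates the sortings of several participants into a dissimilarity matrix. Averaging the binary judgments across participants is an assumption made for this illustration, not a procedure taken from the cited studies, and the data are made up.</p>
      <preformat>
```python
def sorting_to_dissimilarity(sortings, n_stimuli):
    """Fraction of participants who placed each pair of stimuli in different groups."""
    matrix = [[0.0] * n_stimuli for _ in range(n_stimuli)]
    for groups in sortings:  # groups[i] is the group label one participant gave stimulus i
        for i in range(n_stimuli):
            for j in range(n_stimuli):
                if groups[i] != groups[j]:
                    matrix[i][j] += 1.0 / len(sortings)
    return matrix

# Made-up data: three participants sorting four stimuli.
sortings = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
dissimilarity = sorting_to_dissimilarity(sortings, n_stimuli=4)
```
      </preformat>
      <p>In this toy example, the first and the last stimulus are separated by every participant and therefore receive the maximal dissimilarity, whereas the first two stimuli are separated by only one of the three participants.</p>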
      <p>
        A more recent technique, named "Spatial Arrangement Method" (SpAM), was
proposed by Goldstone [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]: Participants are asked to rate similarity through
a spatial arrangement. The data set, or a part of it, is given in a random
configuration and participants have to rearrange the stimuli in a two-dimensional
space, such that the distance between each pair of stimuli reflects their
dissimilarity. SpAM is both efficient and user-friendly and has been found to produce
high-quality MDS solutions [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. However, the SpAM method is not widely
applicable, since it is constrained to visual and textual stimuli.
      </p>
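      <p>The step from a spatial arrangement to pairwise dissimilarities can be sketched in a few lines of Python. The sketch assumes that the final position of each stimulus is available as a pair of screen coordinates; the stimulus names and coordinates are made up.</p>
      <preformat>
```python
import math

def arrangement_to_dissimilarity(positions):
    """Pairwise Euclidean distances between the positions of the arranged stimuli."""
    names = sorted(positions)
    return {
        (a, b): math.dist(positions[a], positions[b])
        for a in names
        for b in names
        if a != b
    }

# Made-up screen coordinates for one participant and three stimuli.
positions = {"s1": (0.0, 0.0), "s2": (3.0, 4.0), "s3": (0.0, 1.0)}
dissimilarity = arrangement_to_dissimilarity(positions)
```
      </preformat>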
      <p>After having elicited psychological similarity ratings, MDS can be used to
convert them into a geometric representation which can be interpreted as a
conceptual space. MDS takes as input the pair-wise average similarity scores for
all pairs of stimuli and the desired number n of dimensions. It then represents
each stimulus as a point in an n-dimensional space and tries to arrange these
points in such a way that the distance of two points accurately reflects the
similarity rating of the stimuli they represent. The optimal number of dimensions
is usually determined by repeatedly running the MDS algorithm with di erent
values of n. The accuracy of each resulting space in representing the original
similarity ratings is calculated. One typically selects a relatively small value n
for which the accuracy is deemed good enough and where the accuracy of using
n + 1 dimensions is not considerably better.</p>
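      <p>The selection of n can be sketched as follows. The sketch assumes that the accuracy of a space is measured by Kruskal's Stress-1 (lower is better), which is a common choice but is not prescribed by the description above; the stress values and both thresholds are hypothetical.</p>
      <preformat>
```python
import math

def stress(dissimilarities, points):
    """Kruskal's Stress-1 between target dissimilarities and distances in the space."""
    num, den = 0.0, 0.0
    for (i, j), target in dissimilarities.items():
        fitted = math.dist(points[i], points[j])
        num += (target - fitted) ** 2
        den += fitted ** 2
    return math.sqrt(num / den)

def choose_dimensions(stress_by_n, good_enough, min_gain):
    """Smallest n whose stress is acceptable and not much improved by n + 1."""
    for n in sorted(stress_by_n):
        gain = stress_by_n[n] - stress_by_n.get(n + 1, stress_by_n[n])
        if good_enough >= stress_by_n[n] and min_gain >= gain:
            return n
    return max(stress_by_n)  # fall back to the largest space that was tried

# Made-up stress values from repeated MDS runs with n = 1, ..., 4.
stress_by_n = {1: 0.30, 2: 0.12, 3: 0.11, 4: 0.10}
n_selected = choose_dimensions(stress_by_n, good_enough=0.15, min_gain=0.02)
```
      </preformat>
      <p>With these made-up values, one dimension is rejected because its stress is too high, and two dimensions are selected because adding a third one improves the stress only marginally.</p>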
      <p>While providing an elegant way for deriving a conceptual space from
human similarity ratings, this approach does have some disadvantages: First of
all, there is no guarantee that the dimensions generated by the MDS algorithm
are interpretable, so one typically needs to search for interpretable directions in
the generated space. Moreover (and in our opinion more importantly), MDS is
only applicable to a fixed number of input stimuli. If a new, previously unseen
stimulus is perceived, MDS does not provide a mapping of this stimulus onto a
point in the similarity space.</p>
    </sec>
    <sec id="sec-4">
      <title>Artificial Neural Networks for Representation Learning</title>
      <p>
        In recent years, there has been substantial work on learning compressed
representations of a given feature space with neural networks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We focus our
discussion on two recently proposed approaches that have demonstrated the
potential to extract interpretable dimensions, namely β-VAE [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and InfoGAN [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>The left part of Figure 1 shows the typical structure of an autoencoder. The
overall network consists of two parts, an encoder and a decoder. The encoder
takes an input image and compresses it into a low-dimensional "hidden"
representation. This hidden representation is passed on to the decoder, which tries to
reconstruct the original image from it.</p>
      <p>
        While the general autoencoder structure has existed for decades, Kingma
and Welling [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] have recently proposed variational autoencoders (VAE) that
use Gaussian distributions as their hidden representation. The encoder predicts
two numbers for each entry of the hidden representation, which are then
interpreted as the mean and the variance of a Gaussian distribution. The input to the
decoder is sampled from this distribution. The loss term that the network tries
to minimize consists of two parts: (i) the reconstruction error, which measures
the difference between the original input image and its reconstruction, and (ii)
the regularization term, which measures the difference between a multivariate
Gaussian distribution with a diagonal covariance matrix and the actual
distribution of the hidden layer. The experimental results of Kingma and Welling
showed that this architecture is able to learn very good reconstructions of the
original input images.
      </p>
      <p>
        Higgins et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] have further modified this framework by giving a larger
weight to the regularization term than to the reconstruction error. Their
experiments showed that the resulting β-VAE network is able to extract meaningful
dimensions from unlabeled data sets because the network has a strong incentive
to learn uncorrelated hidden dimensions. However, due to the smaller emphasis
on the reconstruction error, this improved interpretability of the extracted
dimensions is gained by sacrificing reconstruction accuracy.
      </p>
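      <p>The two-part loss and its reweighting can be written down compactly. The following Python sketch assumes that the regularization term is the Kullback-Leibler divergence between the encoder's diagonal Gaussian and a standard normal distribution, which is the usual formulation of this objective; the function names, the weight name beta, and the numbers are chosen for illustration only.</p>
      <preformat>
```python
import math

def kl_to_standard_normal(means, variances):
    """KL divergence from N(means, diag(variances)) to the standard normal N(0, I)."""
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(means, variances))

def beta_vae_loss(reconstruction_error, means, variances, beta):
    """Reconstruction error plus beta times the regularization term."""
    return reconstruction_error + beta * kl_to_standard_normal(means, variances)

# beta = 1.0 gives the plain VAE objective; a beta above 1.0 puts more weight
# on the regularization term. All numbers are made up.
vae = beta_vae_loss(2.0, means=[1.0, 0.0], variances=[1.0, 1.0], beta=1.0)
beta_vae = beta_vae_loss(2.0, means=[1.0, 0.0], variances=[1.0, 1.0], beta=4.0)
```
      </preformat>
      <p>For the same encoder output, the larger weight increases the penalty for deviating from the prior, which is the mechanism behind the trade-off between interpretability and reconstruction accuracy described above.</p>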
      <p>
        InfoGAN [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] (depicted in the right part of Figure 1) is based on the
framework of generative adversarial networks (GAN) and consists of two parts, the
generator and the discriminator. The generator is fed with two low-dimensional
vectors of noise values. Its task is to create high-dimensional data vectors that
have a similar distribution as real data vectors taken from an unlabeled training
set. The discriminator receives a data vector that was either created by the
generator or taken from the training set (e.g., images). Its task is to distinguish real
inputs from generated inputs and to reconstruct one of the noise vectors (the
so-called "latent" vector). The overall architecture can be interpreted as a
two-player game: The generator tries to fool the discriminator by creating realistic
images and the discriminator tries to avoid being fooled by the generator. Chen
et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] showed that after training an InfoGAN, the latent variables tend to
have an interpretable meaning. For instance, in an experiment using the MNIST
data set, the latent variables corresponded to type of digit, digit rotation and
stroke thickness. In practice, however, InfoGAN and other GAN variants are
relatively difficult to train [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Both of these recent approaches have shown promising early results in
extracting meaningful dimensions from unlabeled data. They can both generalize
to unseen inputs and generate example images from a given hidden/latent
representation. However, choosing the correct size of the hidden/latent representation
is crucial and typically requires either good prior knowledge of the domain or
extensive manual optimization. Neural networks in general require large amounts
of training data, but as both presented approaches only need unlabeled data,
this is not critical in practice. Finally, while the extracted dimensions might be
useful from an AI perspective, they cannot claim any psychological validity.</p>
    </sec>
    <sec id="sec-5">
      <title>Our Proposal</title>
      <p>As we have seen in Sections 3 and 4, both "traditional" approaches for
obtaining the dimensions of a conceptual space have their individual strengths and
weaknesses. We are not aware of an approach that can claim both to be
psychologically valid and to generalize well to unseen stimuli. By combining MDS with
neural networks, one can potentially obtain such an approach. We will focus our
subsequent discussion on stimuli in the form of images.</p>
      <sec id="sec-5-1">
        <title>Proposed Procedure</title>
        <p>
          After having determined the domain of interest (e.g., the domain of shapes), one
first needs to acquire a data set of stimuli from this domain (e.g., ShapeNet [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]).
This data set should cover a wide variety of stimuli and it should be large enough
for applying machine learning algorithms. Using the whole data set with
thousands of stimuli in a psychological experiment is infeasible in practice. Therefore,
a relatively small, but still sufficiently representative subset of these stimuli needs
to be selected for the elicitation of human similarity ratings.
        </p>
        <p>
          This subset of stimuli is then used in a psychological study where similarity
judgments by humans are obtained, using one of the techniques described in
Section 3. The choice of the data collection method usually depends on the type
of stimuli, the size of the data set and the aims of the study. Choosing the right
method is crucial for the quality of the resulting space [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ].
        </p>
        <p>One can now apply MDS to the collected similarity judgments in order to
extract a spatial representation of the underlying domain. As stated in Section
3, one needs to select the optimal number of dimensions, either based on prior
knowledge or by manually optimizing the trade-off between high representational
accuracy and a low number of dimensions. From a machine learning perspective,
the stimulus-point mappings obtained from MDS can be interpreted as labeled
training instances: The stimulus is identified with the input vector and the point
in the MDS space is used as its label.</p>
        <p>Now an ANN can be trained to map from stimuli (e.g., images) to points in
the MDS space. As the number of stimulus-point pairs is quite low for a machine
learning problem, one needs to introduce an additional training objective (e.g.,
minimizing the reconstruction error on the full data set) to avoid overfitting and
to ensure that the network is able to generalize to unseen inputs.</p>
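        <p>One way to combine the two training objectives is a weighted sum, sketched below in Python. The sketch assumes that per-image errors have already been computed by the network; the function names, the weight mapping_weight, and the numbers are hypothetical and only illustrate how labeled and unlabeled images contribute to the loss.</p>
        <preformat>
```python
def mean(values):
    return sum(values) / len(values)

def combined_loss(reconstruction_errors, mapping_errors, mapping_weight):
    """Reconstruction on every image plus weighted mapping error on the rated subset."""
    loss = mean(reconstruction_errors)
    if mapping_errors:  # only images from the psychological study have an MDS label
        loss += mapping_weight * mean(mapping_errors)
    return loss

# Made-up per-image errors: four images in a batch, one of which has an MDS label.
loss = combined_loss([0.2, 0.4, 0.1, 0.3], [0.5], mapping_weight=2.0)
```
        </preformat>
        <p>Images without an MDS label contribute only to the reconstruction term, so the full data set constrains the network even though only a small subset carries psychological labels.</p>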
        <p>If training the ANN is successful, then it has learned a mapping from stimuli
to points in the conceptual space. While MDS only maps from a fixed set of
stimuli to points in the conceptual space, the ANN is expected to generalize this
mapping to previously unseen stimuli. Moreover, as psychologically derived data
was used to train the network, it can claim at least some psychological validity.</p>
        <p>Figure 2 illustrates three potential network architectures for image-point
mappings, where the dashed grey lines illustrate the reconstruction constraints used
in the training procedure. In all cases, the respective network is trained on a
secondary task for all images from the full data set and on the mapping task
only for the images that were used in the psychological study. Using a secondary
task with additional training data constrains the network's weights and can be
seen as a form of regularization: These additional constraints are expected to
counteract overfitting tendencies, i.e., tendencies to memorize all given mapping
examples without being able to generalize.</p>
        <p>The left part of Figure 2 shows a standard feed-forward network which is
supported by a secondary task of predicting the correct classes. This approach
is however only applicable if the data set contains class labels.</p>
        <p>The middle part of Figure 2 shows the network structure of an autoencoder:
The network is trained to minimize the difference between the input images
and their reconstruction, while being encouraged to use the MDS space in its
bottleneck layer. As the computation of the reconstruction error does not need
any class labels, this is applicable also to unlabeled data sets. Of course, instead
of using a regular autoencoder, one can also employ VAE or β-VAE. The
autoencoder structure has the additional advantage that one can use the decoder
network to generate an image based on a point in the conceptual space.</p>
        <p>
          The right part of Figure 2 shows an extended version of the InfoGAN
framework, where the discriminator network is also constrained to correctly extract
the MDS points from the images. This is in some sense reminiscent of the
AC-GAN network [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] which however predicts categorical class labels instead of
continuous points in a semantic space. This network architecture is also capable
of generating images based on a given point in the conceptual space.
        </p>
        <p>Instead of learning a direct mapping from images to points in a conceptual
space, one can also train a network to predict the similarity ratings for pairs
of images. In this case, the neural network receives two images as input and
predicts their similarity. The input dimensionality is twice as large in this case,
but there are also more training examples available. Again, an additional training
objective might be needed in order to prevent overfitting. In principle, the three
approaches discussed above can be adapted also to this learning problem. One
can use such a network to indirectly map an image onto a point in the conceptual
space as follows: The network predicts the similarity of the given image to a fixed
set of "anchor images" for which a mapping into the conceptual space is known.
One can then use the points representing these anchor images, together with the
predicted similarity ratings, to triangulate the point representing the new image.</p>
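        <p>The triangulation step can be sketched as a small optimization problem. The following Python sketch assumes that the predicted similarity ratings have already been converted into target distances (this conversion is not specified above) and uses plain gradient descent, which is only one of several possible solvers; the anchor coordinates, the step size, and the number of steps are hypothetical.</p>
        <preformat>
```python
import math

def triangulate(anchors, distances, steps=500, rate=0.1):
    """Gradient descent on the squared mismatch between actual and target distances."""
    dim = len(anchors[0])
    # Start at the centroid of the anchor points.
    point = [sum(a[k] for a in anchors) / len(anchors) for k in range(dim)]
    for _ in range(steps):
        grad = [0.0] * dim
        for anchor, target in zip(anchors, distances):
            current = max(math.dist(point, anchor), 1e-12)  # avoid division by zero
            for k in range(dim):
                grad[k] += 2.0 * (current - target) * (point[k] - anchor[k]) / current
        point = [p - rate * g for p, g in zip(point, grad)]
    return point

# Made-up anchors in a two-dimensional space; the target distances are computed
# from a known point (0.3, 0.4) so that the estimate can be checked against it.
anchors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
distances = [math.dist((0.3, 0.4), a) for a in anchors]
estimate = triangulate(anchors, distances)
```
        </preformat>
        <p>In practice the predicted distances will be noisy and mutually inconsistent, so the result is the point that fits them best in the least-squares sense rather than an exact intersection.</p>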
      </sec>
      <sec id="sec-5-2">
        <title>Discussion</title>
        <p>What are the advantages of our proposed approach? As stated before, we
expect to obtain a neural network that learns a mapping from input images to
a psychological similarity space. We thus get a mapping that is both
psychologically grounded and that generalizes to unseen inputs. Even if the mapping
performance of the neural network is not perfect, a rough guess for the point
representing a previously unseen stimulus might still be quite useful.</p>
        <p>In general, one can distinguish two types of similarity: While perceptual
similarity is exclusively based on the immediate perceptual features of the stimuli,
conceptual similarity involves additional sources of knowledge, e.g., the expected
use of a depicted object or the typical context in which it can be found.</p>
        <p>Neural networks that are trained on a task such as reconstruction are likely
to yield a hidden representation which encodes perceptual similarity. The
human similarity ratings retrieved in psychological experiments might however also
include conceptual similarity, and thus implicit features such as the object's
perceived softness. If a neural network is not only trained to reconstruct images,
but also to predict their positions in a psychological similarity space, it is
incentivized to also use such implicit features in its internal representations. In
some sense, these psychological constraints could thus help the network to "see
behind the image" and to represent not only perceptual, but also conceptual
features of the stimuli.</p>
        <p>If the approach described above yields promising and useful results, one could
use the observation that β-VAE and InfoGAN tend to discover meaningful
dimensions in order to devise a new MDS algorithm based on these networks.
This algorithm would train a standard β-VAE or InfoGAN network while
ensuring through an additional term in the loss function that the hidden space
extracted by the network accurately reflects the psychological similarity ratings.
This overall algorithm would thus result in a spatial representation of
psychological similarities which generalizes to unseen images, uses interpretable dimensions,
and can be used to generate new images based on points in the conceptual space.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Feasibility Study</title>
      <p>In order to validate whether our proposed approach is worth pursuing and
support our claims with some initial empirical results, we conducted a feasibility
study based on a simple setup: We used an existing data set of similarity ratings
for images of novel objects and we trained a linear regression on the hidden
activations of a pre-trained image classification network.<sup>3</sup></p>
      <sec id="sec-6-1">
        <title>Data Set</title>
        <p>
          For our feasibility study, we used existing similarity ratings reported for the
Novel Object and Unusual Name (NOUN) dataset [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], a set of 64 images of 3-D
objects that are designed to be novel but also look naturalistic. Participants of
Horst and Hout's study [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] were presented with 13 trials of 20 objects, randomly
assigned in a way such that all pairwise comparisons among the 64 images were
evaluated. Adopting the SpAM approach [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], on each trial, participants viewed
20 stimuli in a 4 × 5 matrix arrangement and had to rearrange them such that
the distance between each pair of images in the final configuration reflected
the dissimilarity of the two images. Based on the coordinates of each image in
the final configurations, an individual similarity matrix was calculated for each
participant. The similarity matrices of all participants were then averaged into
a global similarity matrix, using SPSS and the PROXSCAL algorithm.
        </p>
        <p>
          Using the similarity matrix from the study of Horst and Hout [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], we
performed a metric MDS by running the SMACOF algorithm four times and keeping
the best result, with a maximum of 300 iterations per run, using the
precomputed dissimilarity measure. In the work by Horst and Hout, a four-dimensional
space was found to be the best trade-off between a low-dimensional space and a
good reflection of the original similarity scores. We ran two additional variations
of the MDS, resulting in similarity spaces with two, four, and eight dimensions,
respectively. Figure 3 illustrates the two-dimensional space along with some
example images of the dataset.
        </p>
        <p>
          As the number of images is quite low for a machine learning problem, we
augmented the data set by applying horizontal flips, random crops, a Gaussian
blur, manipulation of the image's contrast and brightness, additive Gaussian
noise, affine transformations, and salt and pepper noise. For each of the original
64 images, we created 1,000 augmented versions, resulting in a data set of 64,000
images in total. We assigned the MDS point of the original image to each of the
1,000 augmented versions.
        </p>
        <p>
          <sup>3</sup> Code for reproducing this study can be found online at https://github.com/lbechberger/LearningPsychologicalSpaces/ [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>
          Due to the still limited variability of our training examples, we decided to use a
pre-trained network instead of training a new network from scratch. We used the
Inception-v3 network [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], which has been trained on ImageNet [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and which
has achieved a state-of-the-art top-5 error rate of 3.46% when classifying images
into one of 1000 classes. We used the activations of the second-to-last layer as
a 2048-dimensional feature vector and trained a linear regression to map from
this feature vector to the points in the MDS space. We assume that this feature
vector, although derived from a different task, still provides enough information
for predicting positions in the MDS space.
        </p>
        <p>Our network architecture is a special variant of the feed-forward network
proposed in Section 5.2: Instead of training both the mapping and the classification
task simultaneously, we use an already pre-trained network, keep its weights
fixed, and augment it by an additional output layer which is trained separately.</p>
        <p>We have implemented four baselines against which we compare our system:</p>
        <list list-type="bullet">
          <list-item><p>Zero: Always predict the origin, i.e., the (0, 0, ..., 0) point.</p></list-item>
          <list-item><p>Mean: Always predict the mean of all MDS points from the training set.</p></list-item>
          <list-item><p>Distribution: Draw a random sample from a Gaussian distribution which
was estimated from the MDS points in the training set.</p></list-item>
          <list-item><p>Random Draw: Use a randomly selected MDS point from the training set
as prediction.</p></list-item>
        </list>
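        <p>The four baselines can be sketched in a few lines of Python. The sketch assumes that the Gaussian of the distribution baseline is estimated independently for each dimension (a diagonal covariance), which is an assumption of this illustration; the function names and the training points are made up.</p>
        <preformat>
```python
import random
import statistics

def zero_baseline(train_points):
    return [0.0] * len(train_points[0])

def mean_baseline(train_points):
    return [statistics.fmean(column) for column in zip(*train_points)]

def distribution_baseline(train_points, rng):
    # Assumption: one independent Gaussian per dimension (diagonal covariance).
    return [
        rng.gauss(statistics.fmean(column), statistics.stdev(column))
        for column in zip(*train_points)
    ]

def random_draw_baseline(train_points, rng):
    return list(rng.choice(train_points))

# Made-up two-dimensional MDS points standing in for the training set.
train_points = [(0.0, 1.0), (2.0, 3.0), (4.0, -1.0)]
rng = random.Random(0)
predictions = {
    "zero": zero_baseline(train_points),
    "mean": mean_baseline(train_points),
    "distribution": distribution_baseline(train_points, rng),
    "random draw": random_draw_baseline(train_points, rng),
}
```
        </preformat>
        <p>None of the baselines looks at the input image, so any system that extracts useful information from the image should achieve a lower error.</p>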
        <p>We expect that our system is able to outperform all of these baselines. We
would however also like to investigate whether learning a mapping into an MDS
space is easier than learning a mapping into an arbitrary space of the same
dimensionality. Therefore, we trained and evaluated two versions of our system:
One of them was trained on the mappings from images to points derived by the
MDS. For the other version, we used the same MDS points, but we shuffled the
assignment of images to points. We ensured that all augmented images created
based on the same original image were still mapped onto the same MDS point.
With this shuffling procedure, we aimed to destroy any semantic structure
inherent in the original space. We expect that the regression works better for the
original mapping than for the shuffled mapping, as it contains more semantic
content and should therefore be easier to learn. By using a shuffled mapping
rather than a mapping to randomly chosen points in the space, we ensure that
the distribution of the target points remains constant. This is necessary in
order to make a meaningful comparison between the two regression configurations.</p>
        <p>[Table 1: RMSE per configuration (2D, 4D, 8D), each with a training and a
testing column. Only the following entries survive in this copy: Zero baseline 0.4408;
Mean baseline 0.4408; Distribution baseline 0.6287; Random draw baseline 0.6233.]</p>
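        <p>The shuffling procedure can be sketched as follows in Python. The sketch assumes that each augmented image is identified by a name and linked to the original image it was created from; all identifiers and coordinates are made up.</p>
        <preformat>
```python
import random

def shuffle_targets(point_by_original, rng):
    """Reassign the same MDS points to the original images in a random order."""
    originals = sorted(point_by_original)
    points = [point_by_original[name] for name in originals]
    rng.shuffle(points)
    return dict(zip(originals, points))

def label_augmented(original_of, point_by_original):
    """Every augmented image inherits the point of the original it was created from."""
    return {image: point_by_original[source] for image, source in original_of.items()}

# Made-up identifiers and coordinates.
point_by_original = {"obj1": (0.1, 0.2), "obj2": (0.5, -0.3), "obj3": (-0.4, 0.0)}
original_of = {"obj1_a": "obj1", "obj1_b": "obj1", "obj2_a": "obj2", "obj3_a": "obj3"}
shuffled = shuffle_targets(point_by_original, random.Random(42))
labels = label_augmented(original_of, shuffled)
```
        </preformat>
        <p>Because the points are permuted rather than redrawn, the set of target points is unchanged, and all augmented versions of one original image still share a single target. Note that a random permutation may leave some images assigned to their original point.</p>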
        <p>In order to evaluate both our system and the baselines, we used the root mean
squared error (RMSE), which is a standard metric for regression problems. The
RMSE has the same scale as the target space and can thus provide an
intuition about the system's performance if interpreted as the average distance
between the prediction and the ground truth. Due to the small number of original
images, we performed an image-based leave-one-out evaluation: All augmented
images generated from one of the original images were used as the test set, while all
other images were used to train the linear regression. This was repeated for each of
the original images, and the training and test errors were averaged across all of
these runs. As all augmented images based on the same original image can be
expected to still be quite similar to each other, this procedure ensures that the
system cannot simply "memorize" all 64 images.</p>
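        <p>The evaluation protocol can be illustrated with the following sketch. It makes several assumptions that are not fixed by the text: the RMSE is computed as the root of the mean squared Euclidean distance per image, the linear regression is an ordinary least-squares fit with a bias term, and all function names and the toy data (three features instead of the real feature vector) are hypothetical.</p>
        <preformat>
```python
import numpy as np

def rmse(predictions, targets):
    """Root of the mean squared Euclidean distance between rows (assumed definition)."""
    squared_distances = np.sum((predictions - targets) ** 2, axis=1)
    return float(np.sqrt(np.mean(squared_distances)))

def add_bias(features):
    """Append a constant column so the regression can learn an offset."""
    return np.hstack([features, np.ones((len(features), 1))])

def fit_linear_regression(features, targets):
    """Ordinary least-squares fit; returns the weight matrix including the bias row."""
    weights, *_ = np.linalg.lstsq(add_bias(features), targets, rcond=None)
    return weights

def leave_one_image_out(features, targets, original_ids):
    """Hold out all augmented copies of one original image per fold.

    Returns the training and test RMSE averaged over all folds.
    """
    original_ids = np.asarray(original_ids)
    train_errors, test_errors = [], []
    for held_out in np.unique(original_ids):
        test_mask = original_ids == held_out
        train_x, train_y = features[~test_mask], targets[~test_mask]
        test_x, test_y = features[test_mask], targets[test_mask]
        weights = fit_linear_regression(train_x, train_y)
        train_errors.append(rmse(add_bias(train_x) @ weights, train_y))
        test_errors.append(rmse(add_bias(test_x) @ weights, test_y))
    return float(np.mean(train_errors)), float(np.mean(test_errors))

# Toy data (hypothetical): 5 original images, 4 augmented copies each,
# 3 features per image, 2D targets that are an exact linear function of the features.
rng = np.random.default_rng(0)
original_ids = np.repeat(np.arange(5), 4)
features = rng.normal(size=(20, 3))
targets = features @ rng.normal(size=(3, 2)) + 0.5
train_rmse, test_rmse = leave_one_image_out(features, targets, original_ids)
```
        </preformat>
        <p>Because the toy targets are exactly linear in the features, both errors are close to zero here; with real image features, a gap between the two values would indicate overfitting.</p>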
      </sec>
      <sec id="sec-6-2">
        <title>Results</title>
        <p>Table 1 shows the results obtained in our experiment for the three MDS spaces
with two, four, and eight dimensions, respectively. Among the different
baselines, the "zero baseline" consistently has the best performance on the test data,
followed closely by the "mean baseline". The two other baselines perform
considerably worse. Notably, the regression always has a much lower RMSE than any
of the baselines, indicating that the system is indeed capable of learning a
mapping from images to points in a psychological space. As expected, the RMSE for
the correct mapping is always considerably lower than for the shuffled mapping.
However, the regression always performs significantly better on the training data
than on the test data, which is a clear sign of overfitting.</p>
        <p>Figure 4 illustrates the RMSE results during testing for the three different
similarity spaces. As one can see, the performance of the best baseline is largely
independent of the dimensionality of the similarity space. Although in theory
the regression problem becomes harder as the number of dimensions increases
(when predicting more coefficients of the output vector, one can make more
mistakes), both regression approaches perform better in higher-dimensional spaces.</p>
        <p>The results presented in Section 6.3 confirm our hypotheses: Our system is able
to perform better than the baselines, indicating that a mapping from images
to MDS points can be learned. Moreover, the observed performance difference
on the test set between learning the correct mapping and learning the shuffled
mapping indicates that the semantic structure of a similarity space facilitates
learning a generalizable mapping. The small difference on the training data
indicates that both learning problems have a similar difficulty.</p>
        <p>The regression also seems to benefit from a higher-dimensional MDS space.
As a higher-dimensional space derived with MDS can more accurately reflect the
original similarity ratings, this might be interpreted as the system learning finer
nuances of similarity if given the chance to do so. As all baselines have a similar
performance on all MDS spaces, the distribution of the points in the space seems
to have similar characteristics regardless of the number of dimensions. We
assume that the regression on the shuffled mapping can also benefit from a higher
number of dimensions, as the shuffling procedure may not be able to completely
destroy the structure of the similarity space: even after shuffling, the mapping
might still partially reflect the similarity ratings.</p>
        <p>These overall results are in line with our expectations and indicate that our
overall approach is sound and promising. The achieved performance is still far
from perfect, probably because of the observed overfitting tendencies. However,
the goal of this early feasibility study was not to aim for optimal performance,
but to evaluate the basic idea of our approach. We suspect that the
relatively large amount of overfitting is caused by the small set of original images
and the large size of the extracted feature vector. Although we applied a variety
of augmentation techniques to the original images, the resulting images might
still be too similar to each other. Moreover, as the feature vector contains 2048
entries, the linear regression has to estimate a large number of parameters.
Furthermore, as the network was pre-trained on a different data set with different
characteristics, we cannot expect a perfect generalization.</p>
        <p>
          Finally, one could argue that even humans might fail to achieve an RMSE
of approximately zero, as the given problem is inherently difficult due to the
nature of the data set used. The MDS space was created based on the average
similarity ratings obtained in Horst and Hout's [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] study using a set of novel
unknown images, for which the similarity judgments might differ a lot among
individuals.
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion and Future Work</title>
      <p>We have argued that the combination of neural networks with psychological
similarity judgments offers a promising way for extracting a conceptual space from
data. This can help to make the framework of conceptual spaces more viable for
artificial intelligence: Our proposed approach can potentially provide a
principled way of mapping sensory input to conceptual spaces, while still maintaining
some psychological validity. The results of our feasibility study are encouraging
and show that our approach is also feasible in practice.</p>
      <p>
        In future work, we will implement and evaluate our proposal in more depth by
exploring the remaining proposed network structures. We will conduct a
psychological study on a subset of ImageNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to investigate whether we can achieve
a better mapping performance if we use images from the same domain. The
data used in the current study has the shortcoming of not separating different
domains like color, shape, and size (as proposed by Gärdenfors), but of treating
them as a single space. In order to evaluate whether this difference impacts the
performance of our proposal, we will apply our approach to a single domain
(namely, shapes). Finally, we will also explore additional ways of evaluating the
mapping performance (e.g., by comparing to the original similarity ratings).
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Arjovsky</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Towards Principled Methods for Training Generative Adversarial Networks</article-title>
          .
          <source>In: ICLR</source>
          <year>2017</year>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bechberger</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kypridemou</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <source>lbechberger/LearningPsychologicalSpaces: v0.1</source>
          (Apr
          <year>2018</year>
          ), https://doi.org/10.5281/zenodo.1220053
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vincent</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Representation Learning: A Review and New Perspectives</article-title>
          .
          <source>IEEE TPAMI 35(8)</source>
          , 1798–1828 (Aug
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Borg</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Groenen</surname>
            ,
            <given-names>P.J.F.</given-names>
          </string-name>
          :
          <source>Modern Multidimensional Scaling: Theory and Applications</source>
          . Springer New York (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>A.X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Funkhouser</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guibas</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanrahan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savarese</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savva</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <source>ShapeNet: An Information-Rich 3D Model Repository. Tech. rep.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chella</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frixione</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaglio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Anchoring Symbols to Conceptual Spaces: The Case of Dynamic Scenarios</article-title>
          .
          <source>Robotics and Autonomous Systems</source>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Houthooft</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schulman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abbeel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets</article-title>
          .
          <source>In: NIPS</source>
          <year>2016</year>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>ImageNet: A Large-Scale Hierarchical Image Database</article-title>
          .
          <source>In: 2009 IEEE CVPR Conference</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Derrac</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schockaert</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Inducing Semantic Relations from Conceptual Spaces: A Data-Driven Approach to Plausible Reasoning</article-title>
          .
          <source>Artificial Intelligence</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Douven</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decock</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dietz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Égré</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Vagueness: A Conceptual Spaces Approach</article-title>
          .
          <source>Journal of Philosophical Logic</source>
          <volume>42</volume>
          (
          <issue>1</issue>
          ),
          <fpage>137</fpage>
          –
          <lpage>160</lpage>
          (Nov
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Fiorini</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          , Gardenfors,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Abel</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          :
          <article-title>Representing Part-Whole Relations in Conceptual Spaces</article-title>
          .
          <source>Cognitive Processing</source>
          <volume>15</volume>
          (
          <issue>2</issue>
          ),
          <fpage>127</fpage>
          –
          <lpage>142</lpage>
          (Oct
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Gardenfors, P.: Conceptual Spaces:
          <article-title>The Geometry of Thought</article-title>
          . MIT press (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Gardenfors, P.:
          <source>The Geometry of Meaning: Semantics Based on Conceptual Spaces</source>
          . MIT Press (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Giordano</surname>
            ,
            <given-names>B.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guastavino</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ogg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>B.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McAdams</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Comparison of Methods for Collecting and Modeling Dissimilarity Data: Applications to Complex Sound Stimuli</article-title>
          .
          <source>Multivar Behav Res</source>
          <volume>46</volume>
          (
          <issue>5</issue>
          ),
          <fpage>779</fpage>
          –
          <lpage>811</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Goldstone</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>An Efficient Method for Obtaining Similarity Data</article-title>
          .
          <source>Behavior Research Methods, Instruments, &amp; Computers</source>
          <volume>26</volume>
          (
          <issue>4</issue>
          ),
          <fpage>381</fpage>
          –
          <lpage>386</lpage>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Higgins</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matthey</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burgess</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glorot</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Botvinick</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohamed</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lerchner</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework</article-title>
          .
          <source>In: ICLR</source>
          <year>2017</year>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Horst</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hout</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          :
          <article-title>The Novel Object and Unusual Name (NOUN) Database: A Collection of Novel Images for Use in Experimental Research</article-title>
          .
          <source>Behavior Research Methods</source>
          <volume>48</volume>
          (
          <issue>4</issue>
          ),
          <fpage>1393</fpage>
          –
          <lpage>1409</lpage>
          (Dec
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Hout</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldinger</surname>
            ,
            <given-names>S.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferguson</surname>
            ,
            <given-names>R.W.</given-names>
          </string-name>
          :
          <article-title>The Versatility of SpAM: A Fast, Efficient, Spatial Method of Data Collection for Multidimensional Scaling</article-title>
          .
          <source>Journal of Experimental Psychology: General</source>
          <volume>142</volume>
          (
          <issue>1</issue>
          ),
          <fpage>256</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Jaworska</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chupetlovska-Anastasova</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A Review of Multidimensional Scaling (MDS) and its Utility in Various Psychological Domains</article-title>
          .
          <source>Tutorials in Quantitative Methods for Psychology</source>
          <volume>5</volume>
          (
          <issue>1</issue>
          ) (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welling</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Auto-Encoding Variational Bayes</article-title>
          .
          <source>arXiv preprint arXiv:1312.6114</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Lieto</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radicioni</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rho</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Dual PECCS: a cognitive system for conceptual representation and categorization</article-title>
          .
          <source>Journal of Experimental &amp; Theoretical Artificial Intelligence</source>
          <volume>29</volume>
          (
          <issue>2</issue>
          ),
          <fpage>433</fpage>
          –
          <lpage>452</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Odena</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olah</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Conditional Image Synthesis with Auxiliary Classifier GANs</article-title>
          .
          <source>In: ICML</source>
          <year>2017</year>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Subkoviak</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roecks</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A Closer Look at the Accuracy of Alternative Data Collection Methods for Multidimensional Scaling</article-title>
          .
          <source>Journal of Educational Measurement</source>
          <volume>13</volume>
          (
          <issue>4</issue>
          ),
          <fpage>309</fpage>
          –
          <lpage>317</lpage>
          (
          <year>1976</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ioffe</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wojna</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Rethinking the Inception Architecture for Computer Vision</article-title>
          .
          <source>In: Proceedings of the IEEE CVPR Conference</source>
          . pp.
          <fpage>2818</fpage>
          –
          <lpage>2826</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>