<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>IJCNN.</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1037//0278-7393.26.1.3</article-id>
      <title-group>
        <article-title>Reaction Time as an Indicator of Instance Typicality in Conceptual Spaces</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kypri</string-name>
          <email>elektra.kypridemou@st.ouc.ac.cy</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Loizos Mi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Open University of Cyprus</institution>
          ,
          <addr-line>Nicosia</addr-line>
          ,
          <country country="CY">Cyprus</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2009</year>
      </pub-date>
      <volume>5178760</volume>
      <fpage>3</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>In typical categorization tasks, humans are presented with a sequence of instances and report whether each instance is a member of a given category or not. In the current study, we examine the relationship between the reaction times (RTs) of human participants and the position of the instance in the conceptual space. Our main hypothesis is that instances closer to the boundary between the two categories, which are harder to categorize, require longer cognitive processing, resulting in longer RTs. Human subjects categorized images of novel objects into one of two given categories (represented by images of their prototypes); the selected category, RT, and confidence rating for each trial were recorded. On trials with longer RTs, people responded with less confidence and were more prone to making errors than on trials with shorter RTs. Moreover, people responded faster to stimuli with high similarity to at least one of the prototypes of the given categories than to stimuli that were distant from both prototypes, and hence closer to the boundary between the two categories, confirming our main hypothesis.</p>
      </abstract>
      <kwd-group>
        <kwd>Conceptual Spaces</kwd>
        <kwd>Categorical Perception</kwd>
        <kwd>Exemplars</kwd>
        <kwd>Prototypes</kwd>
        <kwd>Classification</kwd>
        <kwd>Categorization</kwd>
        <kwd>Reaction Time</kwd>
        <kwd>Confidence Rating</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In typical supervised and semi-supervised learning settings in machine learning,
human teachers are presented with a series of elements and are asked to report, for each
element, whether it is a member of a given category, usually by assigning a
positive or a negative label to the element. Similarly, in psychophysics experiments,
participants are given a series of stimuli and are asked to decide, for each stimulus, to
which of the given categories it belongs. In such experimental designs, even if the
task does not explicitly require a positive/negative label, the given categories are
usually two well-defined and complementary concepts, making the task analogous to the
supervised training setting of machine learning. For example, Graf and
Wichmann [5] implemented a gender categorization task with visual images of human
faces, in which participants had to categorize images as male or female. If we assume
that the two categories are complementary (i.e., each face is
either male or female), then the task is logically equivalent to assigning
positive and negative labels w.r.t. one of the categories (even if, psychometrically
speaking, changing the instructions of the task could alter the results).</p>
      <p>Certain previous approaches combined empirical psychophysics results with
machine learning, aiming at a better understanding of human categorization processes.
The present work moves in the opposite direction. Instead of using input from
machine learning techniques to explain experimental psychophysical results, our aim
is to examine how additional input coming from human teachers could improve
existing machine learning techniques.</p>
      <p>Existing predictive models of supervised and semi-supervised classification use
labels produced by human teachers as a training set to classify future observations. We
posit that considering reaction times (RTs) as an indicator of instance typicality in
conceptual spaces, and incorporating RTs in the training material of machine
learning procedures, could lead to better classification algorithms. As a first
step in this direction, in the current work we examine the relationship between
the RTs and the position of the instance to be categorized (target) in the conceptual
space. Given that (i) RTs have been found to provide a good approximation of the distance
between the element to be classified and the separating hyperplane (SH) [5–7], and that (ii) in experimental
settings it is easier to measure RTs than distances, which are internal representations of
human minds, we argue that “considering RTs in addition to the labels given by
human teachers in supervised and semi-supervised settings could potentially provide
valuable input for more efficient learning algorithms”. Specifically, based on
previous experimental results, we suggest that targets closer to the boundary between two
categories are harder to categorize, in the sense that they require longer cognitive
processing, which manifests as longer RTs.</p>
      <p>Although we are unaware of any previous work examining the above
hypothesis, there has been work that examined an analogous hypothesis using the
self-reported confidence of the users (confidence rating; CR). Ji and Lu [11] developed
SVMAC, a novel support vector machine with automatic confidence, which is found
to be significantly more accurate for gender classification than other traditional
algorithms. Conceivably, one could also consider additional information from the
teachers, including for example an explanation regarding their judgments, beyond their CR,
to gain even more quantitative input about their decisions. If the improvements of
such additional requirements are significant, then it might be worthwhile sacrificing
some of the teacher’s time for better machine learning performance.</p>
      <p>Unlike Ji and Lu’s [11] suggestion towards more efficient classification
algorithms, the approach we suggest does not require any extra effort or time by the
teachers, since the value of the RT is automatically recorded along with the teacher’s
response. As an extension of our approach, we could also consider other
passive sources of information as input for our algorithms. For
example, using eye tracking during an image categorization task, we
could track the visual processing of the stimuli. By examining the parts of the image
on which the eye fixates for longer periods, we could gain insight
into the features, or the parts of the images, that guided the teacher’s decision.
Combining quantitative results coming from RTs with qualitative information coming
from eye tracking could shed light on the cognitive
processing and the factors that guided the decision for each label.</p>
      <p>The experiment of this paper is part of a longer research path towards our goal. The
next step is to practically test whether the use of such additional input in the
implementation of learning algorithms accelerates the learning process and improves
the efficiency of the algorithms. The additional input could be used in
several ways. One way is to filter the responses based on certain criteria and exclude
from the training data the responses that do not meet them. For example,
one could exclude responses whose RTs are shorter than a minimum value (to
avoid instances selected without any processing of information), or ignore
responses whose RTs are longer than a maximum value (implying less typical
instances of a category). Another way is to apply
established techniques for using additional information, such as the
LUHI [18] and the LUPI [17] paradigms. In such techniques, the additional
information is provided only during the training phase and is not available during the
testing phase.</p>
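      <p>As a rough illustration of the filtering idea, the following Python sketch drops responses whose RT falls outside a chosen window. The <monospace>Response</monospace> type, the function name, and any threshold values are hypothetical and serve only as an example; the text does not prescribe specific cut-offs.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class Response:
    """One labelled trial from a human teacher (hypothetical structure)."""
    label: str    # category chosen by the teacher
    rt_ms: float  # reaction time in milliseconds


def filter_by_rt(responses, rt_min_ms, rt_max_ms):
    """Keep only responses whose RT lies within [rt_min_ms, rt_max_ms].

    Responses faster than rt_min_ms are treated as unprocessed guesses;
    responses slower than rt_max_ms are treated as less typical instances.
    Both thresholds are assumptions to be tuned for a given task.
    """
    if rt_min_ms > rt_max_ms:
        raise ValueError("rt_min_ms must not exceed rt_max_ms")
    return [r for r in responses if rt_min_ms <= r.rt_ms <= rt_max_ms]
```
]]></preformat>
      <p>With an assumed window of 200 to 5000 ms, for instance, responses with RTs of 120 ms or 7000 ms would be excluded from the training data, while a response with an RT of 900 ms would be kept.</p>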
      <p>In the following sections, we review current empirical work on
categorization and then introduce our experimental design and the stimuli we
used. We explain why we chose such a setting and point out
which methodological gaps of previous studies we are trying to fill. We then
describe the method in detail: the participants, the
materials, the experimental design, and the procedure. After
presenting and discussing the results, we summarize our findings and suggest the next
steps to be taken.</p>
    </sec>
    <sec id="sec-2">
      <title>Current Empirical Work on Categorization</title>
      <p>Previous studies [5, 6] have reported shorter RTs for correct responses than for
incorrect ones, indicating that people respond faster when their response is correct. There is
also experimental evidence [5, 6] that RTs are shorter for higher metacognitive judgments of
confidence, indicating that people respond faster when they are more
confident about their response. Moreover, participants in classification tasks were
found to have metacognitive abilities, since their self-reported CR is negatively
correlated with the classification error (CE; Eq. (1)); i.e., people are more confident about
their choice when their response is correct [5]. Altogether, the above findings imply
that longer RTs indicate classification cases in which people respond with less
confidence and are more prone to making errors. In other words, cases that are more
‘difficult’ for humans to classify require longer processing of information by the
human brain. But which cases are these ‘difficult’ ones?</p>
      <p>Taking this a step further, Graf and his colleagues [5–7] compared psychophysics
results to machine learning techniques in order to better understand human
classification processes. They asked human participants to classify images of human faces as
male or female, and correlated the human responses with the distance between the
stimuli and the separating hyperplane (SH), as provided by several learning
algorithms. They found that people are more accurate, respond faster, and report
higher confidence in their judgments when they classify faces that are farther
from the SH than when they classify faces closer to the SH. Hence, one could argue that the
‘difficult’ cases are the stimuli closer to the SH, while stimuli that
are farther from the SH are classified more easily.</p>
      <p>However, using the above experimental designs, human responses about categories
(as well as the corresponding RTs and CRs) might be affected by (i) participants’
prior knowledge and personal interpretation of the given categories, and (ii)
previously presented stimuli from the same category acting as exemplars of the category, a
phenomenon known as the “old-items advantage effect”.</p>
      <p>In the present paper, we explore the relations between input from
humans performing a categorization task and the similarity between the stimulus to be
categorized (target) and the prototypes of the candidate categories. At the same time,
we try to limit potential effects arising from the nature of previous experimental
designs, and check whether earlier results are replicated. In the following paragraphs, we
describe a new experimental design that addresses the above effects.</p>
    </sec>
    <sec id="sec-3">
      <title>Introducing our approach</title>
      <p>In the experimental design we used, participants were presented with three images of
novel objects and were asked to categorize the image on the left part of the screen (the
target t) in one of the two given categories, represented by two images a and b, on the
right part of the screen (Fig. 1). After their selection, they were asked to report their
confidence about their decision, on a scale from 1 (unsure) to 3 (sure). For each trial,
we recorded three values: (i) the selected category, (ii) the reaction time (RT), and
(iii) a self-reported confidence rating (CR) about the response. The experiment
comprised eighty trials, which were presented sequentially to the subjects in random
order.</p>
      <p>The nature of the stimuli (images of novel objects) as well as our experimental
design (presenting randomly-created triplets of images in each trial) resulted in a less
straightforward categorization task. In some cases, the item to be categorized (target)
could not easily fit into either of the given categories, while in other cases the target
could fit almost equally well into both given categories. The purpose of such a setting was
twofold. First, to test our hypothesis, we needed a range of possible arrangements of
the target and the prototypes of the two categories in the conceptual space. Second,
we argue that using a setting where the items to be categorized are not always clearly
members of one and only one category better simulates more realistic situations. For
example, everyday objects could be members of more than one category (e.g., a
smartphone is also a camera), images might depict more than one object or concept
(e.g., a picture of a beach view depicts the sea, the sun, the sky and maybe more
concepts all at once), excerpts of text do not always have a unique style to be
characterized (e.g., a text might be characterized as scientific and educational at the same
time), and users’ reviews might involve more than one emotion (e.g. a buyer might be
angry and also disappointed by a product). In such cases, where there is not only one
unique category where the instance fits, considering additional input such as the RTs
could give us some more insight about the most dominant or representative category
among all the candidate categories. In the following two sections we introduce some
more technical benefits arising from our experimental design and we explain how our
design limits potential effects that might be present in the standard classification
tasks.</p>
    </sec>
    <sec id="sec-4">
      <title>Limiting possible prior knowledge effects for concept representation</title>
      <p>Human categorization of instances in commonly known categories such as males
and females inevitably triggers effects arising from individual differences based on
participants’ prior knowledge related to the given categories. Such differences might
arise either from individual experiences or from other social or geographical factors (e.g.,
Asian male faces significantly differ from Caucasian male faces). Human information
processing and decision-making depend on personal pre-existing mental
representations of the category, whether the category is represented by a prototype, a set of
exemplars, or even by a set of rules of necessary and sufficient conditions. Even if
experimenters explicitly ask participants to ignore any prior knowledge about the
category and base their judgments only on some given prototypes or rules, it is not
guaranteed that such effects of the prior knowledge will be successfully inhibited.</p>
      <p>
        To avoid any pre-conceived categories, in our experiment we use a categorization
task of unfamiliar objects coming from the NOUN database [
        <xref ref-type="bibr" rid="ref9">9, 10</xref>
        ], a collection of 64
images of novel objects specially created for experimental research studies. Since
participants are not familiar with the visual stimuli of the task, and hence have no
a priori knowledge of the target images and the categories represented<sup>1</sup>, we argue that
the prior experiences that might influence participants’ behavior are limited.
      </p>
      <p>Moreover, experiments in previous studies make space for individual
representations of the categories based on prior knowledge, allowing participants to use their
own prototypes of the category. In our experiment, instead of naming the given
categories, we represent categories with images coming from the NOUN database. This
way, we explicitly define the prototype of each category by an image, preventing any
possible subjective interpretations of the categories. Participants, having no other clue
on which to base their decision, are in effect ‘forced’ to use the given image as the category’s
prototype.</p>
    </sec>
    <sec id="sec-5">
      <title>Controlling the use of exemplars</title>
      <p>
        Considering the interaction between the prototype-based and the exemplar-based
categorization processes [1, 3, 4, 12, 13, 19], shorter RTs do not necessarily indicate a
smaller distance between the stimulus and the prototype. Experimental psychology
results [
        <xref ref-type="bibr" rid="ref6">14–16</xref>
        ] have shown that stimuli that are found to be similar to previously
encountered exemplars of the category are categorized more easily (i.e., faster and
more accurately) than non-familiar stimuli that are equally typical (or even more
typical) members of the category. Moreover, when there is a pre-encountered exemplar of
the category corresponding to the stimulus to be categorized, the categorization
process is based on the similarity between the stimulus and the known exemplar rather
than between the stimulus and the prototype. This advantage of the exemplar over the
prototype is known as the “old-items advantage effect”. Underscoring this
effect, Hahn et al. [8] reported that exemplar similarity was dominant even in cases
where basing categorization on a given rule would lead to perfect performance.
      </p>
      <p>Although our experiment uses a categorization task with novel objects, according
to the “old-items advantage effect”, previously encountered targets of a category
could favor the categorization of new targets into the same category. This is why in our
experimental design the candidate categories change between trials, instead of being
fixed throughout the experiment. In this way, we ensure that the only representation of
the categories used by the participants is the prototype, as
defined by the experimenters for each trial.</p>
      <p>However, since the available images from the NOUN database were limited, some
of the images would inevitably be presented more than once throughout the
experiment (either in the form of a target or in the form of a prototype of a category). To
control any sequential effects caused by previously presented images, we randomized
the order of the trials for each participant.</p>
    </sec>
    <sec id="sec-6">
      <title>Overcoming the obstacles caused by using unspecified concepts</title>
      <p><sup>1</sup> Please note that even if the novelty of the objects implies that the categories are not
well-defined a priori, this does not imply that the categories of such objects are not pre-defined in
the sense of Barsalou’s ‘ad hoc’ categories [2], which are categories of known, familiar
objects.</p>
      <p>Given that in our experiment the two categories of each trial are represented only
by an image, and that the two categories differ from trial to trial, we only have two
points in the conceptual space for each trial (acting as a prototype of the category).
Therefore, the boundary between the two categories cannot be computed. In other
words, the use of unspecified concepts implies the absence of an explicit boundary
separating the two given categories.</p>
      <p>To overcome this obstacle, we approximate the notion of distance between the
target to be categorized and the boundary of the two categories, by using the notion of
distance between pairs of images (i.e., the target and the prototype of each category).
For images of the NOUN database, the empirically derived distance between all pairs
of images is provided (see Materials section). This is one more reason why we
decided to use the NOUN database.</p>
      <p>Using the above approximation, we make the following assumptions. First, by
comparing the two distances (i.e., the distance between the target and the prototype of
each category), we can determine the expected categorization of the target (which we
consider the ‘correct’ label). Second, by looking at the values of the two
distances, we can estimate the position of the target in the conceptual space
w.r.t. the boundary separating the two categories. When the target is distant from both
prototypes, we suppose that it is close to the boundary between the two categories. Conversely,
when the target is close to one prototype and distant from the
other, we suppose that the target is distant from the boundary, lying on the side of
the closer prototype. Under these assumptions, we examine whether our
experimental results are consistent with previous work, despite the methodological
differences between the two experimental settings.</p>
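      <p>The two assumptions can be sketched in Python as follows. This is only an illustration: the function names and the <monospace>close_threshold</monospace> cut-off that separates ‘close’ from ‘distant’ are hypothetical, and the case where the target is close to both prototypes is left unspecified because the assumptions above do not cover it.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def expected_label(d_ta, d_tb):
    """Return the prototype ('a' or 'b') closer to the target, or None on a tie."""
    if d_ta == d_tb:
        return None
    return "a" if d_ta < d_tb else "b"


def boundary_proximity(d_ta, d_tb, close_threshold):
    """Coarsely locate the target relative to the category boundary.

    close_threshold is a hypothetical cut-off: a distance at or below it
    counts as 'close' to a prototype, anything above it as 'distant'.
    """
    close_a = d_ta <= close_threshold
    close_b = d_tb <= close_threshold
    if not close_a and not close_b:
        return "near"         # distant from both prototypes
    if close_a != close_b:
        return "far"          # close to exactly one prototype
    return "unspecified"      # close to both: not covered by the two assumptions
```
]]></preformat>
      <p>Here <monospace>d_ta</monospace> and <monospace>d_tb</monospace> stand for the distances between the target and the prototypes a and b, respectively, on whatever scale the pairwise distances are given.</p>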
      <sec id="sec-6-1">
        <title>Empirical Method</title>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Participants</title>
      <p>Our data were derived from a sample of 40 adults (25 males, 15 females), aged
22 to 66, who participated voluntarily in the study by completing an online
experimental task. Participants were naïve to the purpose of the experiment and received no
financial or other compensation for their participation. They reported having normal
or corrected-to-normal vision and provided informed consent. All participants
completed all trials of our experiment.</p>
      <p>For the descriptive analysis, we used the entire data set, while for the remaining
part of our analysis we excluded two participants who were identified as outliers
based on their RTs (see Results section).</p>
    </sec>
    <sec id="sec-8">
      <title>Materials</title>
      <p>
        In our experiment, we use the NOUN database [
        <xref ref-type="bibr" rid="ref9">9, 10</xref>
        ], a collection of 64 images,
specifically designed for experimental research, especially for categorization studies.
The objects depicted in the images are naturalistic, complex, multipart and
multicolored, three-dimensional real objects [9], which in some respects resemble
everyday familiar objects but at the same time are distinct and novel.
      </p>
      <p>Sixty of the images were used in our experiment due to their higher quality, while
the remaining 4 images were used only in the practice session. All images that we
used were resized to 300×300 pixels, to ensure fast loading during each trial of the
experiment.</p>
      <p>Additionally, the NOUN database comes with a similarity matrix, providing a
similarity rating for each pair of images. To obtain these ratings, Horst &amp; Hout [9]
performed an experiment, based on the spatial model of similarity. In their experiment
participants completed a task of spatial arrangement, comprising 13 trials. In each
trial, participants were given 20 images of the NOUN database and they were asked to
arrange the images in the two-dimensional space, based on their perceived similarity
(i.e., more similar items placed closer together). Based on the participants’ arrangements, the
experimenters calculated all pairwise similarity ratings using multidimensional scaling
(MDS) on the Euclidean distance for each pair of images. Lastly, Horst &amp; Hout
rank-ordered all pairs of images into four quartiles, based on the distances between their
elements. Pairs belonging to the first quartile were the most similar pairs, while pairs
of the fourth quartile were the most dissimilar ones. In our experimental design, we
group the pairs of images into similar and dissimilar, based on the quartiles given by
Horst &amp; Hout.</p>
    </sec>
    <sec id="sec-9">
      <title>Experimental design</title>
      <p>For our experiment, we created ordered triplets (t, a, b) of images, one for each
trial, where (i) t is the target to be categorized, (ii) a is the prototype of category A, and
(iii) b is the prototype of category B. Using 60 of the 64 images of the NOUN
database, we created all 60∙59∙58 = 205,320 ordered triplets of pairwise distinct images (t ≠ a, t ≠ b, a ≠ b),
i.e., all ordered selections of three of the 60 images without repetition.</p>
      <p>We then characterized the above triplets based on the similarity ratings of each pair
(t, a), (t, b), (a, b) of images, as provided by the creators of the database using a
multidimensional scaling analysis [9]. To limit the number of our experimental
conditions, we created two groups of pairs; pairs of similar items (by merging the first and
second quartiles), and pairs of dissimilar items (by merging the third and fourth
quartiles). Subsequently, we named the families of triplets w.r.t. the similarity between the
elements of each pair (t, a), (t, b), (a, b). Pairs (t, a) and (t, b) were characterized as
High (H) when their elements were similar, and as Low (L) when their elements were
dissimilar. Similarly, pairs (a, b) were characterized as Similar (Sim) or Dissimilar
(Dis) when the prototypes a, b of categories A, B were similar or dissimilar,
respectively. Based on the above terminology, we ended up with the following families of
triplets: LL-Sim, LL-Dis, LH-Sim, LH-Dis, HL-Sim, HL-Dis, HH-Sim, HH-Dis,
which constituted the eight conditions of our experimental design (Table 1, Fig. 2).</p>
      <p>Fig. 2. Percentages of triplets produced for cases of similar (left) and dissimilar (right) pairs of
prototypes (a, b). The blue and orange circles represent all similar pairs (t, a) and (t, b),
respectively.</p>
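      <p>The naming of the families of triplets can be sketched in Python as follows. The sketch assumes a lookup table <monospace>quartile_of</monospace> that maps each unordered pair of images to its similarity quartile (1 to 4); this table and all function names are hypothetical stand-ins for the similarity data that accompany the database, not the actual data or the authors’ code.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from itertools import permutations
from math import perm


def is_similar(quartile):
    """Quartiles 1 and 2 are merged into 'similar', 3 and 4 into 'dissimilar'."""
    return quartile <= 2


def family(q_ta, q_tb, q_ab):
    """Name the family of a triplet from the quartiles of its three pairs."""
    first = "H" if is_similar(q_ta) else "L"       # pair (t, a)
    second = "H" if is_similar(q_tb) else "L"      # pair (t, b)
    protos = "Sim" if is_similar(q_ab) else "Dis"  # pair (a, b)
    return f"{first}{second}-{protos}"


def label_triplets(images, quartile_of):
    """Map every ordered triplet (t, a, b) of distinct images to its family."""
    labelled = {}
    for t, a, b in permutations(images, 3):
        labelled[(t, a, b)] = family(
            quartile_of[frozenset((t, a))],
            quartile_of[frozenset((t, b))],
            quartile_of[frozenset((a, b))],
        )
    return labelled


def count_ordered_triplets(n_images):
    """Number of ordered triplets of distinct images: n * (n - 1) * (n - 2)."""
    return perm(n_images, 3)
```
]]></preformat>
      <p>For 60 images, <monospace>count_ordered_triplets(60)</monospace> gives 60∙59∙58 = 205,320, which matches the number of triplets reported for the experimental design.</p>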
      <p>In examining the families of triplets produced, we made some surprising
observations. One would not expect to find any HH-Dis triplets, since this would
imply that the target t is highly similar to both prototypes a and b, while a and b
are not similar to each other. Similarly, the LH-Sim and HL-Sim families of triplets were
also unexpected, since in such scenarios the target t is similar to only one of
the two prototypes a and b, while a and b are similar to each other. However, such
families of triplets, which we considered less likely, did occur, albeit in smaller
proportions (Table 1, Fig. 2).</p>
      <p>Regardless of the size of the produced families of triplets, we balanced our final
experiment across conditions. Hence, the final experiment consisted of 80 trials in total,
10 trials for each condition, which were selected uniformly at random. The pool of
selected triplets was fixed for all participants, but trials were presented in random
order for each participant, to avoid any sequential and order effects.</p>
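      <p>A balanced selection of this kind might be sketched in Python as shown below. The function names, the <monospace>labelled</monospace> mapping from triplets to family names, and the use of seeds are assumptions made for illustration; the sketch is not the procedure actually used to build the experiment.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def select_balanced(labelled, per_condition, seed=None):
    """Draw per_condition triplets uniformly at random from each family.

    labelled maps each triplet to its family name (e.g. 'HL-Dis').
    """
    rng = random.Random(seed)
    pools = {}
    for triplet, fam in labelled.items():
        pools.setdefault(fam, []).append(triplet)
    selected = {}
    for fam in sorted(pools):
        if len(pools[fam]) < per_condition:
            raise ValueError(f"family {fam} has too few triplets")
        selected[fam] = rng.sample(pools[fam], per_condition)
    return selected


def presentation_order(selected, seed=None):
    """Return all selected trials in a fresh random order for one participant."""
    trials = [t for fam in sorted(selected) for t in selected[fam]]
    random.Random(seed).shuffle(trials)
    return trials
```
]]></preformat>
      <p>Calling <monospace>select_balanced</monospace> once with <monospace>per_condition=10</monospace> over eight families would yield a fixed pool of 80 trials, and calling <monospace>presentation_order</monospace> separately for each participant would shuffle that same pool.</p>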
    </sec>
    <sec id="sec-10">
      <title>Procedure</title>
      <p>Participants were personally invited to participate in our study. Before starting
the procedure, we had a personal session with each participant, to make sure all
experimental criteria were met. First, we made clear that a desktop or laptop computer was needed
for participation (no mobiles, tablets, or other smart devices were allowed). If
they reported using a laptop, we required the use of a mouse (instead of the
laptop’s trackpad) for submitting their answers. Although the web interface was light
enough to ensure smooth loading between trials, we also made clear that an average
Internet connection speed was necessary during the experimental task. Finally, we
strongly recommended that participants be in a quiet environment with no
distractors while completing the experiment.</p>
      <p>After making sure that all above criteria were met, we sent to the participants a first
link to one of the experiment’s trials (from the practice phase) to calibrate their
browsers. We guided them to zoom in / zoom out their browsers so that the frame
surrounding all three images of the trial would cover most of the surface of their
monitor. After everything was set, we sent them a second link directing them to the web
interface of the experiment and invited them to start.</p>
      <p>On the first screen of the experiment, participants were informed about the study
and completed an electronic consent form. A screen with detailed instructions
followed, where participants were informed about the task, the timing, and the
self-reported rating of their confidence for each response. Regarding timing,
participants were advised to answer as fast as possible without sacrificing accuracy, so as to
ensure that their decisions involved not only perceptual but also conscious
cognitive processing. At the end of the instructions, participants were informed that a
practice phase would follow, to ensure that the procedure was clear.</p>
      <p>The practice phase consisted of four trials, identical to the trials of the actual
experiment, during which no responses were recorded. Images that were presented in the
practice phase were excluded from the actual experiment. After the practice was
completed, participants were informed that the experiment begins.</p>
      <p>In each trial of the experiment, a triplet (t, a, b) was randomly selected from the pool
of the pre-selected triplets of the experiment. To record the RT, timing started
when all three images t, a, and b were presented on the screen and stopped as
soon as the participant clicked on one of the two images a and b. After their
selection, a smaller window appeared and participants had to rate their confidence
in their previous response. Participants selected one, two, or three stars to report
their confidence level and then clicked on the “Show next” button to proceed
to the next trial. To control the distance between the position of the mouse at the start of
a trial and each category image a, b, we placed the “Show next” button in a
position equidistant from both category images. Eighty trials (ten trials from each
condition) appeared sequentially, in random order for each participant. After
participants completed all trials of the experiment, we thanked them and redirected them to the
webpage of our lab.</p>
      <sec id="sec-10-1">
        <title>Results</title>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Descriptive statistics</title>
      <p>Since our experimental design and the nature of the stimuli did not allow for an
“objective truth”, ‘correct’ responses were defined only for the families of triplets in
which the target t was similar to one of the prototypes and dissimilar to the other
(i.e., for the LH-Sim, LH-Dis, HL-Sim, and HL-Dis families). For these less
‘ambiguous’ families of triplets, we considered as the ‘correct’ response the prototype a or b
that was similar to the target t. For example, for trials from the
LH-Sim family, the ‘correct’ response was image b (i.e., the one positioned at the
bottom right of the screen), while for trials from the HL-Sim family, the ‘correct’
response was image a (i.e., the one at the top right of the screen).</p>
      <p>Based on the number of ‘correct’ and ‘wrong’ responses given by participants for
each trial, we calculated the variable classification error (CE) by dividing the number
of wrong responses by the number of valid responses given for each trial, that is,
CE = (number of wrong responses) / (number of valid responses) (1). The
CE value could only be calculated for the families of triplets where the ‘correct’
response could be defined (i.e., for the less ‘ambiguous’ families).</p>
      <p>Descriptive statistics for CE, RT, and CR are shown in Table 2, Table 3, and Table 4,
respectively.</p>
      <p>Bivariate correlations between (a) RT and CE, (b) CR and CE, and (c) RT and CR
were also calculated. Correlations (a) and (b) were calculated only for triplets where
the CE could be calculated (i.e., only for the non-‘ambiguous’ families of triplets;
N=1600), while correlation (c) was calculated for the entire dataset (N=3200).</p>
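      <p>The CE computation described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the use of None to mark a response that is not valid, and the example responses are assumptions made here and are not taken from the experiment.</p>
      <preformat>
```python
def classification_error(responses, correct):
    """Share of valid responses that differ from the 'correct' prototype.

    responses: list of 'a', 'b', or None (None is an assumed marker for a
               response that is not valid)
    correct:   the prototype ('a' or 'b') that is similar to the target t
    """
    valid = [r for r in responses if r is not None]
    if not valid:
        raise ValueError("no valid responses for this trial")
    wrong = sum(1 for r in valid if r != correct)
    return wrong / len(valid)


# Hypothetical responses to one LH-Sim trial, where prototype b is 'correct'.
example = ["b", "b", "a", "b", None, "a", "b", "b", "b"]
print(classification_error(example, correct="b"))  # 2 wrong of 8 valid: 0.25
```
</preformat>
      <p>For the ‘ambiguous’ families no value of the correct argument can be defined, which is why CE is left uncomputed for them.</p>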
      <p>According to the results, (a) there was a significant correlation between the RT and
the CE, r = .077, p (one-tailed) &lt; .01, indicating that people spent more time on trials
for which they selected the wrong category, (b) there was a significant correlation
between the CR and the CE, r = -.097, p (one-tailed) &lt; .01, indicating that people
were less confident on trials for which they selected the wrong category, and (c) there
was a significant correlation between the RT and the CR, r = -.143, p (one-tailed) &lt;
.01, indicating that people spent more time on trials for which they were less
confident.</p>
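      <p>As an illustration of how such a bivariate correlation is obtained, the following Python sketch computes a Pearson coefficient between RTs and an error indicator. It assumes that each response is coded 1 when wrong and 0 when ‘correct’; this coding and the data values are hypothetical, and the statistical package actually used for the analysis is not specified here.</p>
      <preformat>
```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and equally long")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical responses: RT in seconds, error coded 1 (wrong) or 0 ('correct').
rts = [1.0, 2.0, 3.0, 4.0]
errors = [0, 0, 1, 1]
print(round(pearson_r(rts, errors), 3))  # 0.894
```
</preformat>
      <p>A positive coefficient, as in (a), means that longer RTs go together with more errors; a negative one, as in (b) and (c), means that the two variables move in opposite directions.</p>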
    </sec>
    <sec id="sec-12">
      <title>Screening data and testing assumptions</title>
      <p>All participants fully completed the experiment, and hence there were no missing
values in our dataset. For each condition of the experiment, we tested our data for
normality. Since the normality assumption was violated, we checked for cases identified
as outliers (i.e., participants with high RTs compared to the sample’s mean RT). Two
participants were identified as outliers in most of the experiment’s conditions (6 of 8
and 7 of 8 conditions, respectively), and a third one in only 3 of 8 conditions
(HH-Dis, HH-Sim, and LL-Sim). The first two were excluded from the sample, whereas
for the third one we used winsorization to limit extreme values. Hence, for the rest of
our analyses, our final sample consisted of 38 participants (n=38). After the above
corrections, the assumption of normality for RT was met.</p>
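      <p>Winsorization replaces extreme values with the nearest value that is retained, rather than discarding them. The Python sketch below shows an upper-tail variant, since the outliers here were high RTs. The number k of values to limit and the RT values are hypothetical; the cutoff actually applied to the third participant is not stated above.</p>
      <preformat>
```python
def winsorize_upper(values, k=1):
    """Replace the k largest values with the largest value that is retained."""
    n = len(values)
    if k not in range(n):
        raise ValueError("k must be an integer from 0 to len(values) - 1")
    ceiling = sorted(values)[n - k - 1]
    return [min(v, ceiling) for v in values]


# Hypothetical mean RTs (seconds) of one participant across six conditions.
rts = [0.9, 1.1, 1.2, 1.3, 1.4, 6.5]
print(winsorize_upper(rts, k=1))  # [0.9, 1.1, 1.2, 1.3, 1.4, 1.4]
```
</preformat>
      <p>The sample size is unchanged by this step, which is why the third participant remains in the final sample of 38.</p>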
    </sec>
    <sec id="sec-13">
      <title>Examination of the RTs</title>
      <p>To examine the RTs among the eight families of triplets, we considered the variable
Target Position, with four levels (HH, HL, LH, LL), and the variable Categories
Similarity, with two levels (Sim, Dis), and we conducted a two-way repeated measures
analysis of variance for these two within-subjects factors (Fig. 3).</p>
      <p>Mauchly’s test indicated that the assumption of sphericity was violated neither for
the Target Position factor (χ²(5) = 1.90, p &gt; .05) nor for the interaction of the two
factors (χ²(5) = 6.10, p &gt; .05). The results show that there was a significant main
effect of both Target Position (F(3,111) = 9.53, p &lt; .01) and Categories
Similarity (F(1,37) = 8.47, p &lt; .01), as well as a significant interaction between them
(F(3,111) = 2.82, p &lt; .05).</p>
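      <p>The degrees of freedom reported above follow directly from the design: 38 participants, four levels of Target Position, and two levels of Categories Similarity. The short Python sketch below reproduces them using the standard formulas for a fully within-subjects two-way design; the function and key names are illustrative only.</p>
      <preformat>
```python
def rm_anova_df(n_subjects, levels_a, levels_b):
    """(effect df, error df) for each term of a two-way repeated measures ANOVA."""
    df_a = levels_a - 1
    df_b = levels_b - 1
    df_ab = df_a * df_b
    s = n_subjects - 1
    return {
        "target_position": (df_a, df_a * s),
        "categories_similarity": (df_b, df_b * s),
        "interaction": (df_ab, df_ab * s),
    }


def mauchly_df(effect_df):
    """Degrees of freedom of Mauchly's chi-square test for an effect."""
    return effect_df * (effect_df + 1) // 2 - 1


print(rm_anova_df(38, 4, 2))
print(mauchly_df(3))  # 5, for Target Position and for the interaction
```
</preformat>
      <p>For Categories Similarity the effect has a single degree of freedom, so mauchly_df(1) is 0 and no sphericity test applies, which is consistent with only two Mauchly statistics being reported.</p>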
      <p>Pairwise comparisons revealed a significant
difference in average RTs only between the (HH,LH), (HH,LL), and (HL,LL)
Target Positions (p &lt; .01). There was also a significant difference between the Similar
and Dissimilar triplets (p &lt; .01).</p>
      <p>Although the mean RTs for the HL and LH Target Positions were not found to
be significantly different, we observed that participants did not behave the same way in
these two cases, which was unexpected. To explore this asymmetry further, we
considered some additional factors. One possible interpretation is that the ten triplets
selected for each of the families of triplets were not balanced. Another interpretation
is that the position of the two prototypes a and b also influences the RTs, and
hence the decision-making process.</p>
      <p>To check the first interpretation, we examined whether the pairs making up the
triplets of the HL-Sim, HL-Dis, LH-Sim, and LH-Dis families were biased with respect to their
similarity ratings. The mean similarity rating for the low similarity pairs was 540.15
and 406.05 for the HL and LH cases, respectively. The mean similarity rating for the
high similarity pairs was 1440.05 and 1481.80 for the HL and LH cases, respectively.
This indicates that the triplets were balanced between the HL and LH cases.</p>
      <p>To examine the second interpretation, we set aside the analysis of the participants’
responses with respect to the ‘correct’ response and examined the responses only with respect to the
position of the selected image (i.e., top right of the screen or bottom right of the
screen). Results show that for trials where the ‘correct’ response was at the bottom
(LH-Dis, LH-Sim), participants’ accuracy was no better than random selection,
whereas for trials where the ‘correct’ response was at the top, people tended to select
the ‘correct’ response, regardless of its position (Table 5).</p>
      <p>* Underlined values indicate statistically significant differences (p &lt; .05) between the
means of RTs and CRs of the two independent groups.</p>
      <sec id="sec-13-1">
        <title>Discussion</title>
        <p>Our results replicate previous findings on the meaning of RTs in categorization
tasks while limiting potential effects arising from the nature of previous experimental
designs. On trials with longer RTs, people responded with less confidence and were
more prone to making errors than on trials with shorter RTs, which is consistent with
previous work. Moreover, people responded faster to targets with high similarity to
at least one of the prototypes of the given categories (HL and LH conditions) than to
targets that were distant from both prototypes (LL), and hence closer to the boundary
of the two categories, confirming our main hypothesis.</p>
        <p>The shortest RTs were found in the HH-Dis family of triplets, where the target t
was similar to both prototypes a and b, but the two prototypes were dissimilar.
Although we expected that trials from this family would require longer processing in
order to choose the best option, the experimental results showed that this was the case
in which participants responded fastest. We also found the highest average
CR for this family of triplets, with most people reporting that they were fairly confident
about their selection (rating their confidence with 2 stars out of 3). One
interpretation of this phenomenon could be that participants, as soon as they identified one
fitting category for the target, did not spend any extra time checking whether there
was a second fitting category or trying to decide which of the two was the more
appropriate. Hence, lower RTs do not always indicate instances that are typical for one
category and not typical for the other, as we initially assumed. This could be a
useful finding for cases where targets could be members of more than one category,
since lower RTs do not always imply that the categories not
selected by the participant are excluded.</p>
        <p>Finally, the fact that the HL and LH conditions were not symmetrical highlights
the need for further examination of other factors, such as the screen position in which the
candidate categories appear.</p>
      </sec>
      <sec id="sec-13-2">
        <title>Conclusion</title>
        <p>The above results, though preliminary, are promising. First, they replicate
previous findings on the meaning of RTs in categorization tasks, while limiting
potential effects arising from the nature of previous experimental designs. We
consider that replicating previous results even with novel images that form
unspecified concepts indicates that our basic hypothesis is primitive with respect to the basic
processes of human categorization. Second, the experimental design we used, combined
with the findings of the present study, uncovers aspects that remained hidden in previous
studies, opening the way to future work in multiple directions.</p>
        <p>We are currently investigating possible bias effects arising from the position of the
prototypes (top / bottom) or from any other presentation effects. Eye-tracking
techniques can also be used to better interpret findings from RTs, as a quantitative
measure of the cognitive processes involved in the task, as well as a tool for exploring
other possible effects and revealing biases. Future work could also involve
experimentation with more familiar stimuli, such as (i) images of familiar objects, (ii) images
depicting more than one object, or (iii) excerpts of text, which could be characterized
by multiple labels.</p>
        <p>Acknowledgements.</p>
        <p>We thank Christos Rodosthenous for assistance with creating the web interface of the
experiment and for comments that greatly improved the manuscript.</p>
      </sec>
    </sec>
  </body>
  <back>
  </back>
</article>