Crowdsourcing Image Schemas

Dagmar GROMANN a and Jamie C. MACBETH b,1

a TU Dresden, International Center for Computational Logic, Germany
b Department of Computer Science, Smith College, Northampton, Massachusetts, USA

Abstract. With their potential to map experiential structures from the sensorimotor to the abstract cognitive realm, image schemas are believed to provide an embodied grounding to our cognitive conceptual system, including natural language. Few empirical studies have evaluated humans' intuitive understanding of image schemas or the coherence of image-schematic annotations of natural language. In this paper we present the results of a human-subjects study in which 100 participants annotate 12 simple English sentences with one or more image schemas. We find that human subjects recruited from a crowdsourcing platform can understand image schema descriptions and use them to perform annotations of texts, but also that in many cases multiple image schema annotations apply to the same simple sentence, a phenomenon we call image schema collocations. This study carries implications both for methodologies of future studies of image schemas, and for the inexpensive and efficient creation of large text corpora with image schema annotations.

Keywords. Crowdsourcing, natural language annotation, image schema, natural language understanding, cognitive linguistics.

1. Introduction

Image schemas offer one possible explanation for the transition from perception to meaning [10]. Studies have shown that even abstract concepts are grounded in our sensorimotor experiences (e.g. [24]). When people talk or think about a "chair", it is associated with a simulation of the movement of "easing into a chair" and its associated multimodal representations (e.g. how a chair looks and feels), which leads to a slight neural activation in the respective motion areas [2]. A similar activation can be observed when we think about a more abstract idiom, e.g.
"playing first chair". The main assumption here is that natural language, whether abstract or concrete, is grounded in image schemas. A significant segment of the cognitive linguistics community has argued that the conceptual structure derived from our sensorimotor experiences and episodic memories shapes the semantic structure of natural language (see e.g. [22,26]). One theory that aims at capturing this conceptual structure arising from our bodily sensations is the theory of image schemas, introduced by Lakoff [17] and Johnson [13] within the paradigm of embodied cognition. Image schemas are internally structured, that is, composed of a few spatial primitives that make up more complex image schemas and schematic integrations [10,22,12]. While their existence in natural language has been studied by means of corpus-based (e.g. [23,24]) and machine learning methods (e.g. [8,9]), few empirical studies with human subjects have been proposed [4,7].

The majority of image-schematic experimental setups involved visual stimuli or materials rather than descriptions of image schemas (e.g. [6,23]), asking subjects to narrate (e.g. [3]) or visualize what is being narrated (e.g. [6]). One exception is that of Gibbs et al. [7], who asked students to map image schemas to bodily experiences of physical exercises they performed, which required describing image schemas to the participating students [7]. The work closest to ours is Cienki's [4], which asked participants to annotate videos with image schemas. We draw inspiration from these previous descriptions of image schemas in our annotation task [7,4].

1 Corresponding Author: Dagmar Gromann, TU Dresden, International Center for Computational Logic, Dresden, Germany; E-mail: dagmar.gromann@gmail.com.
In this paper, we contribute a study that tests image schemas on a significant number of human participants, that is, one hundred, to determine the coherence between imagistic aspects and lexical representations, and to study methods for future investigations on connecting image schemas to language. Our study presents natural language sentences to human subjects and gives them the tasks of identifying image schemas within their understanding of the sentences and annotating the sentences with specific image schemas. We use crowdsourcing as a time-efficient and low-cost method of obtaining a large number of these image schema annotations. We evaluate its utility for image schema annotations of natural language sentences that may be used as data in future studies, or for machine learning-based natural language processing with image schemas. This evaluation includes the analysis of inter-rater reliabilities as well as of the natural language justifications of selections made by participants in the study. We found that human subjects recruited from a crowdsourcing platform could perform the annotation task, but that the annotation cannot be treated as a simple classification, since annotations for many sentences resulted in collocations of image schemas.

2. Crowdsourced Human Participants Study

We performed a study of human subjects in which they read simple English sentences and matched their conceptualizations of the meanings of the sentences to a number of image schemas. To the best of our knowledge, this is the first study of this kind on image schemas, and the first using crowdsourcing. Thus, we had to look to other studies of highly abstract cognitive building blocks for guidance on how to set up such a task. To this end, we consulted a study on crowdsourcing the conceptual primitives of Schank's Conceptual Dependency theory [20,25], from which we took the sentences used in this experiment, provided in Table 2.
A similarity between Conceptual Dependency primitives and image schemas has been shown in previous comparative studies [19].

2.1. Selecting Image Schemas

In the classic exposition of image schemas, Johnson [13] provided a total of 29, to which Lakoff [17] added several more, and, over the years, many additional image schemas, such as SUPPORT [21], have been proposed. In order not to overwhelm participants of this study, we selected a subset of image schemas appearing in the literature, restricting ourselves to those that are well specified and equipped with concise descriptions.

Several image schemas initially proposed by Johnson are very general and difficult to grasp without detailed explanations (e.g. PROCESS), while others, such as ENABLEMENT as one of the FORCE family of schemas, are highly specific. In order to provide a more balanced account of image schemas to participants in this study, we follow the lead of Cienki [4] and select image schemas with a moderate level of specificity. Furthermore, since this study requires the formulation of self-explanatory descriptions of the utilized image schemas for non-experts, we limited our study to those for which simple and complete descriptions are available in the literature. In a previous comparison of image schemas and Conceptual Dependency primitives, the target sentences of this study were implicitly annotated with image schemas by three experts [19]. This availability of annotations was another factor influencing our selection strategy. The final set of selected image schemas is provided in Table 1 alongside their descriptions.

2.2. Explaining Image Schemas

Crowdsourcing the annotation of sentences with image schemas requires the description of image schemas to non-expert participants. This experiment assumes and tests a certain degree of coherence in the imagistic aspects of lexical representations.
To this end, the main question to be answered is whether subjects agree on the mapping of image schemas to linguistic expressions.

Linguistically motivated analyses of image schemas have rarely involved the description of image schemas and spatial primitives to human subjects. As one of the few exceptions, Gibbs et al. [7] presented 22 students, who had performed several bodily exercises representing the word stand, with brief descriptions of 12 different image schemas and asked them to rate the degree of relatedness between each schema and exercise on a scale of one to seven. The predominant schemas were BALANCE, VERTICALITY, CENTER-PERIPHERY, RESISTANCE, and LINK. The same descriptions were re-used in two similar related experiments. The descriptions used are strongly related to the active bodily experience of motion that subjects undergo within the experiment. For instance, CONTAINMENT is described as follows: "Container refers to the experience of boundedness and enclosure. As you stand there, do you feel a sense of container?" [7]. Cienki [4] conducted an experiment with 80 students who annotated gestures in videos with image schemas, which required the description of image schemas to the participants. We adapted Cienki's set of image schema descriptions, where available, to our purpose and the task of text annotation, and provided them to the participants of this study as represented in Table 1. In addition, we provided participants with the option to specify their own category to annotate the sentences by using the option "OTHER".

2.3. Crowdsourcing the Study

The study was performed on the Amazon Mechanical Turk crowdsourcing platform2. It was posted as a set of Mechanical Turk human intelligence tasks (HITs) and advertised as "a survey on language and sentence categorization" for both "masters" level and "non-masters" workers.
Participants performed the study remotely and entirely through Mechanical Turk by accepting the task and filling out an HTML form on a Web page on the Turk platform. We conducted two smaller pilot studies (with 12 and 10 participants respectively) to ensure the validity of our task, in particular the intelligibility of our descriptions. Only workers who had an overall HIT approval rate greater than or equal to 95% and more than 1000 approved HITs were allowed to perform the task. The HIT was set up such that any unique Turk worker could only perform it once. A total of 100 participants took part in the study. The HTML crowdsourcing interface to the study had descriptions of the five image schemas and the "Other" category.

2 http://www.mturk.com

Table 1. Image schemas and their descriptions as used in the study.

CONTAINER: A container has a boundary that separates an inside from an outside. It can hold things. We can be contained (for example, in a room), and our own bodies are containers [4].
PATH: A path is a route for moving from a starting point to an end point. We or a thing can follow an existing path, or make a path with our or its own movement (adapted from [4]).
SUPPORT: Contact between two objects in the vertical dimension [21]. For instance, a book can be supported by a table when the book is in contact with the table's surface.
FORCE: Force usually implies the exertion of physical strength in one or more directions. We can experience force in terms of compulsion, blockage, or enablement [4].
PART-WHOLE: Part-whole describes whole(s) consisting of parts and a configuration of the parts. Our bodies can be seen as a whole with several parts. An object can be a whole with many parts (adapted from [17]).
OTHER: If none of the categories seems appropriate, select this category (6) to signify "other" and detail in your explanation a new category that you think would better fit the sentence.
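As a concrete illustration, the worker filters described in Section 2.3 (approval rate of at least 95%, more than 1000 approved HITs) can be expressed as Mechanical Turk qualification requirements. The sketch below is our reconstruction, not the authors' actual setup; it builds the request structure accepted by the boto3 MTurk client's `create_hit` call, using MTurk's system qualification type IDs for approval rate and approved-HIT count.

```python
# Sketch of the worker filters as MTurk QualificationRequirements (an
# assumption about the setup, not taken from the paper's materials).
# The IDs below are MTurk's system qualification types.
APPROVAL_RATE_QUAL = "000000000000000000L0"  # Worker_PercentAssignmentsApproved
APPROVED_HITS_QUAL = "00000000000000000040"  # Worker_NumberHITsApproved

qualification_requirements = [
    {   # overall HIT approval rate >= 95%
        "QualificationTypeId": APPROVAL_RATE_QUAL,
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [95],
    },
    {   # more than 1000 approved HITs
        "QualificationTypeId": APPROVED_HITS_QUAL,
        "Comparator": "GreaterThan",
        "IntegerValues": [1000],
    },
]

# These would be passed to the HIT roughly as:
#   client = boto3.client("mturk")
#   client.create_hit(..., MaxAssignments=100,  # one assignment per worker
#                     QualificationRequirements=qualification_requirements)
```

Because MTurk assigns at most one assignment of a HIT to any worker, a single HIT with `MaxAssignments` set to the target participant count also enforces the one-submission-per-worker constraint described above.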
The image schema descriptions were followed by the texts of the twelve target sentences (see Table 2). Along with each target sentence were six checkboxes and six text input areas, each corresponding to one of the image schemas or "Other". The interface instructed participants to check one or more boxes to identify the image schemas (described above) that matched the meaning of the sentence. For each image schema box that participants checked, they were asked to provide a short explanation in the corresponding text input area (of at least one sentence) for why the image schema was a match to the sentence. A "submit" button in the HTML interface uploaded participants' answers to Mechanical Turk.

Because the study was performed entirely online without any in-person interactions, one challenge was to determine whether participants were honestly and sincerely performing the requested task or just filling in the form randomly in exchange for payment. We rejected HIT submissions from several participants based on the comments that they made in explanation of their answers. For example, for the sentence "Jim held on to the railing", one participant checked the boxes for PATH, SUPPORT, and FORCE, and left the explanations "he was going up the stairs", "he was going down the stairs", and "he thought he was going to fall" respectively; this subject and subjects who gave similar nonsensical answers were rejected. Others were rejected for leaving nonsensical comments that we determined were not in any way connected to the image schema or to conceptualizing the sentence. These may have been due to not understanding the task or not understanding the explanations of the categories. For example, a different participant left explanations such as "one idea here" and "he did one entire thing" to justify PART-WHOLE for different sentences. Submissions were also rejected when it was clear that their comments and selected image schemas did not match.
Still others were rejected for having repeated identical answers for each sentence (e.g. one participant checked PART-WHOLE for each sentence and answered "humans have many parts" repeatedly for each, including a sentence about a gecko). Several incomplete submissions and submissions that were obviously computer generated were also rejected. Finally, we rejected seven submissions which were perfect duplicates in terms of their answers but were submitted under different Turk worker IDs. This evaluation process of participants' answers left us with a total of 100 participants, as opposed to the 120 initially submitting participants, whose answers we analyze more closely in the results section.

As a further evaluation step, we were interested in the agreement among respondents. Each respondent of this crowdsourcing task was presented with multiple possible categories representing image schemas to annotate a specific given sentence, which means answers do not fall into one of several mutually exclusive categories. The resulting multi-response data cannot be analyzed with the traditional Pearson chi-squared tests for independence due to within-subject dependence among responses. As a result, we opted for a kappa-based inter-rater reliability measure on the participants' responses. In general, statistics such as Krippendorff's alpha [16] depend on mutually exclusive categorical ratings. Kraemer [15] relaxes this constraint and uses rank order statistics to deal with the case where a rater can mark a subject with more than one category [1]. This, however, requires processing our multi-response data into rank-ordered data. Fleiss suggests a relaxation of the original kappa measures to allow for multiple raters of categorical data, which can be calculated on a per-category basis for non-mutually exclusive classification tasks such as ours [5]. This per-category calculation of inter-rater reliability can in our study be applied first to image schemas and second to sentences.
3. Results

We generally found that study participants were able to perform the task of comprehending the image schema descriptions and connecting them to their interpretation of the target sentences. We rejected only 5 participants out of 120 who were honestly attempting to perform the task but demonstrated clear difficulties in understanding or doing it; we rejected 15 other responses which were malicious, computer generated, or clearly not attempting the task for other reasons.

3.1. Statistical Results

Statistical results of the answers of the 100 participants are provided in Table 2, which represents the selected image schemas for each sentence as well as the percentage of people who selected each category (identical to the raw count, since we have exactly 100 participants). Note that a participant could select more than one image schema per sentence. The highest number in each row, corresponding to the per-sentence annotations, identifies the predominant image schema for that sentence, and the sentences in Table 2 are grouped by these predominant schemas. For instance, for the first sentence (sentence 1), the image schema SUPPORT was selected 92 times, the highest number of selections in this row, and is thus the predominant image schema for this sentence. The last two columns represent how many participants selected one category only or more than one for each sentence. For instance, for sentence 6 a total of 81 people (81%) selected only one image schema while only 19 participants selected more than one. Finally, the last row, the total average, presents the total number of times an image schema was selected across sentences divided by the overall total number of selections in the task. Here, the most frequently selected image schema in the task turns out to be FORCE, followed by PATH. The numbering of the sentences represents the order of presentation to the study's participants.
Table 2. The 12 target sentences, grouped by the predominant image schema in the results, with the number of subjects (out of 100) selecting each schema for the individual sentences. The image schemas are CONTAINMENT (C), PATH (P), SUPPORT (S), FORCE (F), PART-WHOLE (PW), and OTHER (O). Columns "1" and "2+" represent the number of participants who selected only one or more than one category for the sentence, respectively.

Sentence                                      C    P    S    F    PW   O    1    2+

Majority SUPPORT
1. "Jim held on to the railing."              3    14   92   36   11   1    61   39
5. "The gecko stuck to the wall."             8    3    80   31   10   7    69   31

Majority FORCE
2. "Lisa kicked the ball."                    5    34   2    100  9    -    62   38
4. "Michelle threw up her lunch."             44   33   1    80   15   3    51   49
8. "Bill was hit by a car."                   6    27   3    96   7    2    70   30
12. "Joe swung his fist at David."            3    36   2    91   37   4    47   53

Majority PATH
3. "Matthew flew home from Los Angeles."      43   95   8    15   3    -    56   44
6. "Robert returned home from downtown."      7    99   1    6    9    1    81   19

Majority CONTAINMENT & PART-WHOLE
7. "Charles ate a hamburger."                 59   20   3    42   19   14   64   36
9. "Amy took a deep breath."                  57   19   5    47   28   7    59   41
10. "Stephanie bled from a cut on her leg."   41   33   1    25   56   7    58   42
11. "Kevin crossed his arms."                 4    17   19   32   69   4    67   33

Total Average (in %)                          15   23   12   32   15   3    62   38

The relative frequencies in Table 2 show that for several sentences one specific image schema turned out to be predominant. For instance, 100% of all participants selected FORCE for the second sentence, "Lisa kicked the ball". However, for some sentences no particular image schema was dominant; sentences 7, 9, 10, and 11 show a stronger distribution of answers across up to three image schemas. To provide a deeper understanding of these results, we analyze the participants' explanations offered alongside their annotations.
3.2. Interpreting Natural Language Explanations

In this section, we discuss a first interpretation of the natural language explanations provided by participants to justify their selection of image schema(s) for a specific sentence. Our discussion is guided by the grouping of sentences by image schema in Table 2 and based on a codification of the results into categories generated in the annotation process. For instance, we used categories such as "body is a container" to classify all explanations that justify a CONTAINMENT selection in this way, e.g. "Amy's body is a container for the air she inhaled" in reference to sentence 9. Our focus is on the most frequently selected image schemas in each group.

Explanations in the first group, SUPPORT, are highly homogeneous, with all participants for both sentences agreeing that the predominant schema is SUPPORT because there is contact with an object (railing, wall) that provides support. The type of FORCE applied is considered physical strength of the body, human or gecko, by all participants. The PART-WHOLE explanations are the most varied: half of the participants consider body parts (the hand of Jim or the gecko) as part of the whole, the body. The remaining half in sentence 1 refers to the railing being part of a path or a ship, while the remaining half in sentence 5 refers to the gecko as a complex being composed of parts. For PATH in sentence 1, the majority considered the railing itself to constitute some kind of path and only four stated that it is a person moving along a path.

In the second group the predominant schema is FORCE, for which the explanations in each of the four sentences correspond. In sentences 2, 4, and 12 the selection is attributed to physical strength applied by the actor or their body parts in the described situation, whereas in sentence 8 the strength is assigned to the car that hit the actor.
The second most significant schema here is PATH, which for sentences 2, 4, and 8 is attributed to an object (ball, lunch, car) moving along a path, while in sentence 12 it is a body part that moves (the fist). The frequently selected CONTAINER in sentence 4 is justified with Michelle's body or her stomach.

As regards the group of predominantly PATH-annotated sentences, the explanations mainly refer to a person moving along a path, with some exceptions that describe the path as an abstract entity without detailing the object or person moving. The CONTAINER in sentence 3 is in all but 2 cases the airplane, where the two cases refer to the actor being a container himself. The agreement in explanations for this group is reflected in the κ values below.

In the last group the most varied selection of categories and explanations can be found, which requires a more detailed analysis. For sentences 7, 9, and 10, annotators agree that the body (or its parts, such as the lungs) functions as a CONTAINER. This act of becoming contained (hamburger, air) is associated with FORCE. This association with force is considerably more frequent with the expulsion of the containee in sentence 4. The annotators selecting PATH in sentences 7, 9, and 10 considered the way from the outside to the inside of the container as traveling along a PATH, much in line with a formalization of the CONTAINMENT schema [11]. For PART-WHOLE in sentence 7, the annotators provide different answers: 3 stating that the hamburger has parts, 7 considering the hamburger to become part of the whole of Charles, and 9 stating that Charles used parts of his body (mostly the mouth, 2 say the arms) to eat the hamburger. In the explanations of the 14 times that "Other" was selected, people mostly suggested an additional category called "consumption" or "energy".

When we examine which image schemas took second place in the sentence annotations, "Amy took a deep breath" had the highest second-place percentage.
In this case the FORCE image schema, with 47%, came in second place only to CONTAINMENT, with 57%. All FORCE explanations are related to physical strength, where a small proportion of subjects explicitly assigns the force of inhaling to the lungs. All annotators selecting CONTAINMENT referred to the body or a body part as the CONTAINER. Finally, annotators differentiated between body parts being part of a whole and the air becoming part of the body when selecting PART-WHOLE. For "Other", people mostly suggested a new category of "living being" or only stated that none of the other categories fit in their mind.

A majority of participants marked PART-WHOLE as a match for the sentence "Stephanie bled from a cut on her leg", while nearly two-thirds (69) did for "Kevin crossed his arms". For sentence 10 the predominant justification is that the leg is a part of the body, whereas in sentence 11 the predominant argumentation is that a person is a whole with many parts. While describing two different perspectives, the underlying idea matches. For 19 of 33 participants, CONTAINMENT and PATH belong together in sentence 10, since the blood leaves the CONTAINER along a path. The remaining 14 participants selected PATH but not CONTAINMENT and state that an object (blood) travels along a PATH when leaving. The 25 annotators selecting FORCE in sentence 10 explained that outer forces, such as gravity, pull the blood out (3), that bodily functions force the blood out (11), or that FORCE had to be applied in order for the cut to come into existence (11). To summarize, participants to some degree agree that the container of her body/leg is forcefully disrupted by a cut that causes parts of the containee to leave the container along a path. In the explanations of "Other", people suggest the introduction of an additional category of "involuntary action" or "injury" because of the cut.
In the case of sentence 11, arms are considered to move along a path (17), and by interlinking them they create a support structure with the chest (19). In addition, 30 of 32 participants stated that it takes physical FORCE to cross one's arms, while 2 stated that the resulting posture is a defiant and forceful one.

3.3. Inter-Rater Reliability

To statistically measure agreement in our data, we calculate the inter-rater reliability based on the kappa proposed by Fleiss [5]. Since more than one category can be selected for each sentence and we have multiple raters, we decided to calculate this kappa value on a per-image-schema basis, represented in Table 3. Each column represents one image schema, and the single row gives the Fleiss' kappa calculated over all sentences. The highest agreement can be found for SUPPORT, followed by FORCE, and the lowest for PART-WHOLE.

Table 3. Fleiss' kappa per image schema

     CONTAINER   PATH    SUPPORT   FORCE   PART-WHOLE   all
κ    0.268       0.357   0.639     0.392   0.223        0.266

While we believe that these results are interesting in terms of the understandability of individual image schemas, it might also be the case that the design as a multiclass-classification problem negatively impacts the results, since one rater could select more than one category. To account for this problem we apply Kraemer's method [15] to turn our multiclass classifications into a ranked ordinal set and then calculate the Kendall rank correlation coefficient [14], which returns a correlation of 0.405 as a second measure for comparison and indicates a moderate agreement for the whole dataset.

Table 4. Fleiss' kappa per sentence

     1     2     3     4     5     6     7     8     9     10    11    12
κ    0.51  0.66  0.56  0.35  0.40  0.76  0.17  0.62  0.18  0.17  0.26  0.48

To complete the calculation of agreement in our dataset, we adapt the per-category measurement of Fleiss' kappa to the sentential level, depicted in Table 4. This gives a considerably higher agreement than the per-image-schema basis. The sentences with the lowest values are sentences 7, 9, 10, and 11.
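The rank conversion behind this second measure can be illustrated as follows. This is our sketch of the general idea for a single pair of raters and one sentence, not the authors' exact procedure: each rater's selections become an ordinal vector over the six categories (selected schemas share the top rank, unselected ones the bottom rank), and agreement is measured with Kendall's tau-b, which corrects for the many ties the conversion produces.

```python
# Sketch of converting a multi-label selection into ranks and comparing two
# raters with Kendall's tau-b (our illustration, not the paper's code).
from math import sqrt

SCHEMAS = ["CONTAINER", "PATH", "SUPPORT", "FORCE", "PART-WHOLE", "OTHER"]

def to_ranks(selected):
    """Selected schemas share the top rank (1); unselected share rank 2."""
    return [1 if s in selected else 2 for s in SCHEMAS]

def kendall_tau_b(x, y):
    """Kendall's tau-b over two rank vectors, with tie correction."""
    conc = disc = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue              # tied in both: excluded from all terms
            elif dx == 0:
                ties_x += 1           # tied in x only
            elif dy == 0:
                ties_y += 1           # tied in y only
            elif dx * dy > 0:
                conc += 1             # concordant pair
            else:
                disc += 1             # discordant pair
    return (conc - disc) / sqrt((conc + disc + ties_x) * (conc + disc + ties_y))

# Two hypothetical raters annotating "Amy took a deep breath."
rater_a = to_ranks({"CONTAINER", "FORCE"})
rater_b = to_ranks({"CONTAINER"})
print(round(kendall_tau_b(rater_a, rater_b), 3))  # 0.632
```

Averaging such pairwise coefficients over all rater pairs and sentences gives a single dataset-level agreement figure of the kind reported above.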
These low-agreement sentences (7, 9, 10, and 11) all fall into the CONTAINMENT and PART-WHOLE group in Table 2 and show the strongest indication of image schema collocations. The highest agreements are achieved for sentences where all (sentence 2) or almost all (sentence 6) participants selected a specific image schema. In fact, the four sentences with the highest agreement (sentences 6, 2, 8, and 3 in order of κ score) obtain the highest numbers of selections for a single image schema (category per sentence) and are annotated with the most frequently selected image schemas overall, that is, FORCE (32%) and PATH (23%). It seems that the sentences in the last group of Table 2 are the most controversial and most difficult to annotate.

4. Discussion

The design of our crowdsourcing task allowed participants to select more than one image schema per sentence, which was a frequently utilized option. One possible explanation for the annotation of a sentence with multiple image schemas is that participants conceived a conceptual collocation of image schemas, which we can confirm from the explanations. For instance, "Amy took a deep breath" was perceived as a combination of CONTAINMENT and FORCE in most cases, where the air enters the container represented by Amy's lungs or her body, and the movement into the body requires FORCE. Such movements in and out of the body are frequently also collocated with PATH, along which the objects (air, food, people if the container is an object, etc.) enter or leave the container. An option to explicitly indicate the semantics of a sentence with a collocation of image schemas was not considered in this task, but would be important future work. To quantify this phenomenon: on average, 38% of all annotators selected more than one image schema per sentence (see Table 2). This suggests that specific sentences are perceived as grounded in a collocation of image schemas.
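The per-sentence statistics underlying this observation (the selection counts, the predominant schema, and the "1" versus "2+" split of Table 2) can be derived mechanically from the raw multi-label responses. The sketch below uses toy data, not the study's actual responses; each participant contributes a set of selected image schemas per sentence.

```python
# Deriving Table-2-style statistics from raw multi-label annotations
# (toy responses for illustration, not the study's data).
from collections import Counter

SCHEMAS = ["CONTAINER", "PATH", "SUPPORT", "FORCE", "PART-WHOLE", "OTHER"]

def summarize(annotations):
    """annotations: list of sets, one per participant, for a single sentence."""
    counts = Counter(s for sel in annotations for s in sel)
    single = sum(1 for sel in annotations if len(sel) == 1)  # "1" column
    multi = len(annotations) - single                        # "2+" column
    predominant = max(SCHEMAS, key=lambda s: counts[s])
    return counts, single, multi, predominant

# Toy responses for a sentence like "Jim held on to the railing."
responses = [{"SUPPORT"}, {"SUPPORT", "FORCE"}, {"SUPPORT"}, {"PATH", "SUPPORT"}]
counts, single, multi, predominant = summarize(responses)
print(predominant, single, multi)  # SUPPORT 2 2
```

Running this per sentence over all participants yields the count columns of Table 2 directly, and the `single`/`multi` split quantifies how often collocations were annotated.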
These findings have implications for the utilization of crowdsourcing to obtain large corpora of annotated natural language sentences, namely that participants of such studies should be given the option to select more than one image schema for their annotation of natural language sentences. A large-scale dataset with annotated image schemas could be highly useful to boost the field of image schemas and the utilization of machine and deep learning applications.

As an additional comparison, we evaluate the results in the light of previous annotations of the same sentences by experts provided in Macbeth et al. [19]. In that previous study, image schemas were mapped to Conceptual Dependency primitives utilizing the same set of sentences as the study presented herein, which indirectly provided an expert annotation of those sentences. Experts and crowd agree on the sentences primarily annotated with SUPPORT (sentences 1 and 5), PATH (sentences 3 and 6), and FORCE (sentences 2 and 8). Experts and crowd also agree on a stronger image schema collocation in sentence 4, namely a combined annotation of PATH, CONTAINER, and FORCE. For sentence 12, experts see PATH and PART-WHOLE as the predominant schemas, while the crowd mainly annotated the sentence with FORCE, followed by PATH and PART-WHOLE. Finally, for the last group of sentences in Table 2 the experts annotate sentences 7, 9, and 10 with CONTAINER, PATH, and FORCE, and sentence 11 with PATH and PART-WHOLE. In sentences 9 and 10, the crowd is less interested in PATH, and for sentence 10 the predominant image schema is PART-WHOLE. The schemas selected by the experts are also considered by the crowd, but with less importance. Sentence 11 shows an agreement on PART-WHOLE, but not on PATH, which is assigned considerably less importance by the crowd than FORCE.
To sum up this comparison, the biggest discrepancies can be seen in sentence 12, where the experts ignore FORCE, and in the sentences of the CONTAINMENT and PART-WHOLE group in Table 2, where PATH and PART-WHOLE are assigned different significance by the two groups. Nevertheless, the overall agreement of experts and crowd provides further validity to the task and setup of this crowdsourcing study, even though this point has to be subjected to further large-scale crowdsourcing experiments with larger and more varied sets of sentences.

A discussion on crowdsourcing image schemas would not be complete without explicitly addressing several lessons learned. In this study, a comparatively small corpus of sentences was annotated to ensure a significant number of annotations per sentence. This can potentially inhibit generalizations to some extent. Furthermore, the chosen set of sentences mainly describes concrete sensorimotor experiences that participants might have experienced themselves at one point or another. It is desirable to repeat the experiment with a larger and more abstract set of sentences. The task setup could also be improved with a view to facilitating the evaluation of the obtained results. Instead of allowing for multiple selections, a ranking of the answers from the most important to the least important schema for a specific sentence could considerably facilitate statistical processing and provide more expressive annotations. One potential method to this end could be best-worst scaling [18] or a simple ranking of the chosen image schemas for each individual sentence. However, we believe that a more substantial change of the task is needed to truly account for an expressive annotation of perceived image schema collocations. This could be done by explicitly allowing participants to describe the semantics of a sentence by a combination of image schemas, which could lead to very interesting results.
Several implications follow from the detailed analysis of the results of this study. It can be safely stated that image schemas are useful heuristics for the interpretation of natural language sentences, for experts as much as for non-experts. A certain degree of agreement in annotations (between crowd workers, and between crowd and experts) shows that image schema annotations of natural language can be performed without a background in cognitive linguistics. This agreement also implicitly validates our natural language descriptions of image schemas, since a sufficient and homogeneous understanding of these descriptions is required to reach any agreement on the annotations. Those validated descriptions mark one major contribution of this paper, since it is generally considered difficult to describe abstract cognitive building blocks to non-experts. Nevertheless, the proposed set of descriptions would benefit from further experiments and especially from extensions, since it currently covers only five image schemas.

5. Conclusion

Empirical studies of image schemas involving human subjects have been challenging due to their highly abstract nature, and only a few studies have attempted to explain image schemas to non-experts, a prerequisite for such tasks. To the best of our knowledge, this is the first study that proposes the use of crowdsourcing to annotate natural language sentences with image schemas, which also required explaining image schemas to naive subjects. The high agreement of expert and crowd annotations speaks in favor of the proposed method for image schema annotations of natural language. Our results show collocations of image schemas for individual sentences, that is, multiple image schemas are chosen for each sentence, which has ramifications for using crowdsourcing to gather labeled training data for machine learning.
While the current set of sentences was restricted to ensure sufficient annotations for each sentence, it still shows high agreement among annotators regarding the image-schematic content of sentences. The results also hint at certain common combinations of image schemas, such as SUPPORT and FORCE, which, however, require further large-scale investigations to allow for generalization. In addition, we envision extending the type of sentences to be annotated from strictly physical movements to more abstract ones, and testing the annotation task on crowds of different languages.

References

[1] Mousumi Banerjee, Michelle Capozzoli, Laura McSweeney, and Debajyoti Sinha. Beyond kappa: A review of interrater agreement measures. Canadian Journal of Statistics, 27(1):3–23, 1999.
[2] Lawrence W Barsalou. Grounded cognition. Annu. Rev. Psychol., 59:617–645, 2008.
[3] Daniel Casasanto and Sandra Lozano. The meaning of metaphorical gestures. In Metaphor and Gesture, Gesture Studies. Amsterdam: John Benjamins Publishing, 2007.
[4] Alan Cienki. Image schemas and gesture. From Perception to Meaning: Image Schemas in Cognitive Linguistics, 29:421–442, 2005.
[5] Joseph L Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378, 1971.
[6] Orly Fuhrman, Kelly McCormick, Eva Chen, Heidi Jiang, Dingfang Shu, Shuaimei Mao, and Lera Boroditsky. How linguistic and cultural forces shape conceptions of time: English and Mandarin time in 3D. Cognitive Science, 35(7):1305–1328, 2011.
[7] Raymond W Gibbs Jr, Dinara A Beitel, Michael Harrington, and Paul E Sanders. Taking a stand on the meanings of stand: Bodily experience as motivation for polysemy. Journal of Semantics, 11(4):231–251, 1994.
[8] Dagmar Gromann and Maria M Hedblom. Body-mind-language: Multilingual knowledge extraction based on embodied cognition. In AIC, pages 20–33, 2017.
[9] Dagmar Gromann and Maria M. Hedblom.
Kinesthetic mind reader: A method to identify image schemas in natural language. In Proceedings of Advancements in Cognitive Systems, 2017.
[10] Beate Hampe. Image schemas in cognitive linguistics: Introduction. From Perception to Meaning: Image Schemas in Cognitive Linguistics, 29:1–14, 2005.
[11] Maria M. Hedblom, Dagmar Gromann, and Oliver Kutz. In, out and through: Formalising some dynamic aspects of the image schema containment. In Stefano Bistarelli, Martine Ceberio, Francesco Santini, and Eric Monfroy, editors, Proceedings of the Knowledge Representation and Reasoning Track (KRR) at the Symposium of Applied Computing (SAC), pages 918–925, 2018.
[12] Maria M. Hedblom, Oliver Kutz, and Fabian Neuhaus. Choosing the right path: Image schema theory as a foundation for concept invention. Journal of Artificial General Intelligence, 6(1):22–54, 2015.
[13] Mark Johnson. The Body in the Mind: The Bodily Basis of Meaning, Imagination, and Reason. The University of Chicago Press, Chicago and London, 1987.
[14] Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
[15] Helena Chmura Kraemer. Extension of the kappa coefficient. Biometrics, pages 207–216, 1980.
[16] Klaus Krippendorff. Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research, 30(3):411–433, 2004.
[17] George Lakoff. Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. The University of Chicago Press, 1987.
[18] Jordan J Louviere and George G Woodworth. Best-worst scaling: A model for the largest difference judgments. University of Alberta: Working Paper, 1991.
[19] Jamie Macbeth, Dagmar Gromann, and Maria M. Hedblom. Image schemas and conceptual dependency primitives: A comparison. In Proceedings of the Joint Ontology Workshops (JOWO). CEUR, 2017.
[20] Jamie C. Macbeth and Marydjina Barionnette. The coherence of conceptual primitives.
In Proceedings of the Fourth Annual Conference on Advances in Cognitive Systems. The Cognitive Systems Foundation, June 2016.
[21] Jean M Mandler. How to build a baby: II. Conceptual primitives. Psychological Review, 99(4):587, 1992.
[22] Jean M. Mandler and Cristóbal Pagán Cánovas. On defining image schemas. Language and Cognition, pages 1–23, 2014.
[23] Anna Papafragou, Christine Massey, and Lila Gleitman. When English proposes what Greek presupposes: The cross-linguistic encoding of motion events. Cognition, 98(3):B75–B87, 2006.
[24] Juan Antonio Prieto Velasco and Maribel Tercedor Sánchez. The embodied nature of medical concepts: Image schemas and language for pain. Cognitive Processing, 2014.
[25] Roger C Schank. Conceptual Information Processing. Elsevier, New York, NY, 1975.
[26] Leonard Talmy. The fundamental system of spatial schemas in language. In Beate Hampe and Joseph E Grady, editors, From Perception to Meaning: Image Schemas in Cognitive Linguistics, volume 29 of Cognitive Linguistics Research, pages 199–234. Walter de Gruyter, 2005.