Predicting the Concreteness of German Words

Jean Charbonnier and Christian Wartena
Hochschule Hannover
Expo Plaza 12, 30539 Hannover, Germany
{jean.charbonnier, christian.wartena}@hs-hannover.de

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Concreteness of words has been measured and used in psycholinguistics for decades. Recently, it has also been used in retrieval and NLP tasks. For English, a number of well-known datasets with average values for perceived concreteness have been established. We give an overview of the available datasets for German and their correlation, and we evaluate prediction algorithms for the concreteness of German words. We show that these algorithms achieve results similar to those reported for English datasets. Moreover, we show that for all datasets there are no significant differences between a prediction model based on a regression model using word embeddings as features and a prediction algorithm based on word similarity according to the same embeddings.

1 Motivation

A number of properties of words, mainly of semantic nature, have been studied and used in psycholinguistic research for decades. These properties, often referred to as (affective) word norms, include concreteness, imagery[1], age of acquisition, valence, and arousal. In the present work we focus on concreteness. Friendly et al. (1982) define concrete words as words that "refer to tangible objects, materials or persons which can be easily perceived with the senses". Similarly, Brysbaert et al. (2014) define concreteness as the degree to which the concept denoted by a word refers to a perceptible entity, but found that subjects largely rated the haptic and visual experiences, even if they were explicitly asked to take into account experiences involving any senses.

[1] Most authors seem to use the term imagery, while others also use imageability and visualness. In German the term Bildhaftigkeit is the most common one, while Vorstellbarkeit is also found. We will use imagery throughout this paper.

Concreteness seems to play an important role in human language processing (Borghi et al., 2017). Concreteness has also been used for various computational linguistic tasks like the detection of metaphors and non-literal language (Turney et al., 2011; Hill and Korhonen, 2014; Frassinelli and Schulte im Walde, 2019), lexical simplification (Jauhar and Specia, 2012), multimodal retrieval (Hessel et al., 2018) or estimating the stability of word embeddings (Pierrejean and Tanguy, 2019).

Traditionally, word norms are obtained by asking subjects to estimate the value for each property on a Likert scale. Recently, various approaches have also been proposed to predict the concreteness of words. On three different datasets we will test two algorithms that have given very good results for English data and compare the results in section 4, after we have discussed the most common approaches to predicting word concreteness (section 2) and presented the concreteness data available for German (section 3).

2 Related work

We find basically three approaches to predict the concreteness of a word: (1) adopting the concreteness value from similar, related or neighboring words; (2) identifying a dimension in word embeddings that corresponds to concreteness; (3) training a regression model on features of words.

2.1 Adopting concreteness of related words

Liu et al. (2014) predict values for imagery, a word norm that strongly correlates with concreteness, by using the values from synonyms and hypernyms found in WordNet.

Rabinovich et al. (2018) predict the concreteness of words indirectly by assigning a concreteness value to sentences in which a word occurs. The concreteness value of a sentence is based on the presence of seed words. The set of seed words is constructed by selecting words with derivational suffixes that are typical for highly abstract nouns. The correlation between the predicted values and the manually assigned values from various subsets of the dataset from Brysbaert et al. (2014) and the 4,295 concreteness values[2] from the MRC (Medical Research Council) Psycholinguistic Database (Coltheart, 1981) ranges from 0.66 to 0.74.

[2] The current version of MRC has concreteness values aggregated from different sources for 8,288 words. We assume that a previous version provided concreteness values for 4,295 words.

Turney et al. (2011) compute the degree of concreteness of a word as the sum of the similarities between the word and n concrete paradigm words minus the sum of the similarities between the word and n abstract paradigm words. The paradigm words are found as follows: first, one concrete and one abstract paradigm word are selected such that the correlation between the concreteness values for all words in the training data and the values predicted using the similarity with these two words is maximized. Then a second concrete and a second abstract word are added that again maximize the correlation. This process is repeated until n abstract and n concrete words are found. Turney et al. (2011) limit the selection to 20 abstract and 20 concrete paradigm words. Using half of the MRC data for training and half for testing, they found a Spearman correlation coefficient of 0.81 between predicted and observed concreteness values. To compute the similarity between words they use count-based word embeddings of 1000 dimensions trained on a 5·10^10 word web corpus.

The same approach is followed by Köper and Schulte im Walde (2016) to predict concreteness values for German words using word vectors trained with word2vec (Mikolov et al., 2013) on the DE-COW14AX German web corpus. For training and testing they merged concreteness values from Kanske and Kotz (2010) (called Leipzig Word Norms below) and Lahl et al. (2009) (called WWN below) and in addition added translations from sets of English word norms for training. 90% of the data were used for training, 10% for testing. The Pearson correlation between the test data and the predicted values for concreteness/abstractness was 0.825.

2.2 Concreteness in word embeddings

Rothe et al. (2016) try to find low-dimensional feature representations of words in which at least some dimensions correspond to interpretable properties of words. One of these dimensions is concreteness. For training and testing they use Google News embeddings and two subsets of frequent words from the norms of Brysbaert et al. (2014). For their test set of 8,694 frequent words they found a moderate correlation with the human judgments (Kendall's τ = 0.623). Similarly, Hollis and Westbury (2016) examined which dimensions of word embeddings correlate with one of the classical word norms. They found no direct correlations, but after reducing the number of dimensions for a set of words by applying Singular Value Decomposition, they found a strong correlation between one of the dimensions and concreteness.

2.3 Regression models for concreteness

Tanaka et al. (2013) train a regression model to predict concreteness values. As features they use a small number of manually constructed co-occurrence features, like co-occurrence with sense verbs. For training and evaluation they use a subset of 3,455 nouns from the MRC Database. Pearson's correlation and Kendall's τ between the values from the database and their predictions are 0.688 and 0.508, respectively.

Paetzold and Specia (2016) train a regression model to predict four word norms, among which concreteness. Like many other studies, they use the data from the MRC database. As features they use word embeddings trained on a set of various large corpora and a number of word features extracted from WordNet. For each word norm they use half of the words to train the model and half of the words for evaluation. For concreteness they find a Pearson correlation coefficient of 0.869.

Ehara (2017) trains regression models to predict four word norms for Japanese and English words. As features they use word embeddings trained with word2vec and a probability distribution of words over topics found using Latent Dirichlet Allocation. They use a subset of 1,842 words from the MRC data, of which 1,342 words are used for training and 500 for testing. When both feature sets are trained on the British National Corpus (BNC) and used in combination, the best regression model gives a Pearson correlation of 0.87 and a Spearman correlation coefficient of 0.876 on the test data.

Ljubešić et al. (2018) used a regression model as well, with pre-trained fastText word embeddings (Mikolov et al., 2018). They found a Spearman correlation coefficient of 0.887 between the predicted concreteness values and the values from Brysbaert et al. (2014), and a Spearman correlation of 0.872 on the MRC data, in both cases using 3-fold cross validation. A similar result was found by Charbonnier and Wartena (2019), who reach a Pearson correlation coefficient of 0.91 on the data from Brysbaert et al. (2014), using the same vectors and 10-fold cross validation. Here a minor improvement could be realized using part of speech and frequent suffixes as additional features.

Though all studies use different data and different versions of the MRC Psycholinguistic Database, use different splits and different numbers of folds for cross validation, and finally use different correlation coefficients, all studies report very similar results. The correlations that are found are all in the range of correlations found between various sets of concreteness values (see Charbonnier and Wartena, 2019, Table 2).

3 Data

Both for English and for German, various word norms with concreteness values have been created, though some are quite small and only available as printed supplements to older publications.

The dataset created by Baschek et al. (1977) and Wippich and Bredenkamp (1979) consists of 1,698 words (800 nouns, 400 adjectives, 498 verbs) and is one of the oldest and still one of the largest word norms for German. We will refer to this dataset as the Göttingen Word Norms. We removed 40 verbs containing an underscore, especially all reflexive verbs (e.g. sich_wünschen; to wish), from the dataset. For a number of words the experiment was repeated and two values are given. We only use the first value in these cases.

Lahl et al. (2009) collected values for 2,654 words using crowdsourcing to build a dataset called the Web Word Norms (WWN). For the WWN, 3,907 subjects committed 190,212 ratings, each for at most 50 words. On average each word has 24 ratings. They used an 11-point scale where 0 stands for the most concrete and 10 for the least concrete judgment.

Kanske and Kotz (2010) collected ratings for valence, arousal and concreteness for 1,000 nouns. This dataset is known as the Leipzig Affective Word Norms. Only nouns were used to reduce the variance other word classes would introduce. The experiment was done in 2006 with 32 native speakers. On two separate days the participants rated the words 3 times on a 9-point scale, each time for one of the three ratings. This was repeated 2 years later with two groups, one with 22 repeating participants from 2006 and a second with 32 fresh participants. The words were collected from the Duden dictionary and a previous word list by the same authors. Only 1- and 2-syllable words and no compound nouns were allowed.

The Berlin Word Norms (Vo et al., 2009) and the word norms determined by Schmidtke et al. (2014) contain values for valence, arousal and imagery, but no values for concreteness. Some more word norms for German, including concreteness, are published by Hager and Hasselhorn (1994).

3.1 Merged Dataset

In order to have a larger dataset for German, providing more training data for supervised prediction algorithms, we created a merged dataset.

The overlap of the datasets is quite small (see Table 1); the correlation between the values for the overlapping parts, however, is high (around 0.9). Since the Leipzig Word Norms use low values for concrete and high values for abstract words, the correlation between this dataset and the others is negative.

For the merged dataset we use the 7-point scale where 1 means abstract and 7 means concrete. We do not simply rescale the values but use linear regression on the overlapping parts such that the values for the words in the overlapping parts are as close as possible. We take the values from the Göttingen Word Norms as an anchor and transform the other values using the slope and the intercept. The transformed concreteness thus is defined as

C' = α + βC    (1)

where C is the original value. For WWN α = 0.776 and β = 0.608, and for the Leipzig Word Norms α = 7.39 and β = −0.540. Finally, we take the average over all datasets if a word is present in more than one source. The dataset thus offers empirical concreteness values for 4,182 German words. In Table 1 we see the high correlation of the values in the merged data with those in the original datasets. The merged dataset can be downloaded from http://textmining.wp.hs-hannover.de/datasets.html

Table 1: Size of the intersections and the Pearson correlation between the concreteness values in the datasets. As the Merged set is a composition of the other datasets, the intersection is always equal to the size of the other dataset.

                   Merged            Göttingen WN      WWN
                 Inters.  Correl.   Inters.  Correl.  Inters.  Correl.
  Göttingen WN    1698     0.997
  WWN             2654     0.969      680     0.900
  Leipzig WN      1000    -0.985      127    -0.928     488    -0.875

4 Methods

For each dataset we use two methods to predict the concreteness values in a five-fold cross validation scheme. First, we use the method of Turney et al. (2011) described above in section 2. Following Turney et al. (2011) and Köper and Schulte im Walde (2016), we use 20 abstract and 20 concrete prototype words. As a second method we use Support Vector Regression (Drucker et al., 1997) and grid search to find optimal hyperparameters (γ = 1, C = 10 with an RBF kernel). As features we use the pre-trained word embeddings from fastText for German (Grave et al., 2018).

All tests were done using 5-fold cross validation. We use stratified sampling for the Göttingen WN and the Merged dataset to ensure that each fold has the same number of nouns, verbs and adjectives. For the other datasets we use random splits.

5 Results and Discussion

The results for all datasets and both methods are given in Table 2. We see in general very high correlation values for all datasets and both methods. All correlation values are in a similar range as the correlations between the datasets.

Table 2: Results of 5-fold cross validation using different methods for all datasets. All results are averaged Pearson correlation coefficients. For Turney we used 20 words per class.

                 Merged           Göttingen WN     WWN              Leipzig WN
  SVR            0.861 (±0.026)   0.862 (±0.040)   0.851 (±0.023)   0.890 (±0.027)
  Turney et al.  0.849 (±0.012)   0.842 (±0.033)   0.851 (±0.020)   0.901 (±0.017)

We can make some interesting observations. The first remarkable fact is that for all datasets there is no significant difference between the results from the method of Turney et al. (2011) and the regression model. As far as we know, these methods have not been compared directly before. This result is quite surprising, since there are many aspects of the meaning of a word that determine word similarity, and all of these aspects are used to find the similar words on which the concreteness prediction is based in the method of Turney et al. It has to be noted that the search for the prototype words in Turney's method is extremely slow and not feasible for large datasets.

Furthermore, we see that our implementation of the method of Turney et al. gives slightly better results for WWN and the Leipzig Word Norms than the result found by Köper and Schulte im Walde (2016), who used a random split of the union of those two datasets (0.844 and 0.891 vs. 0.825). Besides the possibility that they have chosen a disadvantageous split, we see two differences: in the first place, we used different word embeddings to compute the word similarity. Secondly, they added concreteness values from English datasets with German translations to the training data. This is only helpful if concreteness is invariant under translation, which might not be the case.

6 Conclusions and Future Work

Datasets with concreteness values for German are smaller and less easily accessible than those for English. One contribution of the present work is that we aggregated a consistent dataset with over 4,000 concreteness ratings from three different sources.

A possibility to obtain more concreteness ratings is to train a model on available ratings and predict ratings for other words. We show that prediction methods that previously have been tested only for English yield similar results for German. Moreover, we show that two of the best available methods, which have not been compared on the same data before, yield similar results with no significant differences on four different datasets.

In the near future we will extend the merged dataset with values from some smaller and older studies.
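To make the scoring step of the similarity-based method of Turney et al. (2011) concrete, it can be sketched as follows: a word's score is its summed cosine similarity to the concrete paradigm words minus its summed similarity to the abstract paradigm words. The tiny two-dimensional vectors and word lists below are invented purely for illustration; the experiments in this paper use 300-dimensional fastText embeddings, and the greedy search for paradigm words is omitted.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def concreteness_score(word, vectors, concrete_paradigm, abstract_paradigm):
    """Turney-style score: summed similarity to the concrete paradigm
    words minus summed similarity to the abstract paradigm words."""
    v = vectors[word]
    pos = sum(cosine(v, vectors[p]) for p in concrete_paradigm)
    neg = sum(cosine(v, vectors[a]) for a in abstract_paradigm)
    return pos - neg

# Toy 2-dimensional "embeddings" (invented for illustration).
vectors = {
    "stone":   (1.0, 0.1),
    "table":   (0.9, 0.2),
    "freedom": (0.1, 1.0),
    "idea":    (0.2, 0.9),
    "apple":   (0.95, 0.15),
    "justice": (0.15, 0.95),
}

concrete = ["stone", "table"]
abstract = ["freedom", "idea"]
score_apple = concreteness_score("apple", vectors, concrete, abstract)
score_justice = concreteness_score("justice", vectors, concrete, abstract)
```

With these toy vectors, "apple" receives a clearly positive score and "justice" a clearly negative one, mirroring the intended concrete/abstract ordering.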
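The merging procedure of section 3.1 (equation 1) amounts to fitting the slope β and intercept α by ordinary least squares on the overlapping words, mapping the second dataset onto the anchor's scale, and averaging where a word occurs in both. A minimal sketch, with invented toy ratings in place of the real Göttingen anchor and an inverted toy scale standing in for a dataset like the Leipzig Word Norms:

```python
def fit_linear(xs, ys):
    # Ordinary least squares for y ≈ alpha + beta * x.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx
    return alpha, beta

def merge(anchor, other):
    """Map 'other' onto the scale of 'anchor' via C' = alpha + beta * C
    (equation 1), fitting alpha and beta on the overlapping words,
    then average the values of words present in both datasets."""
    overlap = sorted(set(anchor) & set(other))
    alpha, beta = fit_linear([other[w] for w in overlap],
                             [anchor[w] for w in overlap])
    merged = dict(anchor)
    for w, c in other.items():
        c_t = alpha + beta * c
        merged[w] = (merged[w] + c_t) / 2 if w in merged else c_t
    return merged

# Invented toy ratings: anchor on a 7-point scale (7 = concrete),
# 'other' on an inverted scale (low = concrete).
anchor = {"stone": 6.8, "idea": 1.5, "table": 6.5}
other = {"stone": 1.0, "idea": 9.0, "table": 1.4, "freedom": 8.5}
merged = merge(anchor, other)
```

Because the fitted slope is negative for the inverted scale, the transformed value for "freedom" lands at the abstract end of the anchor's 7-point scale, as intended.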
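The evaluation protocol of section 4, k-fold cross validation with an averaged Pearson correlation, can be sketched in a self-contained way. The stand-in "model" below simply reads off a precomputed noisy feature; it replaces the SVR and paradigm-word predictors, and the pseudo-words and gold ratings are invented for illustration (the paper additionally stratifies the Göttingen and Merged folds by part of speech, which is not shown here).

```python
import math
import random
from statistics import mean

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length lists.
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs)
                    * sum((y - my) ** 2 for y in ys))
    return num / den

def kfold_pearson(words, gold, make_model, k=5, seed=0):
    """Average Pearson correlation over k random folds.
    make_model(train_words) must return a word -> prediction callable."""
    order = list(words)
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [w for j, fold in enumerate(folds) if j != i for w in fold]
        model = make_model(train)
        scores.append(pearson([model(w) for w in test],
                              [gold[w] for w in test]))
    return mean(scores)

# Invented toy data: 40 pseudo-words with 7-point gold ratings
# and a deterministically noisy feature correlated with them.
rng = random.Random(1)
words = ["w%d" % i for i in range(40)]
gold = {w: 1 + (i % 7) for i, w in enumerate(words)}
feature = {w: gold[w] + rng.uniform(-0.5, 0.5) for w in words}

# The stand-in "model" ignores the training fold and reads the feature.
score = kfold_pearson(words, gold, lambda train: (lambda w: feature[w]))
```

Since the feature is the gold rating plus small noise, the averaged correlation is high but below 1, which is the shape of result reported in Table 2 for the real predictors.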
References

Ilse-Lore Baschek, Jürgen Bredenkamp, Brigitte Oehrle, and Werner Wippich. 1977. Determination of imagery, concreteness and meaningfulness of 800 nouns. Zeitschrift für experimentelle und angewandte Psychologie, 24(3):353–396.

Anna M Borghi, Ferdinand Binkofski, Cristiano Castelfranchi, Felice Cimatti, Claudia Scorolli, and Luca Tummolini. 2017. The challenge of abstract concepts. Psychological Bulletin, 143(3):263.

Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2014. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3):904–911.

Jean Charbonnier and Christian Wartena. 2019. Predicting word concreteness and imagery. In Proceedings of the 13th International Conference on Computational Semantics - Long Papers, pages 176–187.

Max Coltheart. 1981. The MRC Psycholinguistic Database. The Quarterly Journal of Experimental Psychology Section A, 33(4):497–505.

Harris Drucker, Christopher JC Burges, Linda Kaufman, Alex J Smola, and Vladimir Vapnik. 1997. Support vector regression machines. In Advances in Neural Information Processing Systems, pages 155–161.

Yo Ehara. 2017. Language-independent prediction of psycholinguistic properties of words. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 330–336.

Diego Frassinelli and Sabine Schulte im Walde. 2019. Distributional interaction of concreteness and abstractness in verb–noun subcategorisation. In Proceedings of the 13th International Conference on Computational Semantics - Short Papers, pages 38–43, Gothenburg, Sweden. Association for Computational Linguistics.

Michael Friendly, Patricia E. Franklin, David Hoffman, and David C. Rubin. 1982. The Toronto Word Pool: Norms for imagery, concreteness, orthographic variables, and grammatical usage for 1,080 words. Behavior Research Methods & Instrumentation, 14(4):375–399.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Willi Hager and Marcus Hasselhorn, editors. 1994. Handbuch deutschsprachiger Wortnormen. Hogrefe Verlag für Psychologie, Göttingen.

Jack Hessel, David Mimno, and Lillian Lee. 2018. Quantifying the visual concreteness of words and topics in multimodal datasets. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2194–2205, New Orleans, Louisiana. Association for Computational Linguistics.

Felix Hill and Anna Korhonen. 2014. Concreteness and subjectivity as dimensions of lexical meaning. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 725–731.

Geoff Hollis and Chris Westbury. 2016. The principals of meaning: Extracting semantic dimensions from co-occurrence models of semantics. Psychonomic Bulletin & Review, 23(6):1744–1756.

Sujay Kumar Jauhar and Lucia Specia. 2012. UOW-SHEF: SimpLex - lexical simplicity ranking based on contextual and psycholinguistic features. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 477–481.

Philipp Kanske and Sonja A. Kotz. 2010. Leipzig affective norms for German: A reliability study. Behavior Research Methods, 42(4):987–991.

Maximilian Köper and Sabine Schulte im Walde. 2016. Automatically generated affective norms of abstractness, arousal, imageability and valence for 350 000 German lemmas. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2595–2598.

Olaf Lahl, Anja S. Göritz, Reinhard Pietrowsky, and Jessica Rosenberg. 2009. Using the World-Wide Web to obtain large-scale word norms: 190,212 ratings on a set of 2,654 German nouns. Behavior Research Methods, 41(1):13–19.

Ting Liu, Kit Cho, G. Aaron Broadwell, Samira Shaikh, Tomek Strzalkowski, John Lien, Sarah Taylor, Laurie Feldman, Boris Yamrom, Nick Webb, Umit Boz, Ignacio Cases, and Ching-sheng Lin. 2014. Automatic expansion of the MRC psycholinguistic database imageability ratings. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2800–2805, Reykjavik, Iceland. European Language Resources Association (ELRA).

Nikola Ljubešić, Darja Fišer, and Anita Peti-Stantić. 2018. Predicting concreteness and imageability of words within and across languages via word embeddings. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 217–222, Melbourne, Australia. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Gustavo Paetzold and Lucia Specia. 2016. Inferring psycholinguistic properties of words. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 435–440, San Diego, California. Association for Computational Linguistics.

Bénédicte Pierrejean and Ludovic Tanguy. 2019. Investigating the stability of concrete nouns in word embeddings. In Proceedings of the 13th International Conference on Computational Semantics - Short Papers, pages 65–70, Gothenburg, Sweden. Association for Computational Linguistics.

E. Rabinovich, B. Sznajder, A. Spector, I. Shnayderman, R. Aharonov, D. Konopnicki, and N. Slonim. 2018. Learning concept abstractness using weak supervision. ArXiv e-prints.

Sascha Rothe, Sebastian Ebert, and Hinrich Schütze. 2016. Ultradense word embeddings by orthogonal transformation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 767–777. Association for Computational Linguistics.

David S Schmidtke, Tobias Schröder, Arthur M Jacobs, and Markus Conrad. 2014. ANGST: Affective norms for German sentiment terms, derived from the affective norms for English words. Behavior Research Methods, 46(4):1108–1118.

Shinya Tanaka, Adam Jatowt, Makoto P. Kato, and Katsumi Tanaka. 2013. Estimating content concreteness for finding comprehensible documents. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM '13, pages 475–484, New York, NY, USA. ACM.

Peter D. Turney, Yair Neuman, Dan Assaf, and Yohai Cohen. 2011. Literal and metaphorical sense identification through concrete and abstract context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 680–690, Stroudsburg, PA, USA. Association for Computational Linguistics.

Melissa LH Vo, Markus Conrad, Lars Kuchinke, Karolina Urton, Markus J Hofmann, and Arthur M Jacobs. 2009. The Berlin Affective Word List Reloaded (BAWL-R). Behavior Research Methods, 41(2):534–538.

Werner Wippich and Jürgen Bredenkamp. 1979. Bildhaftigkeit und Lernen, volume 78 of Wissenschaftliche Forschungsberichte. Steinkopff-Verlag, Darmstadt.