Predicting the Concreteness of German Words

Jean Charbonnier and Christian Wartena
Hochschule Hannover
Expo Plaza 12, 30539 Hannover, Germany
{jean.charbonnier, christian.wartena}@hs-hannover.de

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Concreteness of words has been measured and used in psycholinguistics for decades. Recently, it has also been used in retrieval and NLP tasks. For English, a number of well-known datasets with average values for perceived concreteness have been established. We give an overview of the available datasets for German and their correlation, and we evaluate prediction algorithms for the concreteness of German words. We show that these algorithms achieve results similar to those reported for English datasets. Moreover, we show that for all datasets there are no significant differences between a prediction model based on a regression model using word embeddings as features and a prediction algorithm based on word similarity according to the same embeddings.

1 Motivation

A number of properties of words, mainly of semantic nature, have been studied and used in psycholinguistic research for decades. These properties, often referred to as (affective) word norms, include concreteness, imagery[1], age of acquisition, valence, and arousal. In the present work we focus on concreteness. Friendly et al. (1982) define concrete words as words that "refer to tangible objects, materials or persons which can be easily perceived with the senses". Similarly, Brysbaert et al. (2014) define concreteness as the degree to which the concept denoted by a word refers to a perceptible entity, but found that subjects largely rated the haptic and visual experiences, even if they were explicitly asked to take into account experiences involving any senses.

[1] Most authors seem to use the term imagery, while others also use imageability and visualness. In German the term Bildhaftigkeit is the most common one, while Vorstellbarkeit is also found. We will use imagery throughout this paper.

Concreteness seems to play an important role in human language processing (Borghi et al., 2017). Concreteness has also been used for various computational linguistic tasks like the detection of metaphors and non-literal language (Turney et al., 2011; Hill and Korhonen, 2014; Frassinelli and Schulte im Walde, 2019), lexical simplification (Jauhar and Specia, 2012), multimodal retrieval (Hessel et al., 2018) or estimating the stability of word embeddings (Pierrejean and Tanguy, 2019).

Traditionally, word norms are obtained by asking subjects to estimate the value for each property on a Likert scale. Recently, various approaches have also been proposed to predict the concreteness of words. On three different datasets we will test two algorithms that have given very good results for English data and compare the results in section 4, after we have discussed the most common approaches to predicting word concreteness (section 2) and presented the concreteness data available for German (section 3).

2 Related work

We find basically three approaches to predict the concreteness of a word: (1) adopting the concreteness value from similar, related or neighboring words; (2) identifying a dimension in word embeddings that corresponds to concreteness; (3) training a regression model on features of words.

2.1 Adopting concreteness of related words

Liu et al. (2014) predict values for imagery, a word norm that strongly correlates with concreteness, by using the values from synonyms and hypernyms found in WordNet.

Rabinovich et al. (2018) predict the concreteness of words indirectly by assigning a concreteness value to sentences in which a word occurs. The concreteness value of a sentence is based on the presence of seed words. The set of seed words is constructed by selecting words with derivational suffixes that are typical for highly abstract nouns. The correlation between the predicted values and the manually assigned values from various subsets of the dataset from Brysbaert et al. (2014) and the 4,295 concreteness values[2] from the MRC (Medical Research Council) Psycholinguistic Database (Coltheart, 1981) ranges from 0.66 to 0.74.

[2] The current version of MRC has concreteness values aggregated from different sources for 8,288 words. We assume that a previous version provided concreteness values for 4,295 words.

Turney et al. (2011) compute the degree of concreteness of a word as the sum of the similarities between the word and n concrete paradigm words minus the sum of the similarities between the word and n abstract paradigm words. The paradigm words are found as follows: first, one concrete and one abstract paradigm word are selected such that the correlation between the concreteness values for all words in the training data and the values predicted using the similarity with these two words is maximized. Then a second concrete and a second abstract word are added that again maximize the correlation. This process is repeated until n abstract and n concrete words are found. Turney et al. (2011) limit the selection to 20 abstract and 20 concrete paradigm words. Using half of the MRC data for training and half for testing, they found a Spearman correlation coefficient of 0.81 between predicted and observed concreteness values. To compute the similarity between words they use count-based word embeddings of 1000 dimensions trained on a 5·10^10 word web corpus.

The same approach is followed by Köper and Schulte im Walde (2016) to predict concreteness values for German words using word vectors trained with word2vec (Mikolov et al., 2013) on the DE-COW14AX German web corpus. For training and testing they merged concreteness values from Kanske and Kotz (2010) (called Leipzig Word Norms below) and Lahl et al. (2009) (called WWN below) and in addition added translations from sets of English word norms for training. 90% of the data were used for training, 10% for testing. The Pearson correlation between the test data and the predicted values for concreteness/abstractness was 0.825.

2.2 Concreteness in word embeddings

Rothe et al. (2016) try to find low-dimensional feature representations of words in which at least some dimensions correspond to interpretable properties of words. One of these dimensions is concreteness. For training and testing they use Google News embeddings and two subsets of frequent words from the norms of Brysbaert et al. (2014). For their test set of 8,694 frequent words they found a moderate correlation with the human judgments (Kendall's τ = 0.623). Similarly, Hollis and Westbury (2016) examined which dimensions of word embeddings correlate with one of the classical word norms. They found no direct correlations, but after reducing the number of dimensions for a set of words by applying Singular Value Decomposition, they found a strong correlation between one of the dimensions and concreteness.

2.3 Regression models for concreteness

Tanaka et al. (2013) train a regression model to predict concreteness values. As features they use a small number of manually constructed co-occurrence features, like co-occurrence with sense verbs. For training and evaluation they use a subset of 3,455 nouns from the MRC Database. Pearson's correlation and Kendall's τ between the values from the database and their predictions are 0.688 and 0.508, respectively.

Paetzold and Specia (2016) train a regression model to predict four word norms, among which concreteness. Like many other studies, they use the data from the MRC database. As features they use word embeddings trained on a set of various large corpora and a number of word features extracted from WordNet. For each word norm they use half of the words to train the model and half of the words for evaluation. For concreteness they find a Pearson correlation coefficient of 0.869.

Ehara (2017) trains regression models to predict four word norms for Japanese and English words. As features they use word embeddings trained with word2vec and a probability distribution of words over topics found using Latent Dirichlet Allocation. They use a subset of 1,842 words from the MRC data, of which 1,342 words are used for training and 500 for testing. When both feature sets are trained on the British National Corpus (BNC) and used in combination, the best regression model gives a Pearson correlation of 0.87 and a Spearman correlation coefficient of 0.876 on the test data.

Ljubešić et al. (2018) used a regression model as well, with pre-trained fastText word embeddings (Mikolov et al., 2018). They found a Spearman correlation coefficient of 0.887 between the predicted concreteness values and the values from Brysbaert et al. (2014), and a Spearman correlation of 0.872 on the MRC data, in both cases using 3-fold cross validation. A similar result was found by Charbonnier and Wartena (2019), who reach a Pearson correlation coefficient of 0.91 on the data from Brysbaert et al. (2014), using the same vectors and 10-fold cross validation. Here a minor improvement could be realized using part of speech and frequent suffixes as additional features.

Though all studies use different data and different versions of the MRC Psycholinguistic Database, use different splits and different numbers of folds for cross validation, and finally use different correlation coefficients, all studies report very similar results. The correlations that are found are all in the range of correlations found between various sets of concreteness values (see Charbonnier and Wartena, 2019, Table 2).

3 Data

Both for English and for German, various word norms with concreteness values have been created, though some are quite small and only available as printed supplements to older publications.

The dataset created by Baschek et al. (1977) and Wippich and Bredenkamp (1979) consists of 1,698 words (800 nouns, 400 adjectives, 498 verbs) and is one of the oldest and still one of the largest word norms for German. We will refer to this dataset as the Göttingen Word Norms. We removed 40 verbs containing an underscore, especially all reflexive verbs (e.g. sich_wünschen; to wish), from the dataset. For a number of words the experiment was repeated and two values are given. We only use the first value in these cases.

Lahl et al. (2009) collected values for 2,654 words using crowdsourcing to build a dataset called the Web Word Norms (WWN). For the WWN, 3,907 subjects committed 190,212 ratings, each for at most 50 words. On average each word has 24 ratings. They used an 11-point scale where 0 stands for the most concrete and 10 for the least concrete judgment.

Kanske and Kotz (2010) collected ratings for valence, arousal and concreteness for 1,000 nouns. This dataset is known as the Leipzig Affective Word Norms. Only nouns were used to reduce the variance other word classes would introduce. The experiment was done in 2006 with 32 native speakers. On two separate days the participants rated the words 3 times on a 9-point scale, each time for one of the three ratings. This was repeated 2 years later with two groups, one with 22 repeating participants from 2006 and a second with 32 fresh participants. The words were collected from the Duden dictionary and a previous word list by the same authors. Only 1- and 2-syllable words and no compound nouns were allowed.

The Berlin Word Norms (Vo et al., 2009) and the word norms determined by Schmidtke et al. (2014) contain values for valence, arousal and imagery, but no values for concreteness. Some more word norms for German, including concreteness, are published by Hager and Hasselhorn (1994).

3.1 Merged Dataset

In order to have a larger dataset for German, providing more training data for supervised prediction algorithms, we created a merged dataset.

The overlap of the datasets is quite small (see Table 1); the correlation between the values for the overlapping parts, however, is high (around 0.9). Since the Leipzig Word Norms use low values for concrete and high values for abstract words, the correlation between this dataset and the others is negative.

For the merged dataset we use the 7-point scale where 1 means abstract and 7 means concrete. We do not simply rescale the values but use linear regression on the overlapping parts such that the values for the words in the overlapping parts are as close as possible. We take the values from the Göttingen Word Norms as an anchor and transform the other values using the slope and the intercept. The transformed concreteness thus is defined as

C' = α + βC    (1)

where C is the original value. For WWN α = 0.776 and β = 0.608, and for the Leipzig Word Norms α = 7.39 and β = −0.540. Finally, we take the average over all datasets if a word is present in more than one source. The dataset thus offers empirical concreteness values for 4,182 German words. In Table 1 we see the high correlation of the values in the merged data with those in the original datasets. The merged dataset can be downloaded from http://textmining.wp.hs-hannover.de/datasets.html

Table 1: Size of the intersections and the Pearson correlation between the concreteness values in the datasets. As the Merged set is a composition of the other datasets, the intersection is always equal to the size of the other dataset.

                   Merged            Göttingen WN      WWN
                 Inters.  Correl.   Inters.  Correl.  Inters.  Correl.
  Göttingen WN    1698     0.997
  WWN             2654     0.969      680     0.900
  Leipzig WN      1000    -0.985      127    -0.928     488    -0.875

4 Methods

For each dataset we use two methods to predict the concreteness values in a five-fold cross validation scheme. First, we use the method of Turney et al. (2011) described above in section 2. Following Turney et al. (2011) and Köper and Schulte im Walde (2016), we use 20 abstract and 20 concrete prototype words. As a second method we use Support Vector Regression (Drucker et al., 1997) and grid search to find optimal hyperparameters (γ = 1, C = 10 with an RBF kernel). As features we use the pre-trained word embeddings from fastText for German (Grave et al., 2018).

All tests were done using 5-fold cross validation. We use stratified sampling for the Göttingen WN and the Merged dataset to ensure that each fold has the same number of nouns, verbs and adjectives. For the other datasets we use random splits.

5 Results and Discussion

The results for all datasets and both methods are given in Table 2. We see in general very high correlation values for all datasets and both methods. All correlation values are in a similar range as the correlations between the datasets.

Table 2: Results of 5-fold cross validation using different methods for all datasets. All results are averaged Pearson correlation coefficients. For Turney we used 20 words per class.

                 Merged           Göttingen WN     WWN              Leipzig WN
  SVR            0.861 (±0.026)   0.862 (±0.040)   0.851 (±0.023)   0.890 (±0.027)
  Turney et al.  0.849 (±0.012)   0.842 (±0.033)   0.851 (±0.020)   0.901 (±0.017)

We can make some interesting observations. The first remarkable fact is that for all datasets there is no significant difference between the results from the method of Turney et al. (2011) and the regression model. As far as we know, these methods have not been compared directly before. This result is quite surprising, since there are many aspects of the meaning of a word that determine word similarity, and all of these aspects are used to find the similar words on which the concreteness prediction is based in the method of Turney et al. It has to be noted that the search for the prototype words in Turney's method is extremely slow and not feasible for large datasets.

Furthermore, we see that our implementation of the method of Turney et al. gives slightly better results for WWN and the Leipzig Word Norms than the result found by Köper and Schulte im Walde (2016), who used a random split of the union of those two datasets (0.844 and 0.891 vs. 0.825). Besides the possibility that they have chosen a disadvantageous split, we see two differences: in the first place, we used different word embeddings to compute the word similarity. Secondly, they added concreteness values from English datasets with German translations to the training data. This is only helpful if concreteness is invariant under translation, which might not be the case.

6 Conclusions and Future Work

Datasets with concreteness values for German are smaller and less easily accessible than those for English. One contribution of the present work is that we aggregated a consistent dataset with over 4,000 concreteness ratings from three different sources.

A possibility to obtain more concreteness ratings is to train a model on available ratings and predict ratings for other words. We show that prediction methods that previously have been tested only for English yield similar results for German. Moreover, we show that two of the best available methods, which have not been compared on the same data before, yield similar results with no significant differences on four different datasets.

In the near future we will extend the merged dataset with values from some smaller and older studies.
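To make the scoring step of the similarity-based method of Turney et al. (2011) concrete, it can be sketched as follows: a word's score is its summed cosine similarity to the concrete paradigm words minus its summed similarity to the abstract paradigm words. The tiny two-dimensional vectors and word lists below are invented purely for illustration; the experiments in this paper use 300-dimensional fastText embeddings, and the greedy search for paradigm words is omitted.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def concreteness_score(word, vectors, concrete_paradigm, abstract_paradigm):
    """Turney-style score: summed similarity to the concrete paradigm
    words minus summed similarity to the abstract paradigm words."""
    v = vectors[word]
    pos = sum(cosine(v, vectors[p]) for p in concrete_paradigm)
    neg = sum(cosine(v, vectors[a]) for a in abstract_paradigm)
    return pos - neg

# Toy 2-dimensional "embeddings" (invented for illustration).
vectors = {
    "stone":   (1.0, 0.1),
    "table":   (0.9, 0.2),
    "freedom": (0.1, 1.0),
    "idea":    (0.2, 0.9),
    "apple":   (0.95, 0.15),
    "justice": (0.15, 0.95),
}

concrete = ["stone", "table"]
abstract = ["freedom", "idea"]
score_apple = concreteness_score("apple", vectors, concrete, abstract)
score_justice = concreteness_score("justice", vectors, concrete, abstract)
```

With these toy vectors, "apple" receives a clearly positive score and "justice" a clearly negative one, mirroring the intended concrete/abstract ordering.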
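The merging procedure of section 3.1 (equation 1) amounts to fitting the slope β and intercept α by ordinary least squares on the overlapping words, mapping the second dataset onto the anchor's scale, and averaging where a word occurs in both. A minimal sketch, with invented toy ratings in place of the real Göttingen anchor and an inverted toy scale standing in for a dataset like the Leipzig Word Norms:

```python
def fit_linear(xs, ys):
    # Ordinary least squares for y ≈ alpha + beta * x.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx
    return alpha, beta

def merge(anchor, other):
    """Map 'other' onto the scale of 'anchor' via C' = alpha + beta * C
    (equation 1), fitting alpha and beta on the overlapping words,
    then average the values of words present in both datasets."""
    overlap = sorted(set(anchor) & set(other))
    alpha, beta = fit_linear([other[w] for w in overlap],
                             [anchor[w] for w in overlap])
    merged = dict(anchor)
    for w, c in other.items():
        c_t = alpha + beta * c
        merged[w] = (merged[w] + c_t) / 2 if w in merged else c_t
    return merged

# Invented toy ratings: anchor on a 7-point scale (7 = concrete),
# 'other' on an inverted scale (low = concrete).
anchor = {"stone": 6.8, "idea": 1.5, "table": 6.5}
other = {"stone": 1.0, "idea": 9.0, "table": 1.4, "freedom": 8.5}
merged = merge(anchor, other)
```

Because the fitted slope is negative for the inverted scale, the transformed value for "freedom" lands at the abstract end of the anchor's 7-point scale, as intended.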
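The evaluation protocol of section 4, k-fold cross validation with an averaged Pearson correlation, can be sketched in a self-contained way. The stand-in "model" below simply reads off a precomputed noisy feature; it replaces the SVR and paradigm-word predictors, and the pseudo-words and gold ratings are invented for illustration (the paper additionally stratifies the Göttingen and Merged folds by part of speech, which is not shown here).

```python
import math
import random
from statistics import mean

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length lists.
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs)
                    * sum((y - my) ** 2 for y in ys))
    return num / den

def kfold_pearson(words, gold, make_model, k=5, seed=0):
    """Average Pearson correlation over k random folds.
    make_model(train_words) must return a word -> prediction callable."""
    order = list(words)
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [w for j, fold in enumerate(folds) if j != i for w in fold]
        model = make_model(train)
        scores.append(pearson([model(w) for w in test],
                              [gold[w] for w in test]))
    return mean(scores)

# Invented toy data: 40 pseudo-words with 7-point gold ratings
# and a deterministically noisy feature correlated with them.
rng = random.Random(1)
words = ["w%d" % i for i in range(40)]
gold = {w: 1 + (i % 7) for i, w in enumerate(words)}
feature = {w: gold[w] + rng.uniform(-0.5, 0.5) for w in words}

# The stand-in "model" ignores the training fold and reads the feature.
score = kfold_pearson(words, gold, lambda train: (lambda w: feature[w]))
```

Since the feature is the gold rating plus small noise, the averaged correlation is high but below 1, which is the shape of result reported in Table 2 for the real predictors.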
References

Ilse-Lore Baschek, Jürgen Bredenkamp, Brigitte Oehrle, and Werner Wippich. 1977. Determination of imagery, concreteness and meaningfulness of 800 nouns. Zeitschrift für experimentelle und angewandte Psychologie, 24(3):353–396.

Anna M Borghi, Ferdinand Binkofski, Cristiano Castelfranchi, Felice Cimatti, Claudia Scorolli, and Luca Tummolini. 2017. The challenge of abstract concepts. Psychological Bulletin, 143(3):263.

Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2014. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3):904–911.

Jean Charbonnier and Christian Wartena. 2019. Predicting word concreteness and imagery. In Proceedings of the 13th International Conference on Computational Semantics - Long Papers, pages 176–187.

Max Coltheart. 1981. The MRC Psycholinguistic Database. The Quarterly Journal of Experimental Psychology Section A, 33(4):497–505.

Harris Drucker, Christopher JC Burges, Linda Kaufman, Alex J Smola, and Vladimir Vapnik. 1997. Support vector regression machines. In Advances in Neural Information Processing Systems, pages 155–161.

Yo Ehara. 2017. Language-independent prediction of psycholinguistic properties of words. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 330–336.

Diego Frassinelli and Sabine Schulte im Walde. 2019. Distributional interaction of concreteness and abstractness in verb–noun subcategorisation. In Proceedings of the 13th International Conference on Computational Semantics - Short Papers, pages 38–43, Gothenburg, Sweden. Association for Computational Linguistics.

Michael Friendly, Patricia E. Franklin, David Hoffman, and David C. Rubin. 1982. The Toronto Word Pool: Norms for imagery, concreteness, orthographic variables, and grammatical usage for 1,080 words. Behavior Research Methods & Instrumentation, 14(4):375–399.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Willi Hager and Marcus Hasselhorn, editors. 1994. Handbuch deutschsprachiger Wortnormen. Hogrefe Verlag für Psychologie, Göttingen.

Jack Hessel, David Mimno, and Lillian Lee. 2018. Quantifying the visual concreteness of words and topics in multimodal datasets. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2194–2205, New Orleans, Louisiana. Association for Computational Linguistics.

Felix Hill and Anna Korhonen. 2014. Concreteness and subjectivity as dimensions of lexical meaning. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 725–731.

Geoff Hollis and Chris Westbury. 2016. The principals of meaning: Extracting semantic dimensions from co-occurrence models of semantics. Psychonomic Bulletin & Review, 23(6):1744–1756.

Sujay Kumar Jauhar and Lucia Specia. 2012. UOW-SHEF: SimpLex - lexical simplicity ranking based on contextual and psycholinguistic features. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 477–481.

Philipp Kanske and Sonja A. Kotz. 2010. Leipzig affective norms for German: A reliability study. Behavior Research Methods, 42(4):987–991.

Maximilian Köper and Sabine Schulte im Walde. 2016. Automatically generated affective norms of abstractness, arousal, imageability and valence for 350 000 German lemmas. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2595–2598.

Olaf Lahl, Anja S. Göritz, Reinhard Pietrowsky, and Jessica Rosenberg. 2009. Using the World-Wide Web to obtain large-scale word norms: 190,212 ratings on a set of 2,654 German nouns. Behavior Research Methods, 41(1):13–19.

Ting Liu, Kit Cho, G. Aaron Broadwell, Samira Shaikh, Tomek Strzalkowski, John Lien, Sarah Taylor, Laurie Feldman, Boris Yamrom, Nick Webb, Umit Boz, Ignacio Cases, and Ching-sheng Lin. 2014. Automatic expansion of the MRC psycholinguistic database imageability ratings. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2800–2805, Reykjavik, Iceland. European Language Resources Association (ELRA).

Nikola Ljubešić, Darja Fišer, and Anita Peti-Stantić. 2018. Predicting concreteness and imageability of words within and across languages via word embeddings. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 217–222, Melbourne, Australia. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Gustavo Paetzold and Lucia Specia. 2016. Inferring psycholinguistic properties of words. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 435–440, San Diego, California. Association for Computational Linguistics.

Bénédicte Pierrejean and Ludovic Tanguy. 2019. Investigating the stability of concrete nouns in word embeddings. In Proceedings of the 13th International Conference on Computational Semantics - Short Papers, pages 65–70, Gothenburg, Sweden. Association for Computational Linguistics.

E. Rabinovich, B. Sznajder, A. Spector, I. Shnayderman, R. Aharonov, D. Konopnicki, and N. Slonim. 2018. Learning concept abstractness using weak supervision. ArXiv e-prints.

Sascha Rothe, Sebastian Ebert, and Hinrich Schütze. 2016. Ultradense word embeddings by orthogonal transformation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 767–777. Association for Computational Linguistics.

David S Schmidtke, Tobias Schröder, Arthur M Jacobs, and Markus Conrad. 2014. ANGST: Affective norms for German sentiment terms, derived from the affective norms for English words. Behavior Research Methods, 46(4):1108–1118.

Shinya Tanaka, Adam Jatowt, Makoto P. Kato, and Katsumi Tanaka. 2013. Estimating content concreteness for finding comprehensible documents. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM '13, pages 475–484, New York, NY, USA. ACM.

Peter D. Turney, Yair Neuman, Dan Assaf, and Yohai Cohen. 2011. Literal and metaphorical sense identification through concrete and abstract context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 680–690, Stroudsburg, PA, USA. Association for Computational Linguistics.

Melissa LH Vo, Markus Conrad, Lars Kuchinke, Karolina Urton, Markus J Hofmann, and Arthur M Jacobs. 2009. The Berlin Affective Word List Reloaded (BAWL-R). Behavior Research Methods, 41(2):534–538.

Werner Wippich and Jürgen Bredenkamp. 1979. Bildhaftigkeit und Lernen, volume 78 of Wissenschaftliche Forschungsberichte. Steinkopff-Verlag, Darmstadt.