Modelling semantic transparency in English compound nouns

Melanie J. Bell
Anglia Ruskin University
Cambridge, U.K.
melanie.bell@anglia.ac.uk

Martin Schäfer
Friedrich Schiller University
Jena, Germany
post@martinschaefer.info

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In Vito Pirrelli, Claudia Marzi, Marcello Ferro (eds.): Word Structure and Word Usage. Proceedings of the NetWordS Final Conference, Pisa, March 30-April 1, 2015, published at http://ceur-ws.org

1 Introduction

Semantic transparency is known to play an important role in the storage and processing of complex words (e.g. Marslen-Wilson et al. 1994), and human raters of transparency achieve high levels of agreement (e.g. Frisson et al. 2008, Munro et al. 2010). In the case of noun-noun compounds, overall transparency is largely determined by the transparency of the individual constituents. For example, Reddy et al. (2011) showed that the perceived transparency of a compound is highly correlated with both the sum and the product of the perceived transparencies of its constituents. Furthermore, many psycholinguistic studies find significant effects for semantic transparency using a four-way distinction based on perceived constituent transparency: transparent-transparent (e.g. carwash), transparent-opaque (e.g. jailbird), opaque-transparent (e.g. strawberry) and opaque-opaque (e.g. hogwash) (Libben et al. 2003).

Bell and Schäfer (2013) modelled the transparency of individual compound constituents and showed that shifted word senses reduce perceived transparency, while certain semantic relations between constituents increase it. However, this finding is problematic in at least two ways. Firstly, it is not clear whether there is a solid basis for establishing whether a specific word sense is shifted or not. For example, card in credit card is clearly shifted if viewed etymologically, but may not synchronically be perceived as shifted due to its frequent use. Secondly, work on conceptual combination by Gagné and collaborators has shown that relational information in compounds is accessed via the concepts associated with individual modifiers and heads, rather than independently of them (e.g. Spalding et al. 2010 for an overview). This leads to the hypothesis that it is not whether a specific word sense is etymologically shifted, nor whether a specific semantic relation is used per se, that makes a compound constituent more or less transparent; rather, it is the degree of expectedness of a particular word sense and a particular relation for a given constituent. In this paper, we provide evidence in support of this hypothesis: the more expected the word sense and relation for a constituent, the more transparent it is perceived to be.

2 Method

We used the publicly available dataset described in Reddy et al. (2011), which gives human transparency ratings for a set of 90 compound types and their constituents (N1 and N2), and comprises a total of 7717 ratings. To model the expectedness of word senses and semantic relations for a given compound constituent, we used the constituent families of the compounds, which we extracted in a two-step process. We took all strings of exactly two nouns that follow an article in the British National Corpus and which also occur four times or more in the USENET corpus (Shaoul and Westbury 2010). From this set, we extracted the positional constituent families for all constituent nouns in the Reddy et al. dataset, giving a total of 4553 compounds for the N1 families and 9226 for the N2 families. Each of these compound types was coded for the semantic relation between the constituents (after Levi 1978), and for the WordNet sense of the constituent under consideration (Princeton 2010). We then calculated the proportion of compound types in each constituent family with each semantic relation (relation proportion), and each WordNet sense of the constituent in question (synset proportion). We take these two measures to reflect the expectedness of the respective relations and WordNet senses of the constituents: if a relation or sense occurs in a high proportion of the constituent family, it is more expected. These variables were used, along with other quantitative measures, as predictors in ordinary least squares regression models of perceived constituent transparency. The final model for the transparency of N1 is given in Table 1:

Predictor                                          Coef     S.E.     t       Pr(>|t|)
Intercept                                         -4.6413   0.6593   -7.04   <0.0001
relation proportion in N1 family                  -0.2187   0.6013   -0.36    0.7161
log family size of N1                             -0.0189   0.0931   -0.20    0.8395
synset proportion in N1 family                    -0.2426   0.6152   -0.39    0.6934
log synset count of N1                            -0.7939   0.2469   -3.22    0.0013
compound proportion in N1 family (token-based)     3.0130   0.6788    4.44   <0.0001
log frequency of N1                                0.8728   0.0569   15.34   <0.0001
relation proportion * log family size              0.3311   0.1305    2.54    0.0113
synset proportion * log synset count               0.6855   0.3161    2.17    0.0303
compound proportion * log frequency N1            -0.2804   0.0816   -3.44    0.0006

Table 1: Final model for the transparency of N1, R² adjusted = 0.334

[Figure 1. Interaction plots for N1 transparency. Three contour plots: relation proportion in N1 family against log family size of N1; synset proportion in N1 family against log synset count of N1; compound proportion in N1 family (token-based) against log frequency of N1.]

3 Results

All predictors in our model enter into significant interactions, and these are shown graphically in Figure 1, where the contour lines on the plots represent perceived transparency of the first constituent (N1). The first plot shows an interaction between relation proportion and overall (log) family size: for small families, relation proportion plays little role, whereas for larger families, in accordance with our hypothesis, the transparency of N1 increases with the proportion of the corresponding relation in the family. The second plot shows the interaction between the synset proportion and the total number of a constituent's senses (as listed in WordNet): only if there is a sufficient number of different senses in the family is their proportion a reliable predictor of semantic transparency. There is also a small but significant interaction between the log frequency of a constituent and the proportion of the constituent family (in terms of tokens) represented by the compound in question: this shows that transparency increases with frequency, but only in the lower frequency ranges does the proportion in the family play a role.

4 Conclusion

Overall, the model provides clear evidence for our hypothesis. N1 is rated as most transparent when it is a frequent word, with a large family, occurring with its preferred semantic relation and most frequent sense, and with few other senses to compete. We interpret the results as indicating that compound constituents are perceived as more transparent when they are more expected (both generally and with a specific sense) and when they occur in their most expected semantic environments. In information theory, the less expected an event, the greater its information content: in so far as perceived transparency is a reflection of expectedness, it can therefore also be seen as the inverse of informativity.

Acknowledgements

This work was made possible by three short visit grants from the European Science Foundation through NETWORDS - The European Network on Word Structure (grants 4677, 6520 and 7027), for which the authors are extremely grateful.

References

Bell, Melanie J. and Martin Schäfer. 2013. Semantic transparency: challenges for distributional semantics. In Aurelie Herbelot, Roberto Zamparelli and Gemma Boleda (eds.), Proceedings of the IWCS 2013 workshop: Towards a formal distributional semantics, 1–10. Potsdam: Association for Computational Linguistics.

Frisson, Steven, Elizabeth Niswander-Klement and Alexander Pollatsek. 2008. The role of semantic transparency in the processing of English compound words. British Journal of Psychology 99(1), 87–107.

Levi, Judith N. 1978. The syntax and semantics of complex nominals. New York: Academic Press.

Marslen-Wilson, William, Lorraine K. Tyler, Rachelle Waksler and Lianne Older. 1994. Morphology and meaning in the English mental lexicon. Psychological Review 101(1), 3–33.

Munro, Robert, Steven Bethard, Victor Kuperman, Vicky Tzuyin Lai, Robin Melnick, Christopher Potts, Tyler Schnoebelen and Harry Tily. 2010. Crowdsourcing and language studies: the new generation of linguistic data. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, 122–130. Association for Computational Linguistics.

Princeton University. 2010. WordNet.

Reddy, Siva, Diana McCarthy and Suresh Manandhar. 2011. An empirical study on compositionality in compound nouns. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand.

Shaoul, Cyrus and Chris Westbury. 2010. An anonymized multi-billion word USENET corpus (2005-2010). http://www.psych.ualberta.ca/~westburylab/downloads/usenet.download.html

Spalding, Thomas L., Christina L. Gagné, Allison C. Mullaly and Hongbo Ji. 2010. Relation-based interpretation of noun-noun phrases: A new theoretical approach. Linguistische Berichte Sonderheft 17, 283–315.

Wurm, Lee H. 1997. Auditory processing of prefixed English words is both continuous and decompositional. Journal of Memory and Language 37, 438–461.
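
As an illustration of the two expectedness measures defined in the Method section, the following minimal sketch computes type-based relation and synset proportions for a single constituent family, together with the standard Shannon information content (surprisal) that the Conclusion alludes to. The toy family, the relation labels, the sense labels and the function names are all hypothetical and chosen only for illustration; they do not reproduce the authors' coding, data or implementation, and the paper itself does not report surprisal values.

```python
from collections import Counter
from math import log2

# Hypothetical toy N1 family: (compound type, relation label, sense label).
# All entries and labels are invented for illustration; they are not taken
# from the Reddy et al. data or from the authors' coding.
TOY_FAMILY = [
    ("credit card", "FOR", "sense_1"),
    ("credit union", "FOR", "sense_1"),
    ("credit limit", "IN", "sense_1"),
    ("credit sequence", "ABOUT", "sense_2"),
]


def type_proportions(labels):
    """Return the share of compound types carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def surprisal(p):
    """Shannon information content, in bits, of an outcome with probability p."""
    return -log2(p)


# Relation proportion and synset proportion for the toy family.
relation_proportion = type_proportions(rel for _, rel, _ in TOY_FAMILY)
synset_proportion = type_proportions(sense for _, _, sense in TOY_FAMILY)

for label, p in sorted(relation_proportion.items()):
    print(f"relation {label}: proportion={p:.2f}, surprisal={surprisal(p):.2f} bits")
for label, p in sorted(synset_proportion.items()):
    print(f"sense {label}: proportion={p:.2f}, surprisal={surprisal(p):.2f} bits")
```

In this toy family, the relation or sense with the highest proportion has the lowest surprisal, which mirrors the reading of transparency as the inverse of informativity.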