Modelling semantic transparency in English compound nouns
Melanie J. Bell
Anglia Ruskin University, Cambridge, U.K.
melanie.bell@anglia.ac.uk

Martin Schäfer
Friedrich Schiller University, Jena, Germany
post@martinschaefer.info


1    Introduction

Semantic transparency is known to play an important role in the storage
and processing of complex words (e.g. Marslen-Wilson et al. 1994), and
human raters of transparency achieve high levels of agreement (e.g.
Frisson et al. 2008, Munro et al. 2010). In the case of noun-noun
compounds, overall transparency is largely determined by the
transparency of the individual constituents. For example, Reddy et al.
(2011) showed that the perceived transparency of a compound is highly
correlated with both the sum and the product of the perceived
transparencies of its constituents. Furthermore, many psycholinguistic
studies find significant effects for semantic transparency using a
four-way distinction based on perceived constituent transparency:
transparent-transparent (e.g. carwash), transparent-opaque (e.g.
jailbird), opaque-transparent (e.g. strawberry) and opaque-opaque (e.g.
hogwash) (Libben et al. 2003). Bell and Schäfer (2013) modelled the
transparency of individual compound constituents and showed that
shifted word senses reduce perceived transparency, while certain
semantic relations between constituents increase it. However, this
finding is problematic in at least two ways. Firstly, it is not clear
whether there is a solid basis for establishing whether a specific word
sense is shifted or not. For example, card in credit card is clearly
shifted if viewed etymologically, but may not synchronically be
perceived as shifted due to its frequent use. Secondly, work on
conceptual combination by Gagné and collaborators has shown that
relational information in compounds is accessed via the concepts
associated with individual modifiers and heads, rather than
independently of them (e.g. Spalding et al. 2010 for an overview). This
leads to the hypothesis that it is not whether a specific word sense is
etymologically shifted, nor whether a specific semantic relation is
used per se, that makes a compound constituent more or less
transparent; rather, it is the degree of expectedness of a particular
word sense and a particular relation for a given constituent. In this
paper, we provide evidence in support of this hypothesis: the more
expected the word sense and relation for a constituent, the more
transparent it is perceived to be.

2    Method

We used the publicly available dataset described in Reddy et al.
(2011), which gives human transparency ratings for a set of 90 compound
types and their constituents (N1 and N2), and comprises a total of 7717
ratings. To model the expectedness of word senses and semantic
relations for a given compound constituent, we used the constituent
families of the compounds, which we extracted in a two-step process.
First, we took all strings of exactly two nouns that follow an article
in the British National Corpus and that also occur four times or more
in the USENET corpus (Shaoul and Westbury 2010). Second, from this set
we extracted the positional constituent families for all constituent
nouns in the Reddy et al. dataset, giving a total of 4553 compounds for
the N1 families and 9226 for the N2 families. Each of these compound
types was coded for the semantic relation between the constituents
(after Levi 1978), and for the WordNet sense of the constituent under
consideration (Princeton 2010). We then calculated the proportion of
compound types in each constituent family with each semantic relation
(relation proportion), and with each WordNet sense of the constituent
in question (synset proportion). We take these two measures to reflect
the expectedness of the respective relations and WordNet senses of the
constituents: if a relation or sense occurs in a high proportion of the
constituent family, it is more expected. These variables were used,
along with other quantitative measures, as predictors in ordinary least
squares regression models of perceived constituent transparency. The
final model for the transparency of N1 is given in Table 1.
          Copyright © by the paper’s authors. Copying permitted for private and academic purposes.
In Vito Pirrelli, Claudia Marzi, Marcello Ferro (eds.): Word Structure and Word Usage. Proceedings of the NetWordS Final
                          Conference, Pisa, March 30-April 1, 2015, published at http://ceur-ws.org


                                                    Coef     S.E.       t   Pr(>|t|)
Intercept                                        -4.6413   0.6593   -7.04    <0.0001
relation proportion in N1 family                 -0.2187   0.6013   -0.36     0.7161
log family size of N1                            -0.0189   0.0931   -0.20     0.8395
synset proportion in N1 family                   -0.2426   0.6152   -0.39     0.6934
log synset count of N1                           -0.7939   0.2469   -3.22     0.0013
compound proportion in N1 family (token-based)    3.0130   0.6788    4.44    <0.0001
log frequency of N1                               0.8728   0.0569   15.34    <0.0001
relation proportion * log family size             0.3311   0.1305    2.54     0.0113
synset proportion * log synset count              0.6855   0.3161    2.17     0.0303
compound proportion * log frequency N1           -0.2804   0.0816   -3.44     0.0006

Table 1: Final model for the transparency of N1, adjusted R² = 0.334
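
Because each proportion measure enters the model both as a main effect
and in an interaction, its effect on N1 transparency depends on the
value of the interacting variable. The sketch below works this out
from the coefficients in Table 1: the conditional slope of a predictor
is its main-effect coefficient plus the interaction coefficient times
the moderator value. The function names are hypothetical, and the
calculation assumes that the predictors entered the model uncentred on
the scales named in the table, which the paper does not state.

```python
# (main effect, interaction coefficient) pairs copied from Table 1.
COEFS = {
    "relation proportion": (-0.2187, 0.3311),  # moderator: log family size of N1
    "synset proportion": (-0.2426, 0.6855),    # moderator: log synset count of N1
    "compound proportion": (3.0130, -0.2804),  # moderator: log frequency of N1
}


def conditional_slope(predictor, moderator_value):
    """Slope of `predictor` at a given value of its moderator."""
    main, interaction = COEFS[predictor]
    return main + interaction * moderator_value


def sign_change_point(predictor):
    """Moderator value at which the conditional slope equals zero."""
    main, interaction = COEFS[predictor]
    return -main / interaction


if __name__ == "__main__":
    for name in COEFS:
        print(name, "slope is zero at moderator =", round(sign_change_point(name), 3))
```

Under these assumptions, the slope of relation proportion turns
positive once log family size exceeds roughly 0.66 and grows with
family size, while the slope of compound proportion is positive at low
log frequencies and shrinks towards zero at about 10.75. Both patterns
match the description of the interactions in the Results section.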


[Figure 1: three contour plots of perceived N1 transparency.
 Panel 1: relation proportion in N1 family (x-axis) by log family size of N1 (y-axis).
 Panel 2: synset proportion in N1 family (x-axis) by log synset count of N1 (y-axis).
 Panel 3: compound proportion in N1 family, token-based (x-axis) by log frequency of N1 (y-axis).]

Figure 1. Interaction plots for N1 transparency


3    Results

All predictors in our model enter into significant interactions, and
these are shown graphically in Figure 1, where the contour lines on the
plots represent perceived transparency of the first constituent (N1).
The first plot shows an interaction between relation proportion and
overall (log) family size: for small families, relation proportion
plays little role, whereas for larger families, in accordance with our
hypothesis, the transparency of N1 increases with the proportion of the
corresponding relation in the family. The second plot shows the
interaction between the synset proportion and the total number of a
constituent's senses (as listed in WordNet): only if there is a
sufficient number of different senses in the family is their proportion
a reliable predictor of semantic transparency. There is also a small
but significant interaction between the log frequency of a constituent
and the proportion of the constituent family (in terms of tokens)
represented by the compound in question: this shows that transparency
increases with frequency, but only in the lower frequency ranges does
the proportion in the family play a role.

4    Conclusion

Overall, the model provides clear evidence for our hypothesis. N1 is
rated as most transparent when it is a frequent word, with a large
family, occurring with its preferred semantic relation and most
frequent sense, and with few other senses to compete. We interpret the
results as indicating that compound constituents are perceived as more
transparent when they are more expected (both generally and with a
specific sense) and when they occur in their most expected semantic
environments. In information theory, the less expected an event, the
greater its information content: in so far as perceived transparency is
a reflection of expectedness, it can therefore also be seen as the
inverse of informativity.

Acknowledgements

This work was made possible by three short visit grants from the
European Science Foundation through NETWORDS - The European Network on
Word Structure (grants 4677, 6520 and 7027), for which the authors are
extremely grateful.


References
Bell, Melanie J. and Martin Schäfer. 2013. Semantic transparency:
   challenges for distributional semantics. In Aurelie Herbelot,
   Roberto Zamparelli and Gemma Boleda (eds.), Proceedings of the IWCS
   2013 workshop: Towards a formal distributional semantics, 1–10.
   Potsdam: Association for Computational Linguistics.
Frisson, Steven, Elizabeth Niswander-Klement and Alexander Pollatsek.
   2008. The role of semantic transparency in the processing of English
   compound words. British Journal of Psychology 99(1), 87–107.
Levi, Judith N. 1978. The syntax and semantics of complex nominals.
   New York: Academic Press.
Marslen-Wilson, William, Lorraine K. Tyler, Rachelle Waksler and
   Lianne Older. 1994. Morphology and meaning in the English mental
   lexicon. Psychological Review 101(1), 3–33.
Munro, Robert, Steven Bethard, Victor Kuperman, Vicky Tzuyin Lai,
   Robin Melnick, Christopher Potts, Tyler Schnoebelen and Harry Tily.
   2010. Crowdsourcing and language studies: the new generation of
   linguistic data. In Proceedings of the NAACL HLT 2010 Workshop on
   Creating Speech and Language Data with Amazon's Mechanical Turk,
   122–130. Association for Computational Linguistics.
Princeton University. 2010. WordNet. <http://wordnet.princeton.edu>
Reddy, Siva, Diana McCarthy and Suresh Manandhar. 2011. An empirical
   study on compositionality in compound nouns. In Proceedings of The
   5th International Joint Conference on Natural Language Processing
   2011 (IJCNLP 2011), Chiang Mai, Thailand.
Shaoul, Cyrus and Chris Westbury. 2010. An anonymized multi-billion
   word USENET corpus 2005-2010.
   http://www.psych.ualberta.ca/~westburylab/downloads/usenet.download.html
Spalding, Thomas L., Christina L. Gagné, Allison C. Mullaly and Hongbo
   Ji. 2010. Relation-based interpretation of noun-noun phrases: A new
   theoretical approach. Linguistische Berichte Sonderheft 17, 283–315.
Wurm, Lee H. 1997. Auditory processing of prefixed English words is
   both continuous and decompositional. Journal of Memory and Language
   37, 438–461.

