=Paper=
{{Paper
|id=Vol-1347/paper17
|storemode=property
|title=A distributional semantics approach to implicit language learning
|pdfUrl=https://ceur-ws.org/Vol-1347/paper17.pdf
|volume=Vol-1347
|dblpUrl=https://dblp.org/rec/conf/networds/AlikaniotisW15
}}
==A distributional semantics approach to implicit language learning==
Dimitrios Alikaniotis, John N. Williams
Department of Theoretical and Applied Linguistics, University of Cambridge
9 West Road, Cambridge CB3 9DP, United Kingdom
{da352|jnw12}@cam.ac.uk

1 Introduction

Vector-space models of semantics (VSMs) derive word representations by keeping track of the co-occurrence patterns of each word when found in large linguistic corpora. By exploiting the fact that similar words tend to appear in similar contexts (Harris, 1954), such models have been very successful in tasks of semantic relatedness (Landauer and Dumais, 1997; Rohde et al., 2006). A common criticism addressed towards such models is that those co-occurrence patterns do not explicitly encode specific semantic features, unlike more traditional models of semantic memory (Collins and Quillian, 1969; Rogers and McClelland, 2004). Recently, however, corpus studies (Bresnan and Hay, 2008; Hill et al., 2013b) have shown that some 'core' conceptual distinctions such as animacy and concreteness are reflected in the distributional patterns of words and can be captured by such models (Hill et al., 2013a).

In the present paper we argue that the distributional characteristics of words are particularly important when considering concept availability under implicit language learning conditions. Studies on implicit learning of form-meaning connections have highlighted that during the learning process only a restricted set of conceptual distinctions is available, such as those involving animacy and concreteness. For example, in the studies by Williams (2005) (W) and Leung and Williams (2014) (L&W), participants were introduced to four novel determiner-like words: gi, ro, ul, and ne. They were explicitly told that these functioned like the article 'the', but that gi and ro were used with near objects and ul and ne with far objects. What they were not told was that gi and ul were used with living things and ro and ne with non-living things. Participants were exposed to grammatical determiner-noun combinations in a training task and were afterwards given novel determiner-noun combinations to test for generalisation of the hidden regularity. W and L&W report such a generalisation effect even in participants who remained unaware of the relevance of animacy to article usage – semantic implicit learning. Paciorek and Williams (2015) (P&W) report similar effects for a system in which novel verbs (rather than determiners) collocate with either abstract or concrete nouns.

However, certain semantic constraints on semantic implicit learning have been observed. In P&W, generalisation was weaker when tested with items that were of relatively low semantic similarity to the exemplars received in training. In L&W, Chinese participants showed implicit generalisation of a system in which determiner usage was governed by whether the noun referred to a long or a flat object (corresponding to the Chinese classifier system), whereas there was no such implicit generalisation in native English speakers. Based on this evidence we argue that the implicit learnability of semantic regularities depends on the degree to which the relevant concept is reflected in language use. By forming semantic representations of words based on their distributional characteristics, we may be able to predict what would be learnable under implicit learning conditions.

2 Simulation

We obtained semantic representations using the skip-gram architecture (Mikolov et al., 2013) provided by the word2vec package (https://code.google.com/p/word2vec/), trained with hierarchical softmax on the British National Corpus or on a Chinese Wikipedia dump file of comparable size. The parameters used were as follows: window size: 5; vector dimensionality: 300; subsampling threshold: t = 1e-3, applied only to the English corpus.
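For concreteness, the training setup above can be sketched as follows, using Python's gensim library as a stand-in for the original C word2vec tool. This is a minimal sketch under stated assumptions: the corpus file names and the min_count/workers settings are illustrative and are not taken from the paper.

```python
# Sketch of obtaining the semantic vectors with gensim's Word2Vec
# (gensim >= 4.0 parameter names; a stand-in for the original word2vec tool).
# "bnc_tokenised.txt" / "zh_wiki_tokenised.txt" are hypothetical
# one-sentence-per-line, whitespace-tokenised corpus files.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

def train_vectors(corpus_path, subsample):
    return Word2Vec(
        LineSentence(corpus_path),
        vector_size=300,   # dimensionality used in the paper
        window=5,          # symmetric context window
        sg=1,              # skip-gram architecture
        hs=1, negative=0,  # hierarchical softmax, no negative sampling
        sample=subsample,  # subsampling threshold
        min_count=5,       # illustrative assumption
        workers=4,
    )

english_model = train_vectors("bnc_tokenised.txt", subsample=1e-3)
chinese_model = train_vectors("zh_wiki_tokenised.txt", subsample=0.0)
```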
The skip-gram model encapsulates the idea of distributional semantics introduced above by learning which contexts are more probable for a given word. Concretely, it uses a neural network architecture in which each word from a large corpus is presented in the input layer and its context (i.e. several words around it) in the output layer. The goal of the network is to learn a configuration of weights such that, when a word is presented in the input layer, the output nodes that become more activated correspond to those words in the vocabulary which appeared more frequently as its context.

Figure 1: Generalisation gradients obtained from the Williams (2005) dataset. The gradients were obtained by averaging the output activations for the grammatical and the ungrammatical pairs, respectively. The network hyperparameters used were: learning rate η = 0.01, weight decay γ = 0.01, size of hidden layer h ∈ R^100. For this and all the reported simulations the dashed vertical lines mark the epoch in which the training error approached zero. See text for more information on the experiment.

Figure 2: Results of our simulation along with the behavioural results of Paciorek and Williams (2015), exp. 1. The hyperparameters used were the same as in the simulation of Williams (2005).

As argued above, the resulting representations will carry, by means of their distributional patterns, semantic information such as concreteness or animacy. Consistent with the above hypotheses, we predict that, given a set of words in the training phase, the degree to which one can generalise to novel nouns will depend on how much the relevant concepts are reflected in those training words. If, for example, the words used during the training session do not encode animacy in their co-occurrence statistics, albeit denoting animate nouns, then generalising to other animate nouns would be more difficult.

In order to examine this prediction, we fed the resulting semantic representations to a non-linear classifier (a feedforward neural network), the task of which was to learn to associate noun representations with determiners or verbs, depending on the study in question. During the training phase, the neural network received as input the semantic vectors of the nouns, with the corresponding determiners/verbs (coded as 1-in-N binary vectors, where N is the number of novel non-words; all the studies reported use four novel non-words) as the output vector. Using backpropagation with stochastic gradient descent as the learning algorithm, the goal of the network was to learn to discriminate between grammatical and ungrammatical noun–determiner/verb combinations. We hypothesise that this could be possible if either specific features of the input representation or a combination of them contained the relevant concepts.
Considering the distributed nature of our semantic representations, we explore the latter option by adding a tanh hidden layer, the purpose of which was to extract non-linear combinations of features of the input vector. We then recorded the generalisation ability of our classifier through time (epochs) by simply asking what the probability would be of encountering a known determiner k with a novel word \vec{w}, taking the softmax function:

p(y = k \mid \vec{w}) = \frac{\exp(\mathrm{net}_k)}{\sum_{k' \in K} \exp(\mathrm{net}_{k'})}    (1)
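The classifier just described can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the architecture (300-dimensional input, tanh hidden layer of 100 units, softmax output over the four non-words) and the hyperparameters (η = 0.01, γ = 0.01) follow the text and the caption of Fig. 1, while the initialisation scheme and the cross-entropy loss are assumptions.

```python
# Sketch of the feedforward classifier: noun vector in, novel non-word out.
import numpy as np

rng = np.random.default_rng(0)

def init_params(d_in=300, d_hid=100, d_out=4):
    return {
        "W1": rng.normal(0, 0.1, (d_in, d_hid)), "b1": np.zeros(d_hid),
        "W2": rng.normal(0, 0.1, (d_hid, d_out)), "b2": np.zeros(d_out),
    }

def forward(p, x):
    h = np.tanh(x @ p["W1"] + p["b1"])      # tanh hidden layer
    net = h @ p["W2"] + p["b2"]             # net_k for each non-word k
    e = np.exp(net - net.max())
    return h, e / e.sum()                   # softmax over non-words, eq. (1)

def sgd_step(p, x, y, lr=0.01, decay=0.01):
    """One backpropagation step on a (noun vector, 1-in-N target) pair."""
    h, prob = forward(p, x)
    d_net = prob - y                        # softmax + cross-entropy gradient
    d_h = (d_net @ p["W2"].T) * (1 - h**2)  # backprop through tanh
    grads = {
        "W2": np.outer(h, d_net), "b2": d_net,
        "W1": np.outer(x, d_h), "b1": d_h,
    }
    for k, g in grads.items():
        reg = decay * p[k] if k.startswith("W") else 0.0
        p[k] -= lr * (g + reg)              # SGD with L2 weight decay

# Generalisation readout for a novel noun, e.g. P(y = gi | w_lion) vs.
# P(y = ro | w_lion), assuming "gi" is output index 0 and "ro" index 1:
#   _, prob = forward(params, english_model.wv["lion"])
#   prob[0], prob[1]
```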
3 Results and Discussion

Figures 1-4 show the results of the simulations across four different datasets which reflect different semantic manipulations. The simulations show the generalisation gradients obtained by applying eq. (1) to every word in the generalisation set and then keeping track of the activation of the different determiners (W, L&W) or verbs (P&W) through time (see the sketch at the end of this section). For example, in W, where the semantic distinction was between animate and inanimate concepts, 'gi lion' would be considered a grammatical sequence while 'ro lion' an ungrammatical one. If the model has been successful in learning that 'gi' should be activated more given animate concepts, then the probability P(y = gi | w_lion) would be higher than P(y = ro | w_lion). Fig. 1 shows the performance of the classifier on the testing set of W, where, in the behavioural data, selection of the grammatical item was significantly above chance in a two-alternative forced choice task for the unaware group. The slopes of the gradients clearly show that on such a task the model would favour grammatical combinations as well.

Figures 2-3 plot the results of two experiments from P&W which focused on the abstract/concrete distinction. P&W used a false memory task in the generalisation phase, measuring learning by comparing the endorsement rates between novel grammatical and novel ungrammatical verb-noun pairs. It was reasoned that if the participants had some knowledge of the system they would endorse more novel grammatical sequences. Expt 1 (Fig. 2) used generalisation items that were higher in semantic similarity to trained items than was the case in Expt 4 (Fig. 3). The behavioural results from the unaware groups (bottom rows) show that this manipulation resulted in larger grammaticality effects on familiarity judgements in Expt 1 than in Expt 4, and also higher endorsements for concrete items in general in Expt 1. Our simulation was able to capture both of these effects.

Figure 3: Results of our simulation along with the behavioural results of Paciorek and Williams (2015), exp. 4. The hyperparameters used were the same as in the simulation of Williams (2005).

L&W Expt 3 examined the learnability of a system based on a long/flat distinction, which is reflected in the distributional patterns of Chinese but not of English. In Chinese, nouns denoting long objects have to be preceded by one specific classifier and nouns denoting flat objects by another. L&W's training phase consisted of showing participants combinations of thin/flat objects with novel determiners, asking them to judge whether the noun was thin or flat. After a period of exposure, participants were introduced to novel determiner–noun combinations which either followed the grammatical system (control trials) or did not (violation trials). Participants had significantly lower reaction times (Fig. 4, bottom row) when presented with a novel grammatical sequence than with an ungrammatical sequence, an effect not observed in the RTs of the English participants. The corresponding results of our simulations plotted in Fig. 4 show that the regularity was indeed learnable when the semantic model had only experienced the Chinese text, but not when it had experienced the English corpus.

Figure 4: Results from Leung and Williams (2014), exp. 3. See text for more info on the measures used. The gradients for the ungrammatical combinations are (1 − grammatical). The value of the weight decay was set to γ = 0.05, while the rest of the hyperparameters used were the same as in the simulation of Williams (2005).

While more direct evidence is needed to support our initial hypothesis, our results point in the direction that semantic information encoded by the distributional characteristics of words in large corpora can be important in determining what could be implicitly learnable.
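The per-epoch generalisation gradients referred to above can be read out as sketched below. This is again an illustrative reconstruction assuming the forward and sgd_step helpers from the earlier sketch; the train_pairs and gen_items structures, the determiner indices, and the epoch count are placeholders standing in for the stimuli of each study rather than the authors' actual materials.

```python
import numpy as np

# Per-epoch generalisation gradients: average softmax activation of the
# grammatical vs. the ungrammatical determiner over the generalisation set.
# train_pairs: list of (noun_vector, one_hot_target) training items;
# gen_items:   list of (noun_vector, grammatical_idx, ungrammatical_idx).
def run_simulation(params, train_pairs, gen_items, epochs=25):
    gradients = []
    for _ in range(epochs):
        for x, y in train_pairs:                 # one SGD pass per epoch
            sgd_step(params, x, y)
        gram = np.mean([forward(params, x)[1][g] for x, g, _ in gen_items])
        ungram = np.mean([forward(params, x)[1][u] for x, _, u in gen_items])
        gradients.append((gram, ungram))         # cf. the curves in Figs. 1-4
    return gradients
```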
References

Bresnan, J. and Hay, J. (2008). Gradient grammar: An effect of animacy on the syntax of give in New Zealand and American English. Lingua, 118(2):245-259.

Collins, A. M. and Quillian, M. R. (1969). Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior, 8(2):240-247.

Harris, Z. (1954). Distributional structure. Word, 10(23):146-162.

Hill, F., Kiela, D., and Korhonen, A. (2013a). Concreteness and Corpora: A Theoretical and Practical Analysis. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 75-83.

Hill, F., Korhonen, A., and Bentz, C. (2013b). A Quantitative Empirical Analysis of the Abstract/Concrete Distinction. Cognitive Science, 38(1):162-177.

Landauer, T. K. and Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211-240.

Leung, J. H. C. and Williams, J. N. (2014). Crosslinguistic Differences in Implicit Language Learning. Studies in Second Language Acquisition, 36(4):733-755.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

Paciorek, A. and Williams, J. (2015). Semantic generalisation in implicit language learning. Journal of Experimental Psychology: Learning, Memory and Cognition.

Rogers, T. T. and McClelland, J. L. (2004). Semantic Cognition: A Parallel Distributed Processing Approach. MIT Press.

Rohde, D., Gonnerman, L. M., and Plaut, D. C. (2006). An improved model of semantic similarity based on lexical co-occurrence. Communications of the ACM.

Williams, J. N. (2005). Learning without awareness. Studies in Second Language Acquisition, 27:269-304.