    A Distributional Semantics Approach to Implicit Language Learning

                             Dimitrios Alikaniotis      John N. Williams
                           Department of Theoretical and Applied Linguistics
                                       University of Cambridge
                          9 West Road, Cambridge CB3 9DP, United Kingdom
                                  {da352|jnw12}@cam.ac.uk


1 Introduction

Vector-space models of semantics (VSMs) derive word representations by keeping track of the co-occurrence patterns of each word when found in large linguistic corpora. By exploiting the fact that similar words tend to appear in similar contexts (Harris, 1954), such models have been very successful in tasks of semantic relatedness (Landauer and Dumais, 1997; Rohde et al., 2006). A common criticism levelled at such models is that those co-occurrence patterns do not explicitly encode specific semantic features, unlike more traditional models of semantic memory (Collins and Quillian, 1969; Rogers and McClelland, 2004). Recently, however, corpus studies (Bresnan and Hay, 2008; Hill et al., 2013b) have shown that some ‘core’ conceptual distinctions such as animacy and concreteness are reflected in the distributional patterns of words and can be captured by such models (Hill et al., 2013a).

In the present paper we argue that the distributional characteristics of words are particularly important when considering concept availability under implicit language learning conditions. Studies of implicit learning of form-meaning connections have highlighted that during the learning process only a restricted set of conceptual distinctions is available, such as those involving animacy and concreteness. For example, in studies by Williams (2005) (W) and Leung and Williams (2014) (L&W) the participants were introduced to four novel determiner-like words: gi, ro, ul, and ne. They were explicitly told that these words functioned like the article ‘the’ but that gi and ro were used with near objects and ul and ne with far objects. What they were not told was that gi and ul were used with living things and ro and ne with non-living things. Participants were exposed to grammatical determiner-noun combinations in a training task and afterwards given novel determiner-noun combinations to test for generalisation of the hidden regularity. W and L&W report such a generalisation effect even in participants who remained unaware of the relevance of animacy to article usage – semantic implicit learning. Paciorek and Williams (2015) (P&W) report similar effects for a system in which novel verbs (rather than determiners) collocate with either abstract or concrete nouns. However, certain semantic constraints on semantic implicit learning have been observed. In P&W, generalisation was weaker when tested with items that were of relatively low semantic similarity to the exemplars received in training. In L&W, Chinese participants showed implicit generalisation of a system in which determiner usage was governed by whether the noun referred to a long or flat object (corresponding to the Chinese classifier system), whereas there was no such implicit generalisation in native English speakers. Based on this evidence we argue that the implicit learnability of semantic regularities depends on the degree to which the relevant concept is reflected in language use. By forming semantic representations of words based on their distributional characteristics we may be able to predict what would be learnable under implicit learning conditions.

2 Simulation

We obtained semantic representations using the skip-gram architecture (Mikolov et al., 2013) provided by the word2vec package,¹ trained with hierarchical softmax on the British National Corpus or on a Chinese Wikipedia dump file of comparable size. The parameters used were as follows: window size: 5, vector dimensionality: 300, subsampling threshold: t = e−3, only for the English corpus.

¹ https://code.google.com/p/word2vec/
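For reference, the sketch below shows one way to obtain such representations with the settings reported above. It is a minimal illustration rather than the authors' pipeline: it uses gensim's Word2Vec as a stand-in for the original word2vec tool, and the corpus file name is hypothetical.

```python
# Minimal sketch: skip-gram vectors with hierarchical softmax, mirroring the
# reported settings (window 5, 300 dimensions, subsampling for English).
# gensim stands in for the original word2vec tool; the file name is hypothetical.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("bnc_tokenised.txt")   # one pre-tokenised sentence per line

model = Word2Vec(
    sentences,
    sg=1,              # skip-gram architecture
    hs=1, negative=0,  # hierarchical softmax instead of negative sampling
    vector_size=300,   # vector dimensionality
    window=5,          # symmetric context window
    sample=1e-3,       # subsampling threshold (applied to the English corpus only)
)

lion_vec = model.wv["lion"]   # 300-dimensional representation fed to the classifier below
```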


Copyright © by the paper’s authors. Copying permitted for private and academic purposes.
In Vito Pirrelli, Claudia Marzi, Marcello Ferro (eds.): Word Structure and Word Usage. Proceedings of the NetWordS Final Conference, Pisa, March 30–April 1, 2015, published at http://ceur-ws.org
[Figure 1 (plot): Activation vs. Epoch for grammatical vs. ungrammatical pairs, Williams (2005).]

Figure 1: Generalisation gradients obtained from the Williams (2005) dataset. The gradients were obtained by averaging the output activations for the grammatical and the ungrammatical pairs, respectively. The network hyperparameters used were: learning rate: η = 0.01, weight decay: γ = 0.01, size of hidden layer: h ∈ R^100. For this and all the reported simulations the dashed vertical lines mark the epoch in which the training error approached zero. See text for more information on the experiment.

[Figure 2 (plots): Activation vs. Epoch and Endorsement rates for Abstract/Concrete × Grammatical/Ungrammatical items, Paciorek & Williams (2015), exp. 1.]


Figure 2: Results of our simulation along with the behavioural results of Paciorek and Williams (2015), exp. 1. The hyperparameters used were the same as in the simulation of Williams (2005).

The skip-gram model encapsulates the idea of distributional semantics introduced above by learning which contexts are more probable for a given word. Concretely, it uses a neural network architecture in which each word from a large corpus is presented in the input layer and its context (i.e. several words around it) in the output layer. The goal of the network is to learn a configuration of weights such that, when a word is presented in the input layer, the output nodes that become most activated correspond to the vocabulary words that appeared most frequently as its context.

As argued above, the resulting representations will carry, by means of their distributional patterns, semantic information such as concreteness or animacy. Consistent with the above hypotheses, we predict that, given a set of words in the training phase, the degree to which one can generalise to novel nouns will depend on how much the relevant concepts are reflected in those words. If, for example, the words used during the training session do not encode animacy in their co-occurrence statistics, despite denoting animate nouns, then generalising to other animate nouns would be more difficult.

In order to examine this prediction, we fed the resulting semantic representations to a non-linear classifier (a feedforward neural network) whose task was to learn to associate noun representations with determiners or verbs, depending on the study in question. During the training phase, the neural network received the semantic vectors of the nouns as input and the corresponding determiners/verbs (coded as 1-in-N binary vectors, where N is the number of novel non-words)² as the output vector. Using backpropagation with stochastic gradient descent as the learning algorithm, the goal of the network was to learn to discriminate between grammatical and ungrammatical noun–determiner/verb combinations. We hypothesise that this should be possible if either specific features of the input representation or a combination of them contain the relevant concepts. Considering the distributed nature of our semantic representations, we explore the latter option by adding a tanh hidden layer, the purpose of which was to extract non-linear combinations of features of the input vector.

² All the studies reported use four novel non-words.
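A minimal sketch of this classifier follows, assuming NumPy, our own variable names, and the hyperparameters listed in the Figure 1 caption (learning rate 0.01, weight decay 0.01, 100 hidden units); it illustrates the architecture described above rather than reproducing the authors' code.

```python
# Minimal sketch (our variable names): feedforward classifier mapping a noun's
# 300-d skip-gram vector to a softmax over the four novel non-words, with a
# tanh hidden layer and SGD plus weight decay.
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 300, 100, 4                      # input dim, hidden units, non-words
W1, b1 = rng.normal(0, 0.1, (H, D)), np.zeros(H)
W2, b2 = rng.normal(0, 0.1, (K, H)), np.zeros(K)
eta, gamma = 0.01, 0.01                    # learning rate and weight decay (Fig. 1)

def forward(w_vec):
    h = np.tanh(W1 @ w_vec + b1)           # non-linear combinations of input features
    net = W2 @ h + b2                      # net input to each determiner/verb unit
    p = np.exp(net - net.max())
    return h, p / p.sum()                  # softmax activations over the non-words

def train_step(w_vec, target_idx):
    """One SGD update for a (noun vector, paired non-word index) training item."""
    global W1, b1, W2, b2
    h, p = forward(w_vec)
    d_out = p.copy(); d_out[target_idx] -= 1.0      # cross-entropy error signal
    d_hid = (W2.T @ d_out) * (1.0 - h ** 2)         # backpropagate through tanh
    W2 -= eta * (np.outer(d_out, h) + gamma * W2); b2 -= eta * d_out
    W1 -= eta * (np.outer(d_hid, w_vec) + gamma * W1); b1 -= eta * d_hid
```

During training, each noun's semantic vector is passed to train_step together with the index of the non-word it occurred with; at test, forward applied to a novel noun's vector yields the output activations used below.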




[Figure 3 (plots): Activation vs. Epoch and Endorsement rates for Abstract/Concrete × Grammatical/Ungrammatical items, Paciorek & Williams (2015), exp. 4.]

[Figure 4 (plots): Activation vs. Epoch for Chinese vs. English grammatical items, and RT (ms) for grammatical vs. ungrammatical sequences in English and Chinese, Leung & Williams (2014), exp. 3.]


Figure 3: Results of our simulation along with the behavioural results of Paciorek and Williams (2015), exp. 4. The hyperparameters used were the same as in the simulation of Williams (2005).

Figure 4: Results from Leung and Williams (2014), exp. 3. See text for more info on the measures used. The gradients for the ungrammatical combinations are (1 − grammatical). The value of the weight decay was set to γ = 0.05 while the rest of the hyperparameters used were the same as in the simulation of Williams (2005).

We then recorded the generalisation ability through time (epochs) of our classifier by simply asking what would be the probability of encountering a known determiner k with a novel word w⃗, by taking the softmax function:

p(y = k | w⃗) = exp(net_k) / Σ_{k′ ∈ K} exp(net_{k′})        (1)
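To illustrate how the gradients in Figs. 1–4 are read off, the sketch below reuses forward() from the classifier sketch above; the generalisation items and index constants are hypothetical.

```python
# Minimal sketch (hypothetical item list): mean eq. (1) probability of the
# grammatical vs. ungrammatical determiner/verb over the generalisation set,
# recorded once per training epoch to give a generalisation gradient.
def generalisation_gradient(test_items, word_vectors):
    """test_items: (noun, grammatical_idx, ungrammatical_idx) triples."""
    gram, ungram = [], []
    for noun, g_idx, u_idx in test_items:
        _, p = forward(word_vectors[noun])    # softmax of eq. (1) over the non-words
        gram.append(p[g_idx]); ungram.append(p[u_idx])
    return sum(gram) / len(gram), sum(ungram) / len(ungram)

# e.g. for Williams (2005), with GI and RO as (hypothetical) output indices:
# test_items = [("lion", GI, RO), ...]   # 'gi lion' grammatical, 'ro lion' not
```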
3 Results and Discussion

Figures 1–4 show the results of the simulations across four different datasets which reflect different semantic manipulations. The simulations show the generalisation gradients obtained by applying eq. (1) to every word in the generalisation set and then keeping track of the activation of the different determiners (W, L&W) or verbs (P&W) through time. For example, in W, where the semantic distinction was between animate and inanimate concepts, ‘gi lion’ would be considered a grammatical sequence while ‘ro lion’ an ungrammatical one. If the model has been successful in learning that ‘gi’ should be activated more given animate concepts, then the probability P(y = gi | w⃗_lion) would be higher than P(y = ro | w⃗_lion). Fig. 1 shows the performance of the classifier on the testing set of W where, in the behavioural data, selection of the grammatical item was significantly above chance in a two-alternative forced-choice task for the unaware group. The slopes of the gradients clearly show that on such a task the model would favour grammatical combinations as well.

Figures 2–3 plot the results of two experiments from P&W which focused on the abstract/concrete distinction. P&W used a false memory task in the generalisation phase, measuring learning by comparing the endorsement rates between novel grammatical and novel ungrammatical verb-noun pairs. It was reasoned that if the participants had some




knowledge of the system they would endorse more novel grammatical sequences. Expt 1 (Fig. 2) used generalisation items that were higher in semantic similarity to the trained items than was the case in Expt 4 (Fig. 3). The behavioural results from the unaware groups (bottom rows) show that this manipulation resulted in larger grammaticality effects on familiarity judgements in Expt 1 than in Expt 4, and also in higher endorsements for concrete items in general in Expt 1. Our simulation was able to capture both of these effects.

L&W Expt 3 examined the learnability of a system based on a long/flat distinction, which is reflected in the distributional patterns of Chinese but not of English. In Chinese, nouns denoting long objects have to be preceded by one specific classifier and nouns denoting flat objects by another. L&W’s training phase consisted of showing participants combinations of long/flat objects with novel determiners, asking them to judge whether the noun was long or flat. After a period of exposure, participants were introduced to novel determiner–noun combinations, which either followed the grammatical system (control trials) or did not (violation trials). Participants had significantly lower reaction times (Fig. 4, bottom row) when presented with a novel grammatical sequence than with an ungrammatical sequence, an effect not observed in the RTs of the English participants. The corresponding results of our simulations plotted in Fig. 4 show that the regularity was indeed learnable when the semantic model had only experienced a Chinese text, but not when it had experienced the English corpus.

While more direct evidence is needed to support our initial hypothesis, our results suggest that the semantic information encoded by the distributional characteristics of words in large corpora can be important in determining what could be implicitly learnable.

References

Bresnan, J. and Hay, J. (2008). Gradient grammar: An effect of animacy on the syntax of give in New Zealand and American English. Lingua, 118(2):245–259.

Collins, A. M. and Quillian, M. R. (1969). Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior, 8(2):240–247.

Harris, Z. (1954). Distributional structure. Word, 10(23):146–162.

Hill, F., Kiela, D., and Korhonen, A. (2013a). Concreteness and corpora: A theoretical and practical analysis. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 75–83.

Hill, F., Korhonen, A., and Bentz, C. (2013b). A quantitative empirical analysis of the abstract/concrete distinction. Cognitive Science, 38(1):162–177.

Landauer, T. K. and Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240.

Leung, J. H. C. and Williams, J. N. (2014). Crosslinguistic differences in implicit language learning. Studies in Second Language Acquisition, 36(4):733–755.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Paciorek, A. and Williams, J. (2015). Semantic generalisation in implicit language learning. Journal of Experimental Psychology: Learning, Memory and Cognition.

Rogers, T. T. and McClelland, J. L. (2004). Semantic Cognition: A Parallel Distributed Processing Approach. MIT Press.

Rohde, D., Gonnerman, L. M., and Plaut, D. C. (2006). An improved model of semantic similarity based on lexical co-occurrence. Communications of the ACM.

Williams, J. N. (2005). Learning without awareness. Studies in Second Language Acquisition, 27:269–304.



