=Paper=
{{Paper
|id=Vol-1419/paper0120
|storemode=property
|title=Learning of Time Varying Functions is Based on Association Between Successive Stimuli
|pdfUrl=https://ceur-ws.org/Vol-1419/paper0120.pdf
|volume=Vol-1419
|dblpUrl=https://dblp.org/rec/conf/eapcogsci/YangL15
}}
==Learning of Time Varying Functions is Based on Association Between Successive Stimuli==
Lee-Xieng Yang (lxyang@nccu.edu.tw), Department of Psychology, Research Center for Mind, Brain and Learning, National Chengchi University, No.64, Sec.2, ZhiNan Rd., Taipei City 11605, Taiwan (R.O.C)

Tzu-Hsi Lee (103752010@nccu.edu.tw), Department of Psychology, National Chengchi University, No.64, Sec.2, ZhiNan Rd., Taipei City 11605, Taiwan (R.O.C)

===Abstract===
In function learning, the to-be-learned function is normally designed to be time invariant. However, when the magnitudes of a variable are defined by time points, the function varies along time. Because of this essential difference, the learning of time-varying functions should differ from the learning of other functions. Specifically, the correlation between successive stimuli should play an important role in learning such functions. In this study, three experiments were conducted with the correlation set as high and positive, high and negative, and low and positive. The results show that people perform well when the correlation between successive stimuli is high, whether positive or negative, and that people have difficulty learning a time-varying function with a low correlation between successive stimuli. A simple two-layered neural network model provides a good account of the data from all experiments. These results suggest that learning a time-varying function is based on the association between successive stimuli.

Keywords: Function Learning; Time Varying Function

===Function Learning===
We live in an orderly world, in which variables are mostly correlated with each other. For instance, the probability of rain might be a function of the extent to which the sky is overcast with dark clouds, and the distance to the car in front needed to avoid a crash is a function of the current car speed. The study of how people learn a function, and of what people form to represent a learned function, is referred to as function learning.

There are two contrasting theoretical accounts of function learning. The rule-based account posits that people construct abstract rules to summarize the ensemble of experienced pairs of stimuli and responses used to teach the function. Most frequently, polynomial rules have been proposed as the representations of the mappings between stimulus magnitudes and response magnitudes (see Carroll, 1963; Koh & Meyer, 1991). In contrast, the associative-based model assumes that people form direct associations between each stimulus and its corresponding response without abstracting any summary information (Busemeyer, Byun, Delosh, & McDaniel, 1997; DeLosh, Busemeyer, & McDaniel, 1997). However, the rule-based account overestimates participants' performance in the extrapolation test, whereas the associative-based model underestimates it. To obtain a better theoretical account, a hybrid model combining these two approaches has been proposed (McDaniel & Busemeyer, 2005).

Although these models differ in the type of representation they assume is formed in function learning, they basically agree that the representation is formed for the whole function. Contrary to this idea, it has been found that people may form different representations for different parts of the function, such that a quadratic function was learned as the composition of two simpler monotonic functions, which were chosen for use in different contexts (Lewandowsky, Kalish, & Ngang, 2002). The POLE model (Kalish, Lewandowsky, & Kruschke, 2004) accounts for this finding well, by virtue of an architecture consisting of many modules, each of which represents a linear function corresponding only to a small region of the function, and a gating mechanism that always chooses one of the modules for use according to the stimulus value. Strictly speaking, the real function is not learned but approximated by the composition of many smaller linear functions.

Past studies have tested different functions and shown a number of characteristics of function learning. First, linear functions are easier to learn than nonlinear ones (see Busemeyer et al., 1997; Koh & Meyer, 1991). Second, predicting the response is more accurate for a stimulus whose value falls inside the training range (i.e., interpolation) than outside the range (i.e., extrapolation) (see Busemeyer et al., 1997; McDaniel & Busemeyer, 2005). Third, although functions of simpler forms (e.g., linear or power functions) can be learned with variables of non-numeric forms (e.g., line length), Kalish (2013) reported that periodic functions (e.g., the sine function) cannot be learned without numeric stimuli. These characteristics reveal the limitations of human cognition in learning the functional relation between variables.

===Time-Varying Function===
Although many forms of functions have been tested, one particular form, which maps the time of an observation to the event at that time, seems not to have been tested yet. We call this a time-varying function in this article, <math>y = f(t)</math>. An example would be the height of water accumulating in a bucket from a constant supply source. If the bucket is cylindrical, the height will be a linear function of time, and if the bucket is conical, the height will be a nonlinear function of time. To our knowledge, how people learn this kind of function has never been reported in the literature. However, a relevant case in category learning has been reported recently.

Navarro and his colleagues tested how people learn categories when the category structure varies along training trials. In one of their experiments, the members of two categories moved up the stimulus dimension constantly as the trial number increased, and the categorization rule was set up as "Respond A if <math>x_t > t</math>, and B otherwise" for any item <math>x_t</math> on trial <math>t</math>. Their results showed that participants could not only learn this category structure but also predict the item value on the next trial (Navarro & Perfors, 2009, 2012; Navarro, Perfors, & Vong, 2013). This implies that people are able to capture some functional relationship between the time point (or trial number) and the stimulus value. However, the learning of time-varying functions might differ from the learning of normal functions.

====Comparison Between Time-Varying Function and Normal Function====
Some features of time-varying functions are worth noting. First, because time never runs backward, making a prediction for the response magnitude on each trial of a time-varying function is always an extrapolation from what people have learned. In the case of learning the function <math>y = f(x)</math>, by contrast, both interpolation and extrapolation tests can be conducted.

Second, a time-varying function can be viewed as a function defining the relationship between successive stimuli, <math>x_t = f(x_{t-1})</math>. A good example is the game of throwing a Frisbee with friends. In this case, the only observable information is the spatial position of the Frisbee at any time point. Therefore, the best cue for estimating the position of the Frisbee at time <math>t</math> is its position at time <math>t - 1</math>.

Third, the learnability or complexity of a function would be defined differently for a time-varying function. For the case of <math>y = f(x)</math>, the linear function has fewer parameters to estimate than the quadratic function and is hence easier to learn. For the case of <math>y = f(t)</math>, learning the functional relationship between time point and response magnitude is equivalent to learning to predict the next response magnitude from the currently observed response magnitude. Thus, it is hypothesized that a time-varying function would be easy to learn if the correlation between successive stimuli is high, and hard to learn if that correlation is low. To verify this hypothesis, three experiments were conducted.

===Experiment 1===
In this experiment, we first examined whether people can learn a linear time-varying function. The function was written as <math>x_t = t + \varepsilon_t</math>, where <math>t</math> was the trial number from 1 to 100 and <math>\varepsilon_t</math> was randomly sampled from the uniform distribution between -0.5 and 0.5. All stimulus values were normalized between -15 and 15 for the convenience of computer programming. It was reasonable to expect that this function could be learned well, for (1) it was linear and (2) the correlation between successive stimuli was high.

====Method====
'''Participants and Apparatus.''' In total, 22 participants were recruited from National Chengchi University in Taiwan for this experiment. Each participant was reimbursed NTD$ 60 (≈ US$ 2) for their time and travel expenses. The whole experiment was conducted on an IBM compatible PC in a quiet booth. Stimulus display and response recording were controlled by a computer script written with PsychoPy (Peirce, 2007).

Figure 1: The stimulus structure in Experiment 1 (i.e., crosses) and the participants' predictions (i.e., circles) in Session 1 averaged across all participants. (Panels: Session 1 and Session 2; x-axis: trial; y-axis: position, from -15 to 15.)

'''Procedure.''' The participants were instructed that they were playing a shooting game. In this game, they had to guess the position of a target on a horizontal line on the computer screen. On each trial, they moved the mouse cursor to where they thought the target would appear. After they pressed the space key to complete the guess, the target appeared as an arrow at the correct position, together with the feedback text "Hit" or "Miss" on the screen. The participants were told that "Hit" meant that the guess was close enough to the true answer and that otherwise they would get "Miss". The whole experiment was conducted in two sessions, each of which consisted of 100 trials. The same 100 stimuli were presented in the two sessions. The distance between the target's correct position and the participant's guess was defined as the error. The squared error and the proportion of "Hit" responses (i.e., accuracy) were the dependent variables in this experiment.

====Results====
Visual inspection of Figure 1 shows that participants performed quite well except on the very early trials. (To make the figure easier to read, the human predictions are plotted as circles and the correct answers as crosses on only the even-numbered trials in the first session. The result pattern is the same in the second session.) To simplify the data analysis, we divided the 100 stimuli into 10 blocks. The squared prediction error decreases from 40.29 to 0.03, with a mean of 4.06, through the 10 blocks across the two sessions. A Block (10) × Session (2) within-subjects ANOVA reveals a significant main effect of Block on the squared error [F(9, 189) = 72.83, MSe = 98, p < .01], no significant main effect of Session [F(1, 21) = 2.367, MSe = 166.30, p = .139], and a significant interaction effect between Block and Session [F(9, 189) = 2.346, MSe = 166.3, p < .05].

The participants' accuracy is another dependent variable, computed as the number of "Hit" responses divided by the number of trials. Because the "Hit" range was very small in our experiments, the highest accuracy in a block was .63 and the lowest was .36 across all sessions. A Block (10) × Session (2) within-subjects ANOVA shows a significant main effect of Block on accuracy [F(9, 189) = 8.281, MSe = 0.028, p < .01], no significant main effect of Session [F(1, 21) < 1], and a significant interaction effect between Block and Session [F(9, 189) = 5.052, MSe = 0.027, p < .01].

We also checked the correlation between each participant's predictions and the true answers. The Pearson's r averaged across all participants is quite high [r = .97]. Together with the visual inspection of Figure 1, this confirms that people can learn the linear time-varying function very well.

===Experiment 2===
In this experiment, the function was set up as <math>x_t = 50 + (-1)^t \sqrt{100 - t}</math>, which made the target jump left and right while gradually moving toward the central point. Obviously, this function was nonlinear and far more complex than the one used in Experiment 1. If the learning of <math>y = f(t)</math> shared the characteristics of the learning of <math>y = f(x)</math>, this function should not be learned well. However, if our discussion of the characteristics of time-varying functions was right, this function should be learned well, owing to the high correlation between successive stimuli [r = −.99].

====Method====
'''Participants and Apparatus.''' In total, 21 participants were recruited from National Chengchi University in Taiwan for this experiment. Each participant was reimbursed NTD$ 60 (≈ US$ 2) for their time and travel expenses. The testing materials and procedure were the same as those in Experiment 1.

Figure 2: The stimulus structure in Experiment 2 (i.e., crosses) and the participants' predictions (i.e., circles) in Session 1 averaged across all participants. (Panels: Session 1 and Session 2; x-axis: trial; y-axis: position, from -15 to 15.)

====Results====
See the circles and crosses in Figure 2. Apparently, the participants could capture the moving pattern of the target, although they made some larger errors on the early trials. Similar to what we found in Experiment 1, the squared prediction error drops along blocks from 73.79 to 1.57 (mean = 15.35) across the two sessions. A Block (10) × Session (2) within-subjects ANOVA reveals a significant main effect of Block [F(9, 180) = 14.24, MSe = 1303, p < .01], a significant main effect of Session [F(1, 20) = 17.22, MSe = 196, p < .01], and a significant interaction effect between Block and Session [F(9, 180) = 16.12, MSe = 177.8, p < .01]. Although the error curve goes down toward 0, the mean squared prediction error of 15.53 is far larger than that in Experiment 1, which is 4.06. This suggests that the linear function is easier to learn than the quadratic function.

The accuracy data also suggest that this function is harder to learn than the linear function: the highest mean accuracy in a block across all participants and sessions is .34 and the lowest is .14. A Block (10) × Session (2) within-subjects ANOVA reveals a significant main effect of Block [F(9, 180) = 9.747, MSe = 0.018, p < .01], no significant main effect of Session [F(1, 20) < 1], and no significant interaction effect between Block and Session [F(9, 180) < 1]. Although the accuracy is quite low, this does not mean that people cannot learn this function. As shown in Figure 2, the participants' predictions are close to the true answers. Also, the correlation between each participant's predictions and the true answers is considerably high [mean r = .92]. As expected, the participants can learn this complex time-varying function.

===Experiment 3===
In this experiment, we examined whether people could predict the stimulus magnitudes when the correlation between successive stimuli was lower. See Figure 3 for an example, which was the actual sequence used for testing one participant (different participants received different moving patterns to learn). The dashed line shows the true moving pattern of the stimulus, which was generated by <math>y = g[a] + z[b + 1]</math>, where <math>a = \lfloor (t + 4)/5 \rfloor</math>, <math>b = t \bmod 5</math>, <math>g</math> was a random permutation of the vector [1, 6, 11, ..., 96], and for each element of <math>g</math>, <math>z</math> was a new random permutation of the vector [1, 2, 3, 4, 5]. The correlation between successive stimuli, averaged across all participants and all sessions, was r = .80, which was lower than the correlations in the previous experiments. From either point of view (i.e., number of parameters to estimate or correlation between successive stimuli), it was expected that this function could not be learned well.

Figure 3: The stimulus structure in Experiment 3 (i.e., crosses) and predictions of participant #14 (i.e., circles). (Panel: Session 1; x-axis: trial; y-axis: position, from -15 to 15.)

====Method====
'''Participants and Apparatus.''' In total, 18 participants were recruited for this experiment from National Chengchi University in Taiwan. Each participant was reimbursed NTD$ 60 (≈ US$ 2) for their time and travel expenses. The testing materials and procedure were the same as those in Experiment 1.

====Results====
As shown in Figure 3, the participant apparently could not predict the target position. Otherwise, the dashed line (for answers) and the solid line (for the participant's predictions) would be superimposed on each other. However, the response pattern is not random either. In fact, the participant's predictions seem always to be one step behind the true answers. Although we do not show the predictions of the other 17 participants, their predictions are also one step behind the true answers. Thus, strictly speaking, we do not think that the participants learned this function.

The squared prediction error drops from 69.69 to 42.47 along blocks in Session 1 and shows no clear change, from 23.12 to 24.30, in Session 2. Although the performance gets better in Session 2, the prediction error never comes close to 0. The mean squared error for all participants across blocks and sessions is 30.844, which is larger than 15.53 (mean error in Experiment 2) and 4.06 (mean error in Experiment 1). Thus, the learning performance in this experiment is the worst among the three experiments in this study.

As in the previous experiments, a Block (10) × Session (2) within-subjects ANOVA was conducted on the prediction error. The results show no significant main effect of Block [F(9, 153) = 1.53, MSe = 998.4, p = .142], a significant main effect of Session [F(1, 17) = 14.94, MSe = 424, p < .01], and a significant interaction effect between Block and Session [F(9, 153) = 3.206, MSe = 701.6, p < .01].

The mean accuracy in a block across all sessions is even lower than that in the other two experiments. The highest mean accuracy is about .11 and the lowest is .06. It is clear that the participants cannot capture the moving pattern of the stimulus. A Block (10) × Session (2) within-subjects ANOVA shows no main effect of Block on accuracy [F(9, 153) = 1.179, MSe = 0.006, p = .312], no main effect of Session [F(1, 17) = 3.367, MSe = 0.006, p = .08], and no interaction effect between Block and Session [F(9, 153) < 1].

We also computed the Pearson's r between each participant's predictions and the true answers. Although the mean correlation is not low (r = .76), this finding might result from the fact that the participants' prediction is always one step behind the true answer. To sum up, the linear function is the easiest to learn and the quadratic function is the second easiest. Basically, participants cannot learn the complex function in Experiment 3. To better understand the underlying mechanism for learning time-varying functions, we developed a neural network model of this learning.

===Model for Learning Time Varying Function===
A time-varying function can be rewritten as <math>x_t = f(x_{t-1})</math>, and its simplest form would be <math>x_t = \beta_0 + \beta_1 x_{t-1}</math>. Thus, learning a time-varying function is equivalent to estimating the optimal parameter values, with which the model makes the smallest error. To this end, a simple two-layered neural network is proposed. There are two input nodes, which respectively correspond to the position of the stimulus on the preceding trial, <math>x_{t-1}</math>, and the standard moving distance, which is set to 1. There is only one output node, corresponding to the predicted position on the current trial, <math>\hat{x}_t = w_1 \times 1 + w_2 x_{t-1}</math>. The associative weight <math>w_1</math> represents the size of the moving distance. The weight <math>w_2</math> represents how strongly the last position is correlated with the current position. When the true answer <math>x_t</math> is provided, the error is computed as <math>x_t - \hat{x}_t</math>.

The associative weights are updated with the Widrow-Hoff (WH) algorithm (Abdi, Valentin, & Edelman, 1999) to decrease the error made by the model. (This algorithm is a special case of the backpropagation algorithm, used specifically for two-layered neural network models.) In addition, we make the size of the weight update decay throughout the training trials. Thus, the update for <math>w_1</math> on trial <math>t</math> is <math>\Delta w_{1,t} = \eta \exp(-\xi (t-1)) (x_t - \hat{x}_t)</math>, where <math>\eta \ge 0</math> is the learning rate and <math>\xi \ge 0</math> determines how quickly the size of the weight update drops. Likewise, <math>\Delta w_{2,t} = \eta \exp(-\xi (t-1)) (x_t - \hat{x}_t) x_{t-1}</math>.

Some features of this model are worth noting. First, the associative weight <math>w_2</math> actually reflects the correlation between successive stimuli. Second, this model learns only the correlation between successive stimuli and contains no summary information about the whole function. In fact, it can be applied to the learning of different time-varying functions because, whatever form (complex or simple) the function has, the learning of a time-varying function can always be viewed as the learning of the association between successive stimuli. Thus, our model should be regarded as an associative-based model, not a rule-based model.

====Modeling====
The model was fit to each participant's data in each experiment, with the stimulus positions normalized between 0 and 1. Each participant's first response in each session was by default the first input to the model. The initial weights <math>w_1</math> and <math>w_2</math> were set to 0 for all experiments except Experiment 3. The model provided the best fit to the Experiment 3 data when <math>w_2</math> was initially set to 1, suggesting that participants in Experiment 3 were more likely to repeat the observed position of the stimulus on the preceding trial as the response for the current trial. The optimally estimated parameter values and the goodness of fit (RMSD) for all experiments are listed in Table 1.

{| class="wikitable"
|+ Table 1: Mean goodness of fit (RMSD) and mean estimated parameter values for the best fit, with standard deviations in parentheses.
! Experiment !! RMSD !! η !! ξ
|-
| Exp 1 || 0.04 (0.02) || 1.06 (0.71) || 0.02 (0.09)
|-
| Exp 2 || 0.08 (0.03) || 1.73 (1.14) || 0.30 (0.55)
|-
| Exp 3 || 0.09 (0.03) || 0.43 (0.55) || 1.81 (4.14)
|}

The smaller the RMSD, the better the fit. Apparently, the model fits all the data very well. See the crosses in Figure 4, Figure 5, and Figure 6 for the model predictions in Session 1 (the pattern is almost the same for Session 2), which are quite close to the circles denoting the participants' responses.

The estimated learning rate for Experiment 1 is about 1 and the decay rate is quite small, suggesting that the decay of learning is not fast and that learning continues through the training trials. The learned associative weights for the moving size, <math>w_1 = 0.30</math>, and for the correlation with the preceding stimulus, <math>w_2 = 0.70</math>, suggest that the participants predict the current position of the target by moving it a certain distance (i.e., 0.30 times the standard moving size) from a place a bit behind (i.e., 70% of) the position just seen, in the same direction as the last move.

For Experiment 2, the mean learning rate is high and so is the mean decay rate. This suggests that the model adjusts the associative weights substantially on the early learning trials but quickly stops doing so. The learned associative weights are <math>w_1 = 1.00</math> and <math>w_2 = -0.94</math>. The negative weighting of the preceding position enables the model to make symmetrical predictions between successive trials, and <math>|w_2| \le 1</math> enables the model to make the predicted position gradually converge toward the midpoint.

For Experiment 3, the mean estimated learning rate is low and the decay rate is high, suggesting that the model has hardly updated the associative weights since the early trials. In fact, the learned associative weights, <math>w_1 = 0.01</math> and <math>w_2 = 0.98</math>, together suggest that the model merely repeats the preceding target position as the current prediction. As the model captures the participants' response patterns very well, this implies that the participants did not actually learn the function but just repeated what they saw as the prediction for the next trial.

In Experiment 2, the larger η or ξ is, the smaller the error is (r = −.51, p < .05 for η and r = −.57, p < .01 for ξ), but there are no significant correlations between the parameters and human performance in the other experiments. This might be because Experiment 1 and Experiment 3 are respectively too easy and too hard for the participants to learn.

Figure 4: The model prediction and averaged human response in Session 1 in Experiment 1. (x-axis: trial; y-axis: position; series: Human, Model.)

Figure 5: The model prediction and averaged human response in Session 1 in Experiment 2. (x-axis: trial; y-axis: position; series: Human, Model.)

Figure 6: The model prediction and human response of participant #14 in Session 1 in Experiment 3. (x-axis: trial; y-axis: position; series: Human, Model.)

===General Discussion===
The main purpose of this study is to examine the characteristics of function learning with time-varying functions. Three experiments were conducted with different time-varying functions: linear, quadratic, and irregular. These functions differ not only in the complexity of the function form but also in the strength of the correlation between successive stimuli. In the first two experiments, the correlation is very high regardless of its direction, whereas in the third experiment, the correlation is lower.

The behavioral data show that the linear and quadratic functions are easier to learn than the irregular function, suggesting that the correlation between successive stimuli, not the number of parameters (or the complexity) of the function, is critical to function learning with time-varying functions. The success of our model supports the associative-based account and implies that a time-varying function can be learned as a composition of many partial representations, not as a holistic representation.

One may regard the learning of time-varying functions as operant conditioning. That may or may not be true, depending on what we think is actually conditioned. If the response is the target of conditioning, then the learning of time-varying functions is not operant conditioning, as every single response is new and it is impossible to reinforce the likelihood of the same response being made in the future. However, if the moving size is the target of conditioning, then for the case in which the target moves constantly (e.g., the linear function in Experiment 1), we may regard the learning of the time-varying function as a kind of operant conditioning. For the case in which the target moves at a decreasing (or increasing) speed (e.g., the quadratic function in Experiment 2), it might not be suitable to equate the learning of time-varying functions with operant conditioning. Future studies including transfer trials are needed to examine whether people form any concept of the time-varying function.

===References===
Abdi, H., Valentin, D., & Edelman, B. (1999). Neural networks. SAGE Publications, Inc.

Busemeyer, J. R., Byun, E., Delosh, E., & McDaniel, M. A. (1997). Learning functional relations based on experience with input-output pairs by humans and artificial neural networks (K. Lamberts & D. R. Shanks, Eds.). Cambridge, MA, US: The MIT Press.

Carroll, J. D. (1963). Function learning: The learning of continuous functional maps relating stimulus and response continua. Princeton, NJ: Educational Testing Service.

DeLosh, E. L., Busemeyer, J. R., & McDaniel, M. A. (1997). Extrapolation: The sine qua non for abstraction in function learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23, 968-986.

Kalish, M. (2013). Learning and extrapolating a periodic function. Memory & Cognition, 41, 886-896.

Kalish, M., Lewandowsky, S., & Kruschke, J. K. (2004). Population of linear experts: Knowledge partitioning and function learning. Psychological Review, 111, 1072-1099.

Koh, K., & Meyer, D. E. (1991). Function learning: Induction of continuous stimulus-response relations. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17, 811-836.

Lewandowsky, S., Kalish, M., & Ngang, S. K. (2002). Simplified learning in complex situations: Knowledge partitioning in function learning. Journal of Experimental Psychology: General, 131, 163-193.

McDaniel, M. A., & Busemeyer, J. R. (2005). The conceptual basis of function learning and extrapolation: Comparison of rule-based and associative-based models. Psychonomic Bulletin & Review, 12, 24-42.

Navarro, D. J., & Perfors, A. (2009). Learning time-varying categories. In Proceedings of the 31st annual conference of the cognitive science society (p. 414-424). Austin, TX: Cognitive Science Society.

Navarro, D. J., & Perfors, A. (2012). Anticipating changes: Adaption and extrapolation in category learning. In N. Miyake, D. Peebles, & R. P. Cooper (Eds.), Building bridges across cognitive sciences around the world: Proceedings of the 34th annual conference of the cognitive science society (p. 809-814). Austin, TX: Cognitive Science Society.

Navarro, D. J., Perfors, A., & Vong, W. K. (2013). Learning time-varying categories. Memory and Cognition, 41, 917-927.

Peirce, J. W. (2007). Psychopy - psychophysics software in python. Journal of Neuroscience Methods, 162, 8-13.
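The two-layered network described under "Model for Learning Time Varying Function" can be illustrated with a short Python sketch. It is only a minimal reading of the prediction and update equations stated in the text, not the authors' fitting code. The function names (`normalize01`, `run_model`, `rmsd`), the choice of the first stimulus as a stand-in for the participant's first response, and the demo sequence are all assumptions made for illustration.

```python
import math


def normalize01(values):
    """Rescale values linearly onto [0, 1], as done before model fitting."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def run_model(stimuli, eta, xi, w1=0.0, w2=0.0, first_input=None):
    """One pass of the two-input, one-output network over a stimulus sequence.

    Prediction:  x_hat_t = w1 * 1 + w2 * x_{t-1}
    Updates:     dw1 = eta * exp(-xi * (t - 1)) * (x_t - x_hat_t)
                 dw2 = eta * exp(-xi * (t - 1)) * (x_t - x_hat_t) * x_{t-1}
    Returns (predictions, w1, w2).
    """
    # Assumption: with no human data at hand, the first stimulus stands in
    # for the participant's first response as the initial input.
    prev = stimuli[0] if first_input is None else first_input
    predictions = []
    for t, target in enumerate(stimuli, start=1):
        prediction = w1 * 1.0 + w2 * prev
        predictions.append(prediction)
        error = target - prediction
        step = eta * math.exp(-xi * (t - 1))  # update size decays over trials
        w1 += step * error
        w2 += step * error * prev
        prev = target  # feedback reveals the true position
    return predictions, w1, w2


def rmsd(a, b):
    """Root-mean-square deviation between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


if __name__ == "__main__":
    # Hypothetical demo sequence that follows x_t = 0.3 + 0.5 * x_{t-1}.
    seq = [0.1]
    for _ in range(99):
        seq.append(0.3 + 0.5 * seq[-1])
    preds, w1, w2 = run_model(seq, eta=0.5, xi=0.0)
    print("RMSD over the last 50 trials:", rmsd(preds[50:], seq[50:]))
    print("learned weights:", w1, w2)
```

Setting `eta=0.0` with `w1=0.0` and `w2=1.0` makes the sketch repeat the preceding position on every trial, which is the "one step behind" pattern the paper reports for Experiment 3. Fitting `eta` and `xi` to a participant's responses would additionally require a search over the two parameters that minimizes `rmsd` between model and human predictions; the paper does not specify the optimizer, so none is shown here.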
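The three stimulus sequences and the correlation between successive stimuli, on which the paper's hypothesis rests, can also be sketched in Python. This is a hedged reconstruction from the formulas given in the three Experiment sections. The helper names, the use of Python's `random` module, the seed, and the reading of "for each g" as "one new permutation z per element of g" are assumptions, so the exact values will differ from the sequences the participants saw.

```python
import math
import random


def rescale(values, lo=-15.0, hi=15.0):
    """Linearly map values onto [lo, hi], the display range used in the paper."""
    v_min, v_max = min(values), max(values)
    return [lo + (hi - lo) * (v - v_min) / (v_max - v_min) for v in values]


def exp1_stimuli(rng, n=100):
    # x_t = t + eps_t, with eps_t drawn from Uniform(-0.5, 0.5)
    return [t + rng.uniform(-0.5, 0.5) for t in range(1, n + 1)]


def exp2_stimuli(n=100):
    # x_t = 50 + (-1)^t * sqrt(100 - t)
    return [50 + (-1) ** t * math.sqrt(100 - t) for t in range(1, n + 1)]


def exp3_stimuli(rng):
    # y_t = g[a] + z[b + 1], a = floor((t + 4) / 5), b = t mod 5,
    # with 1-based indexing as in the paper; t runs from 1 to 100.
    g = list(range(1, 97, 5))  # [1, 6, 11, ..., 96]
    rng.shuffle(g)
    z_blocks = []
    for _ in g:  # assumption: one fresh permutation z per element of g
        z = [1, 2, 3, 4, 5]
        rng.shuffle(z)
        z_blocks.append(z)
    out = []
    for t in range(1, 101):
        a = (t + 4) // 5
        b = t % 5
        out.append(g[a - 1] + z_blocks[a - 1][b])  # z[b + 1] in 1-based terms
    return out


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def lag1_correlation(xs):
    """Pearson correlation between x_{t-1} and x_t."""
    return pearson(xs[:-1], xs[1:])


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed, for reproducibility only
    print("Exp 1:", lag1_correlation(rescale(exp1_stimuli(rng))))
    print("Exp 2:", lag1_correlation(rescale(exp2_stimuli())))
    print("Exp 3:", lag1_correlation(rescale(exp3_stimuli(rng))))
```

Because the rescaling is linear, it leaves the lag-1 correlation unchanged. The sketch should give a strongly positive value for Experiment 1, a strongly negative value for Experiment 2, and a clearly weaker positive value for Experiment 3, in line with the ordering the paper describes.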