=Paper=
{{Paper
|id=Vol-2699/paper25
|storemode=property
|title=Performance Prediction of Elementary School Students in Search Tasks
|pdfUrl=https://ceur-ws.org/Vol-2699/paper25.pdf
|volume=Vol-2699
|authors=Roberto González-Ibañez,Luz Chourio-Acevedo,María Escobar-Macaya
|dblpUrl=https://dblp.org/rec/conf/cikm/Gonzalez-Ibanez20a
}}
==Performance Prediction of Elementary School Students in Search Tasks==
Roberto González-Ibañez<sup>a</sup>, Luz Chourio-Acevedo<sup>a,b</sup> and María Escobar-Macaya<sup>a</sup>

<sup>a</sup> Universidad de Santiago de Chile, Avenida Libertador Bernardo O’Higgins nº 3363. Estación Central, Santiago, Chile

<sup>b</sup> Centro Nacional de Desarrollo e Investigación en Tecnologías Libres, Avenida Humberto Carnevalli, Edificio CENDITEL, Mérida, Venezuela

Proceedings of the CIKM 2020 Workshops, October 19-20, 2020, Galway, Ireland

Email: roberto.gonzalez.i@usach.cl (R. González-Ibañez); luz.chourio@usach.cl (L. Chourio-Acevedo); maria.escobarm@usach.cl (M. Escobar-Macaya)

© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

===Abstract===
In the last two decades, the use of online resources in educational settings has seen unprecedented growth. Regrettably, students’ online inquiry competences (OIC) are not necessarily developed well enough to face problems in information-intensive domains. While different OIC development approaches have been proposed to address this situation, they fail to identify in a timely manner their effects on students’ OIC as applied to practical search scenarios. To address this drawback, in this article we study models to predict students’ search performance in the context of an OIC evaluation test. Our approach exploits demographic, behavioral, cognitive, and affective features to predict, at four points of the overall search process, whether students succeed or fail in finding relevant documents to accomplish a research task. Our preliminary results show that it is possible to anticipate the overall search performance of students with moderate accuracy at 25%, 50%, 75%, and 90% of search session progress. These findings illustrate potential benefits and limitations of using non-obtrusive aggregated signals to predict search performance in learning contexts in a timely manner.

===Keywords===
Search performance, prediction, classification, elementary school

===1. Introduction===
The Internet, and particularly the World Wide Web (WWW), has become the main resource for students who look for information to complete their school assignments. Although abundant, not all the content on the Web is curated [1]. This poses a major problem for students who may not be well equipped in terms of OIC. Indeed, knowing what information is needed and how to search for it (i.e., some component skills of OIC) is crucial to succeed in online research [2]. To tackle this problem, different approaches to help students in the development of OIC have been proposed [1, 3]. A fundamental limitation of these approaches is their inability to determine in a timely manner whether students will succeed or fail when engaging in actual search tasks.

In the context of OIC development, knowing in advance how a student will perform in a search task could be particularly useful to both educators and students. First, educators could offer opportune feedback and support to their students, thus avoiding late evaluations that are typically available only after tests are completed. Second, students themselves could be more aware of their own performance, which could help them to correct themselves or look for support. In educational contexts, prediction focuses on forecasting performance by estimating unknown values of variables that characterize students. Such values typically relate to performance, knowledge, and scores. Prediction can also be used to identify learning styles, determine whether a student will answer a question correctly, model knowledge changes, and determine non-observable learning variables [4].

In this article, we explore the possibility of anticipating students’ search performance by exploiting a set of demographic, behavioral, cognitive, and affective features through machine learning. The remaining sections of this article are organized as follows. First, we describe the methodological approach adopted for this work. Second, we present preliminary results. Finally, we conclude with a discussion of the results, their implications, and future work.

===2. Method===

====2.1. Dataset====
To conduct this study, we relied on a subset of the data collected as part of the iFuCo project [5]. Our sample contains search sessions from 350 Finnish students performing two independent research tasks in the context of an evaluation of OIC. A summary of demographic data of the students whose records are included in our study is presented in Table 1.

{| class="wikitable"
|+ Table 1: Demographic data of the students
! Attribute !! Description
|-
| Finnish cities || Tampere, Jyväskylä, Turku
|-
| Grades || Fifth and sixth
|-
| Ages || 12-13 years old
|-
| Girls || 48.18%
|-
| Boys || 51.82%
|}

Records in this dataset were captured through NEURONE (oNlinE inquiry expeRimentatiON systEm) [6]. This system offered a realistic simulation of a search engine operating on a controlled collection of web documents for each research task. The document collection was developed by the research team and comprised 20 web pages per task, three of them defined as relevant. These relevant documents were created by researchers, and all three had to be found in order to accomplish each research task.

The dataset contains various types of data, including behavioral, cognitive, affective, and demographic variables. Table 2 lists all the variables included in this dataset.

{| class="wikitable"
|+ Table 2: Dataset attributes
! Attribute !! Description
|-
! colspan="2" | Behavior (during the session)
|-
| Total.Time (TT) || Segment total time
|-
| Stay.Pag.Relv (SR) || Dwell time in relevant pages
|-
| Stay.Pag.NonRelv (SnR) || Dwell time in non-relevant pages
|-
| Query.Time (QT) || Query writing time
|-
| Count.Queries (CQ) || Number of queries
|-
| Q.Mod (QM) || Number of query modifications
|-
| Q.Entropy (QE) || Average query entropy
|-
| Total.Cover (TC) || Total coverage
|-
| Usf.Cover (UC) || Useful coverage (dwell time ≥ 30 seconds)
|-
| Relv.Coverage (RC) || Number of relevant pages visited
|-
| Clicks.Relv (CR) || Number of clicks within relevant pages
|-
| Clicks.NonRelv (CnR) || Number of clicks within non-relevant pages
|-
| Mouse.Mov.Relv (MR) || Number of mouse movements within relevant pages
|-
| Mouse.Mov.NonRelv (MnR) || Number of mouse movements within non-relevant pages
|-
| Scroll.Mov.Relv (SMR) || Number of scrolls within relevant pages
|-
| Scroll.Mov.NonRelv (SMnR) || Number of scrolls within non-relevant pages
|-
! colspan="2" | Demographic
|-
| Sex || Girl, Boy
|-
! colspan="2" | Affective (SAM-based scale [8])
|-
| Pos || Valence (Positive - Negative scale)
|-
| Cal || Activation (Calm - excited scale)
|-
! colspan="2" | Cognitive (Survey)
|-
| Prior.Knowledge (PK) || Prior knowledge on task topic (1 to 5 scale)
|-
| Perceived.Difficulty (PD) || Perceived task difficulty level (1 to 5 scale)
|-
| class || Pass (A), Fail (R)
|}

====2.2. Analysis procedure====
Our general approach to evaluating the feasibility of predicting search performance focuses on four moments within students’ search sessions: early (25%), middle (50%), late (75%), and close-to-end (90%). Based on this nominal division, we aim to compare different models in the task of classifying whether students will fail or succeed in the overall search task (i.e., binary classification).

To determine whether a student failed or succeeded in the search tasks, we relied on search score, a process-based measure defined in [7]. This measure accounts for both the success in finding relevant documents and the mistakes made during the search process. Search scores range from 0 to 5; we set a threshold of 3.3 because this value keeps the dataset roughly balanced between pass and fail cases. Thus, students with a score of 3.3 or higher were labeled as Pass (46%), whereas those below this threshold were labeled as Fail (54%).

Next, we normalized search sessions, which lasted a maximum of 8 minutes. Normalization was necessary to place all sessions on a common duration scale, expressed from 0% to 100%. We then generated four additional subsets of sessions based on the four moments stated above. As a result, the first subset contains session data of each student from 0% to 25%, the second subset contains data from 0% to 50%, and so forth. Each subset contains the Pass or Fail label computed at 100% of each search session (see Figure 1).

Figure 1: Subset generation based on normalized search sessions.

We followed the Knowledge Discovery in Databases (KDD) process with each dataset; thus, we performed data selection, preprocessing, transformation, data mining, and evaluation/interpretation to derive knowledge. To implement these stages, we used both Weka and R.

After preprocessing data, we ended up with a total of 660 full search sessions. For the purpose of this study, we discarded incomplete sessions and those with corrupted data. These problems were mainly caused by connection problems or incompatibility of browsers with NEURONE.

Once features were selected, preprocessed, and transformed, we created vectors of features containing aggregated session data (mostly behavioral) up to the corresponding interval (i.e., 25%, 50%, 75%, 90%). In addition, these vectors contained prior-session features from demographic, cognitive, and affective variables. Finally, Pass/Fail labels (i.e., class) were added. Overall, our vectors contained 21 features plus the class.

With these vectors, we proceeded to identify prominent features and build binary classifiers through different algorithms and approaches. Results achieved by these classifiers in the task of determining the pass/fail labels are presented in the following section.

===3. Results===
After building vectors in each subset, we ran automatic attribute evaluation in order to determine which features could contribute the most to the classification task. This procedure was conducted using two Weka algorithms, namely, CFSSubsetEval and InfoGainAttributeEval. As a result of this procedure, eight groups of features were identified, two per subset, as shown in Table 3. Additionally, we performed attribute scanning, which led us to discard or include other features in all four subsets. On the one hand, we discarded variables related to clicks in relevant and non-relevant pages since they neither improved nor worsened classification performance. In other words, their presence unnecessarily increased the dimensionality of the problem. On the other hand, we included cognitive measures (i.e., prior knowledge and perceived task difficulty) and an affective measure (Pos) as input variables to the search process [9].

{| class="wikitable"
|+ Table 3: Automatic attribute evaluation.
! Subset !! CFSSubsetEval !! InfoGainAttributeEval
|-
| 25% || TT, SnR, QE, TC, MR || TT, SnR, MR, QE, TC, Sex
|-
| 50% || TT, SR, SnR, QE, TC, MR, MnR || TT, SnR, MR, SR, MnR, QE, SR, TC, RC, Sex, UC
|-
| 75% || TT, SR, SnR, RC, MR || RC, MR, SnR, TT, SR, TC, SMR, MnR, Sex
|-
| 90% || TT, SR, SnR, RC, MR || RC, SR, MR, SnR, TT, TC, SMR, MnR, Sex
|}

Next, by combining the selected features (those in Table 3 and the positivity score (Pos)) following a brute-force approach, we built classifiers through linear regression, logistic regression, Naïve Bayes, JRIP, J48, random forest, multilayer perceptron, SMO RBF kernel, and SMO poly kernel. All models were trained and tested through 10-fold cross-validation. The classes in all cases were linked to the Pass/Fail labels computed at 100%; hence, our classifiers were actually prediction models attempting to determine the overall search performance of students. Results were compared in terms of precision, F-Measure, number of attributes, and area under the ROC curve (AUC). A summary of the best results achieved at each time point (in terms of AUC) is presented in Table 4.

{| class="wikitable"
|+ Table 4: Support metrics of the best models obtained (class = Pass/Fail).
! Session progress !! 25% !! 50% !! 75% !! 90%
|-
! Model
| Classification via Regression || Classification via Regression || Random Forest || Logistic Regression
|-
! # Features
| 11 || 10 || 6 || 10
|-
! Features
| TT, SR, TC, RC, UC, QM, QE, MR, SMR, Sex, Pos || TT, SR, TC, RC, UC, CQ, QM, Sex, PK, PD || SnR, TC, RC, MR, Sex, SMnR || TT, SnR, UC, QT, CQ, TC, RC, QE, SMR, MnR
|-
! Area under ROC curve (AUC)
| 0.736 || 0.770 || 0.827 || 0.866
|-
! Error (%e)
| 30.00% || 27.28% || 23.64% || 19.55%
|-
! Precision
| 0.690 || 0.734 || 0.760 || 0.792
|-
! F-Measure
| 0.669 || 0.691 || 0.783 || 0.790
|}

===4. Discussion===
As illustrated in Table 4, different models with different sets of features achieved the highest AUC at different time points. At an early stage of students’ search processes (i.e., 25%), our best model is based on linear regression over 11 features, with an AUC of 0.736 and an error of 30%. Then, at 50% of search sessions, the best model is also based on linear regression; however, the set of features is slightly different and performance increases by 4.6% in terms of AUC (from 0.736 to 0.770). Later on, at 75% of search progress, the best model is based on random forest over six features. In this case, AUC shows an increment of 12.36% with respect to our early-stage best model, and error is reduced by about 6.4 percentage points (from 30.00% to 23.64%). Finally, very late in students’ search sessions (i.e., 90%), the best model is based on logistic regression over 10 features. In this case, AUC is 0.866, whereas error is reduced to 19.55%.

The features used by these models involve time spent in relevant and non-relevant pages, query-related features, document coverage, and mouse movements, to name a few. In addition, we highlight that sex (i.e., a demographic feature) appears as a prominent feature used by our best performing models at 25%, 50%, and 75%.
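The relative AUC changes reported for Table 4 can be checked with a few lines of arithmetic. The sketch below is only an illustration: the values are copied from Table 4, and it assumes that the quoted percentages are relative changes with respect to the 25% model, while the error reduction is a plain difference in percentage points. The helper name `relative_gain_pct` is ours, not part of the study.

```python
# Values copied from Table 4 (best model at each point of session progress).
auc = {25: 0.736, 50: 0.770, 75: 0.827, 90: 0.866}
error_pct = {25: 30.00, 50: 27.28, 75: 23.64, 90: 19.55}


def relative_gain_pct(new: float, base: float) -> float:
    """Relative change of `new` with respect to `base`, in percent.

    Hypothetical helper; assumes the reported gains are relative changes.
    """
    return (new - base) / base * 100.0


gain_50 = relative_gain_pct(auc[50], auc[25])    # 50% model vs. 25% model
gain_75 = relative_gain_pct(auc[75], auc[25])    # 75% model vs. 25% model
error_drop_75 = error_pct[25] - error_pct[75]    # in percentage points

print(f"AUC gain at 50% vs. 25%: {gain_50:.2f}%")
print(f"AUC gain at 75% vs. 25%: {gain_75:.2f}%")
print(f"Error reduction at 75% vs. 25%: {error_drop_75:.2f} points")
```

Under these assumptions the script gives gains of roughly 4.62% and 12.36% and an error reduction of 6.36 percentage points, which agrees with the AUC figures quoted in the discussion.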
An affective feature (Pos, which expresses valence on a negative-positive scale) was also present in the best performing model at 25%. Likewise, prior knowledge on the topic (PK) and perceived task difficulty (PD) are used in the best performing model at 50%. We note that these particular input features, which are captured before search sessions start, seem to play some role in the way search processes are carried out. On the one hand, the fact that sex appears in three out of four models (Table 4) indicates that girls and boys may exhibit particular search patterns that could be linked to search performance. On the other hand, the presence of an affective feature (i.e., Pos) also supports the idea that searchers’ initial affective states may shape their search behaviors and their relevance assessments (e.g., participants in negative states being more systematic than those in positive states) [10, 9].

As expected, the earlier in the search process, the higher the level of uncertainty in predicting the overall search performance. Conversely, the later in the search process, the higher the level of certainty in determining whether students will succeed or fail once search sessions are completed. Despite the low performance of classification models at 25%, this result suggests that, to some extent, it is possible to predict students’ search performance in a timely manner. More interestingly, our best model is rather simple and relies on variables that can be captured easily in controlled and open environments (e.g., mouse actions, query formulation features, some demographic data).

As for limitations of our prediction approach, the fact that it is based on aggregated data at different moments of students’ search leads to data loss. Indeed, the history of students’ actions while searching for information (e.g., query formulation, page visit, scrolling actions, query reformulation, bookmarking, etc.) is compressed into single measures (e.g., means, sums, counts). Such chains of actions could be crucial to anticipate how students will perform in the short and long term. In this sense, our future work will concentrate on studying prediction approaches that take into account the dynamics of search behaviors. Among these approaches we consider Markovian models and SVM with string-based kernels.

===Acknowledgment===
The work described in this article was partially supported by the TUTELAGE project funded by the National Agency for Research and Development (ANID) (FONDECYT Regular, grant no. 1201610); the Vicerrectoría de Postgrado of the Universidad de Santiago de Chile; and the iFuCo project funded by the Academy of Finland (grant no. 294186) and ANID (grant no. AKA/EDU-03).

===References===
[1] F. Baji, Z. Bigdeli, A. Parsa, C. Haeusler, Developing information literacy skills of the 6th grade students using the big 6 model, Malaysian Journal of Library & Information Science 23 (2018) 1–15.

[2] S. Majid, S. Foo, Y. Chang, Appraising information literacy skills of students in Singapore, Aslib Journal of Information Management (2020).

[3] H. Zhang, C. Zhu, A study of digital media literacy of the 5th and 6th grade primary students in Beijing, The Asia-Pacific Education Researcher 25 (2016) 579–592.

[4] C. Romero, S. Ventura, Educational data mining: a review of the state of the art, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40 (2010) 601–618.

[5] M. Mikkila-Erdmann, E. Sormunen, T. Mikkonen, N. Erdmann, C. Kiili, M. Quintanilla, R. González-Ibáñez, P. Leppanen, M. Vauras, A comparative study on learning and teaching online inquiry skills in Finland and Chile, in: European Conference on Information Literacy (ECIL), volume 18, 2017, p. 2017.

[6] R. González-Ibáñez, D. Gacitúa, E. Sormunen, C. Kiili, Neurone: online inquiry experimentation system, Proceedings of the Association for Information Science and Technology 54 (2017) 687–689.

[7] E. Sormunen, R. González-Ibáñez, C. Kiili, P. H. Leppänen, M. Mikkilä-Erdmann, N. Erdmann, M. Escobar-Macaya, A performance-based test for assessing students’ online inquiry competences in schools, in: European Conference on Information Literacy, Springer, 2017, pp. 673–682.

[8] M. Bradley, P. Lang, Measuring emotion: the self-assessment manikin and the semantic differential, Journal of Behavior Therapy and Experimental Psychiatry 25 (1994) 49–59.

[9] R. González-Ibáñez, C. Shah, Performance effects of positive and negative affective states in a collaborative information seeking task, in: CYTED-RITOS International Workshop on Groupware, Springer, 2014, pp. 153–168.

[10] R. Sinclair, M. Mark, The effects of mood state on judgemental accuracy: Processing strategy as a mechanism, Cognition & Emotion 9 (1995) 417–438.
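As a supplementary illustration of the analysis procedure in Section 2.2, the labeling and subset-generation steps (Figure 1) can be sketched in Python. This is a minimal sketch under stated assumptions, not the pipeline used in the study (which relied on Weka and R): the `Event` record and its fields are hypothetical rather than NEURONE's log schema, each session is assumed to be normalized by its own duration, and the two aggregated features are placeholders for the 21 features of Table 2. Only the 3.3 threshold and the four cut points come from the text.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 3.3                    # search-score cut-off from Section 2.2
CUT_POINTS = (0.25, 0.50, 0.75, 0.90)   # early, middle, late, close-to-end


@dataclass
class Event:
    """One logged action; both fields are hypothetical, not NEURONE's schema."""
    t: float    # seconds elapsed since the session started
    kind: str   # e.g. "query" or "page_visit" (illustrative names)


def label(search_score: float) -> str:
    """Pass if the final (100%) search score reaches the threshold."""
    return "Pass" if search_score >= PASS_THRESHOLD else "Fail"


def truncate(events, duration, fraction):
    """Keep events from 0% up to `fraction` of the normalized session.

    Assumption: each session is normalized by its own duration.
    """
    return [e for e in events if e.t / duration <= fraction]


def build_subsets(events, duration, search_score):
    """Return one aggregated feature vector per cut point.

    Every vector carries the class computed on the full session, so a
    classifier trained on it predicts the final outcome from partial data.
    """
    y = label(search_score)
    subsets = {}
    for fraction in CUT_POINTS:
        part = truncate(events, duration, fraction)
        subsets[fraction] = {
            "n_events": len(part),
            "n_queries": sum(e.kind == "query" for e in part),
            "class": y,
        }
    return subsets


if __name__ == "__main__":
    session = [Event(10, "query"), Event(90, "page_visit"),
               Event(190, "query"), Event(350, "page_visit")]
    for fraction, vector in build_subsets(session, 400, 3.3).items():
        print(f"{fraction:.0%}: {vector}")
```

The design point the sketch tries to capture is that truncation affects only the features: the class is always derived from the complete session, which is what turns each classifier into a predictor of the final outcome.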