
Performance Prediction of Elementary School Students in Search Tasks

Roberto González-Ibáñez (1)

roberto.gonzalez.i@usach.cl

Luz Chourio-Acevedo (0, 1)

luz.chourio@usach.cl

(0) Centro Nacional de Desarrollo e Investigación en Tecnologías Libres, Avenida Humberto Carnevalli, Edificio CENDITEL, Mérida, Venezuela

(1) Universidad de Santiago de Chile, Avenida Libertador Bernardo O'Higgins no. 3363, Estación Central, Santiago, Chile

In the last two decades, the use of online resources in educational settings has seen unprecedented growth. Regrettably, students' online inquiry competences (OIC) are not necessarily developed well enough to face problems in information-intensive domains. While different OIC development approaches have been proposed to address this situation, they fail to identify in a timely manner their effects on students' OIC as applied to practical search scenarios. To address this drawback, in this article we study models to predict students' search performance in the context of an OIC evaluation test. Our approach exploits demographic, behavioral, cognitive, and affective features to predict, at four points of the overall search process, whether students succeed or fail in finding relevant documents to accomplish a research task. Our preliminary results show that it is possible to anticipate the overall search performance of students with moderate accuracy at 25%, 50%, 75%, and 90% of search session progress. These findings illustrate potential benefits and limitations of using non-obtrusive aggregated signals to predict search performance in learning contexts in a timely manner.

Keywords: search performance, prediction, classification, elementary school

María Escobar-Macaya

1. Introduction

Internet, and particularly the World Wide Web (WWW), has become the main resource for students who look for information to complete their school assignments. Although abundant, not all the content on the Web is curated [1]. This poses a major problem for students who may not be well equipped in terms of OIC. Indeed, knowing what information is needed and how to search for it (i.e., some component skills of OIC) is crucial to succeed in online research [2]. To tackle this problem, different approaches to help students in the development of OIC have been proposed [1, 3]. A fundamental limitation of these approaches is their inability to timely determine whether students will succeed or fail when engaging in actual search tasks.

In the context of OIC development, knowing in advance how a student will perform in a search task could be particularly useful to both educators and students. First, educators could offer opportune feedback and support to their students, thus avoiding late evaluations typically available only after tests are completed. Second, students themselves could be more aware of their own performance, which could help them to correct themselves or look for support. In educational contexts, prediction focuses on forecasting performance by estimating unknown values of variables that characterize students. Such values typically relate to performance, knowledge, and scores. Prediction can also be used to identify learning styles, determine whether a student will answer a question correctly, model knowledge changes, and determine non-observable learning variables [4].

In this article, we explore the possibility of anticipating students' search performance by exploiting a set of demographic, behavioral, cognitive, and affective features through machine learning. The remaining sections of this article are organized as follows. First, we describe the methodological approach adopted for this work. Second, we present preliminary results. Finally, we conclude with a discussion of the results, their implications, and future work.

2. Method

2.1. Dataset

To conduct this study, we relied on a subset of the data collected as part of the iFuCo project [5]. Our sample contains search sessions from 350 Finnish students performing two independent research tasks in the context of an OIC evaluation. A summary of demographic data of the students whose records are included in our study is presented in Table 1.

Records in this dataset were captured through NEURONE (oNlinE inquiry expeRimentatiON systEm) [6]. This system offered a realistic simulation of a search engine operating on a controlled collection of web documents for each research task. The document collection was developed by the research team and comprised 20 web pages per task, three of them defined as relevant. These relevant documents were created by the researchers, and all three had to be found in order to accomplish each research task.

The dataset contains various types of data, including behavioral, cognitive, affective, and demographic variables. Table 2 lists all the variables included in this dataset.

Table 2: Variables included in the dataset.

| Group | Attribute | Description |
|---|---|---|
| Behavior (during the session) | Total.Time (TT) | Segment total time |
| | Stay.Pag.Relv (SR) | Dwell time in relevant pages |
| | Stay.Pag.NonRelv (SnR) | Dwell time in non-relevant pages |
| | Query.Time (QT) | Query writing time |
| | Count.Queries (CQ) | Number of queries |
| | Q.Mod (QM) | Number of query modifications |
| | Q.Entropy (QE) | Average query entropy |
| | Total.Cover (TC) | Total coverage |
| | Usf.Cover (UC) | Useful coverage (dwell time ≥ 30 seconds) |
| | Relv.Coverage (RC) | Number of relevant pages visited |
| | Clicks.Relv (CR) | Number of clicks within relevant pages |
| | Clicks.NonRelv (CnR) | Number of clicks within non-relevant pages |
| | Mouse.Mov.Relv (MR) | Number of mouse movements within relevant pages |
| | Mouse.Mov.NonRelv (MnR) | Number of mouse movements within non-relevant pages |
| | Scroll.Mov.Relv (SMR) | Number of scrolls within relevant pages |
| | Scroll.Mov.NonRelv (SMnR) | Number of scrolls within non-relevant pages |
| Demographic | Sex | Girl, Boy |
| Affective (SAM-based scale [8]) | Pos | Valence (Positive - Negative scale) |
| | Cal | Activation (Calm - Excited scale) |
| Cognitive (Survey) | Prior.Knowledge (PK) | Prior knowledge on task topic (1 to 5 scale) |
| | Perceived.Difficulty (PD) | Perceived task difficulty level (1 to 5 scale) |
| Class | class | Pass (A), Fail (R) |

2.2. Analysis procedure

Our general approach to evaluate the feasibility of predicting search performance focuses on four moments within students' search sessions: early (25%), middle (50%), late (75%), and close-to-end (90%). Based on this nominal division, we aim to compare different models in the classification task of whether students will fail or succeed in the overall search task (i.e., binary classification).

To determine whether a student failed or succeeded in the search tasks, we relied on search score, a process-based measure defined in [7]. This measure accounts for both the success in finding relevant documents and the mistakes made during the search process. Search scores range from 0 to 5, and we set a threshold of 3.3, a value chosen to keep the pass/fail cases roughly balanced. Thus, students with a score of 3.3 or higher were labeled as Pass (46%), whereas those below this threshold were labeled as Fail (54%).

Next, we normalized search sessions, which lasted a maximum of 8 minutes. Normalization was necessary to put all sessions on a common duration scale, expressed from 0% to 100%. We then generated four additional subsets of sessions based on the four moments stated above. As a result, the first subset contains session data of each student from 0% to 25%, the second contains data from 0% to 50%, and so forth. Each subset contains the Pass or Fail label computed at 100% of each search session (see Figure 1).

Figure 1: Subset generation based on normalized search sessions.

We followed the Knowledge Discovery in Databases (KDD) process with each dataset; thus, we performed data selection, preprocessing, transformation, data mining, and evaluation/interpretation to derive knowledge. To implement these stages, we used both Weka and R.

For the purpose of this study, we discarded incomplete sessions and those with corrupted data. These problems were mainly caused by connection problems or incompatibility of browsers with NEURONE. After preprocessing, we ended up with a total of 660 full search sessions.
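The labeling rule and the cumulative subsets described above can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the `Event` schema and function names are hypothetical, and the sketch assumes that progress is measured relative to each session's own duration.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 3.3                      # search-score cut-off reported in the text
CHECKPOINTS = (0.25, 0.50, 0.75, 0.90)    # early, middle, late, close-to-end


@dataclass
class Event:
    """One logged action (hypothetical schema, not NEURONE's actual format)."""
    t: float     # seconds since the session started
    kind: str    # e.g. "query", "click", "scroll"


def label(search_score: float) -> str:
    """Pass if the 0-5 search score reaches the threshold, otherwise Fail."""
    return "Pass" if search_score >= PASS_THRESHOLD else "Fail"


def cumulative_subsets(events, duration, search_score):
    """Return {checkpoint: (events from 0% up to that checkpoint, final label)}.

    Assumption: progress is t / duration, i.e. normalized per session.
    """
    final = label(search_score)           # always computed on the full session
    subsets = {}
    for cp in CHECKPOINTS:
        prefix = [e for e in events if e.t / duration <= cp]
        subsets[cp] = (prefix, final)
    return subsets
```

Note that every subset carries the label of the complete session, which is what turns the classifiers into predictors of the final outcome rather than descriptors of the partial session.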

Once features were selected, preprocessed, and transformed, we created vectors of features containing aggregated session data (mostly behavioral) up to the corresponding interval (i.e., 25%, 50%, 75%, 90%). In addition, these vectors contained prior-session features from demographic, cognitive, and affective variables. Finally, Pass/Fail labels (i.e., class) were added. Overall, our vectors contained 21 features plus the class.

With these vectors, we proceeded to identify prominent features and build binary classifiers through different algorithms and approaches. Results achieved by these classifiers in the task of determining the pass/fail labels are presented in the following section.

3. Results

After building vectors in each subset, we ran automatic attribute evaluation in order to determine which features could contribute the most to the classification task. This procedure was conducted using two Weka algorithms, namely, CFSSubsetEval and InfoGainAttributeEval. As a result of this procedure, eight groups of features were identified, two per subset, as shown in Table 3. Additionally, we performed attribute scanning, which led us to discard or include other features in all four subsets. On the one hand, we discarded variables related to clicks in relevant and non-relevant pages since they neither improved nor worsened classification performance. In other words, their presence unnecessarily increased problem dimensionality in terms of features. On the other hand, we included cognitive measures (i.e., prior knowledge and perceived task difficulty) and an affective measure (Pos) as input variables to the search process [9].

Next, by combining the selected features (those in Table 3 and positivity score (Pos)) following a brute-force approach, we built classifiers through linear regression, logistic regression, Naïve Bayes, JRIP, J48, random forest, multilayer perceptron, SMO RBF kernel, and SMO poly kernel. All models were trained and tested through 10-fold cross-validation. The classes in all cases were linked to the Pass/Fail labels computed at 100%, hence our classifiers were actually prediction models attempting to determine the overall search performance of students. Results were compared in terms of precision, F-Measure, number of attributes, and area under the ROC curve (AUC). A summary of the best results achieved at each time point (in terms of AUC) is presented in Table 4.

4. Discussion

As illustrated in Table 4, different models, with different sets of features, achieved the highest AUC at different time points. At an early stage of students' search processes (i.e., 25%), our best model is based on linear regression over 11 features with an AUC of 0.736 and an error of 30%. Then, at 50% of search sessions, the best model is also based on linear regression; however, the set of features is slightly different and performance increases by 4.6% in terms of AUC. Later on, at 75% of search progress, the best model is based on random forest over six features. In this case, performance in terms of AUC shows an increment of 12.36% with respect to our early-stage best model. Also, a reduction in error by almost 7% is noted. Finally, very late in students' search sessions (i.e., 90%), the best model is based on logistic regression over 10 features. In this case, AUC is 0.866, whereas error was reduced to 19.55%.

The features used by these best-performing models involve time spent in relevant and non-relevant pages, query-related features, document coverage, and mouse movements, to name a few. In addition, we highlight that sex (i.e., a demographic feature) appears as a prominent feature used by our best performing models at 25%, 50%, and 75%. Additionally, an affective feature (Pos, which expresses valence on a negative-positive scale) was present in the best performing model at 25%. Likewise, prior knowledge on the topic (PK) and perceived task difficulty (PD) are used in the best performing model at 50%. We note that these particular input features, which are captured before search sessions start, seem to play some role in the way search processes are carried out.
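The evaluation loop described in the Results section (brute-force feature combination, 10-fold cross-validation, comparison by AUC) can be sketched in plain Python. This is an illustrative sketch rather than the Weka/R pipeline used in the study: all function names are hypothetical, the folds are not stratified, and pooling out-of-fold scores into a single AUC is just one common convention.

```python
import random
from itertools import combinations


def feature_subsets(features, min_size=1):
    """Yield every combination of the candidate features (brute force)."""
    for k in range(min_size, len(features) + 1):
        yield from combinations(features, k)


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and split them into k near-equal test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def auc(scores, labels):
    """AUC as the probability that a random Pass outranks a random Fail.

    labels: 1 for Pass, 0 for Fail; tied scores count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one Pass and one Fail case")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cross_val_auc(rows, labels, fit, k=10):
    """Score each row with a model trained on the other folds, then pool.

    fit(train_rows, train_labels) must return a function row -> score.
    """
    scores = [0.0] * len(rows)
    for fold in k_fold_indices(len(rows), k):
        held_out = set(fold)
        train = [i for i in range(len(rows)) if i not in held_out]
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        for i in fold:
            scores[i] = model(rows[i])
    return auc(scores, labels)
```

With 21 candidate features an exhaustive search would cover 2^21 - 1 non-empty subsets, which is why restricting the brute-force step to the pre-selected features, as the text describes, keeps the search tractable.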

On the one hand, the fact that sex appears in three out of four models (Table 4) indicates that girls and boys may exhibit particular search patterns that could be linked to search performance. On the other hand, the presence of an affective feature (i.e., Pos) also supports the idea that searchers' initial affective states may shape their search behaviors and their relevance assessments (e.g., participants in negative states being more systematic than those in positive states) [9, 10].

As expected, the earlier in the search process, the higher the level of uncertainty in predicting the overall search performance. Conversely, the later in the search process, the higher the level of certainty in determining whether students will succeed or fail once search sessions are completed. Despite the low performance of classification models at 25%, these results suggest that, to some extent, it is possible to predict students' search performance in a timely manner. More interestingly, our best model is rather simple and relies on variables that can be captured easily in controlled and open environments (e.g., mouse actions, query formulation features, some demographic data).

As for limitations of our prediction approach, the fact that it is based on aggregated data at different moments of students' search leads to data loss. Indeed, the history of students' actions while searching for information (e.g., query formulation, page visit, scrolling actions, query reformulation, bookmarking, etc.) is compressed into single measures (e.g., means, sums, counts). Such chains of actions could be crucial to anticipate how students will perform in the short and long term.
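A small example makes this data loss concrete. The two action logs below are hypothetical (the event names are illustrative only): they differ in the order of actions, yet collapse to identical counts, whereas counting consecutive action pairs keeps them apart.

```python
from collections import Counter

# Two hypothetical action logs with the same actions in a different order.
session_a = ["query", "visit", "scroll", "query", "visit", "bookmark"]
session_b = ["query", "query", "visit", "visit", "scroll", "bookmark"]


def aggregate(actions):
    """Collapse a sequence of actions into per-action counts."""
    return Counter(actions)


def bigrams(actions):
    """Count consecutive action pairs, which preserves first-order ordering."""
    return Counter(zip(actions, actions[1:]))


same_counts = aggregate(session_a) == aggregate(session_b)   # True
same_order = bigrams(session_a) == bigrams(session_b)        # False
```

An aggregated feature vector cannot distinguish these two sessions, while a representation that retains transitions between actions can.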

In this sense, our future work will concentrate on studying prediction approaches that take into account the dynamics of search behaviors. Among these approaches, we consider Markovian models and SVM with string-based kernels.

Acknowledgment

The work described in this article was partially supported by the TUTELAGE project funded by the National Agency for Research and Development (ANID) (FONDECYT Regular, grant no. 1201610); the Vicerrectoría de Postgrado of the Universidad de Santiago de Chile; and the iFuCo project funded by the Academy of Finland (grant no. 294186) and ANID (grant no. …).

References

[1] Baji, Bigdeli, Parsa, Haeusler, Developing information literacy skills of the 6th grade students using the Big 6 model, Malaysian Journal of Library & Information Science 23 (2018) 1-15.

[2] Majid, Foo, Chang, Appraising information literacy skills of students in Singapore, Aslib Journal of Information Management (2020).

[3] Zhang, Zhu, A study of digital media literacy of the 5th and 6th grade primary students in Beijing, The Asia-Pacific Education Researcher 25 (2016) 579-592.

[4] Romero, Ventura, Educational data mining: a review of the state of the art, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40 (2010) 601-618.

[5] Mikkilä-Erdmann, Sormunen, Mikkonen, Erdmann, Kiili, Quintanilla, R. González-Ibáñez, P. Leppänen, Vauras, A comparative study on learning and teaching online inquiry skills in Finland and Chile, in: European Conference on Information Literacy (ECIL), volume 18, 2017, p. 2017.

[6] González-Ibáñez, Gacitúa, Sormunen, Kiili, NEURONE: online inquiry experimentation system, Proceedings of the Association for Information Science and Technology 54 (2017) 687-689.

[7] Sormunen, González-Ibáñez, Kiili, P. H. Leppänen, Mikkilä-Erdmann, Erdmann, Escobar-Macaya, A performance-based test for assessing students' online inquiry competences in schools, in: European Conference on Information Literacy, Springer, 2017, pp. 673-682.

[8] Bradley, Lang, Measuring emotion: the self-assessment manikin and the semantic differential, Journal of Behavior Therapy and Experimental Psychiatry 25 (1994) 49-59.

[9] González-Ibáñez, Shah, Performance effects of positive and negative affective states in a collaborative information seeking task, in: CYTED-RITOS International Workshop on Groupware, Springer, 2014, pp. 153-168.

[10] Sinclair, Mark, The effects of mood state on judgemental accuracy: Processing strategy as a mechanism, Cognition & Emotion 9 (1995) 417-438.