Performance Prediction of Elementary School Students
in Search Tasks
Roberto González-Ibañez (a), Luz Chourio-Acevedo (a,b) and María Escobar-Macaya (a)
(a) Universidad de Santiago de Chile, Avenida Libertador Bernardo O’Higgins nº 3363, Estación Central, Santiago, Chile
(b) Centro Nacional de Desarrollo e Investigación en Tecnologías Libres, Avenida Humberto Carnevalli, Edificio CENDITEL, Mérida, Venezuela



Abstract
In the last two decades, the use of online resources in educational settings has seen unprecedented growth. Regrettably, students' online inquiry competences (OIC) are not necessarily well developed to face problems involving information-intensive domains. While different approaches to OIC development have been proposed to address this situation, they fail to identify their effects on students' OIC in practical search scenarios in a timely manner. To address this drawback, in this article we study models to predict students' search performance in the context of an OIC evaluation test. Our approach focuses on exploiting demographic, behavioral, cognitive, and affective features to predict, at four points of the overall search process, whether students succeed or fail in finding the relevant documents needed to accomplish a research task. Our preliminary results show that it is possible to anticipate the overall search performance of students with moderate accuracy at 25%, 50%, 75%, and 90% of the search session progress. These findings illustrate potential benefits and limitations of using non-obtrusive aggregated signals for timely prediction of search performance in learning contexts.

Keywords
Search performance, prediction, classification, elementary school


Proceedings of the CIKM 2020 Workshops, October 19-20, 2020, Galway, Ireland
email: roberto.gonzalez.i@usach.cl (R. González-Ibañez); luz.chourio@usach.cl (L. Chourio-Acevedo); maria.escobarm@usach.cl (M. Escobar-Macaya)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.


1. Introduction

The Internet, and particularly the World Wide Web (WWW), has become the main resource for students who look for information to complete their school assignments. Although abundant, not all the content on the Web is curated [1]. This poses a major problem for students who may not be well equipped in terms of online inquiry competences (OIC). Indeed, knowing what information is needed and how to search for it (i.e., some component skills of OIC) is crucial to succeed in online research [2]. To tackle this problem, different approaches to help students develop OIC have been proposed [1, 3]. A fundamental limitation of these approaches is their inability to determine in a timely manner whether students will succeed or fail when engaging in actual search tasks.

In the context of OIC development, knowing in advance how a student will perform in a search task could be particularly useful to both educators and students. First, educators could offer opportune feedback and support to their students, thus avoiding late evaluations typically available only after tests are completed. Second, students themselves could become more aware of their own performance, which could help them correct course or look for support. In educational contexts, prediction focuses on forecasting performance by estimating unknown values of variables that characterize students. Such values typically relate to performance, knowledge, and scores. Prediction can also be used to identify learning styles, determine whether a student will answer a question correctly, model knowledge changes, and determine non-observable learning variables [4].

In this article, we explore the possibility of anticipating students' search performance by exploiting a set of demographic, behavioral, cognitive, and affective features through machine learning. The remaining sections of this article are organized as follows. First, we describe the methodological approach adopted for this work. Second, we present preliminary results. Finally, we conclude with a discussion of the results, their implications, and future work.


2. Method

2.1. Dataset

To conduct this study, we relied on a subset of the data collected as part of the iFuCo project [5]. Our sample contains search sessions from 350 Finnish students performing two independent research tasks in the context of an evaluation of OIC. A summary of demographic data of the students whose records are included in our study is presented in Table 1.
Table 1
Demographic data of the students

  Finnish cities    Tampere, Jyväskylä, Turku
  Grades            Fifth and sixth
  Ages              12-13 years old
  Girls             48.18%
  Boys              51.82%

Records in this dataset were captured through NEURONE (oNlinE inquiry expeRimentatiON systEm) [6]. This system offered a realistic simulation of a search engine operating on a controlled collection of web documents for each research task. The document collection was developed by the research team and comprised 20 web pages per task, three of them defined as relevant. These relevant documents were created by the researchers, and all three were required to be found in order to accomplish each research task.

The dataset contains various types of data, including behavioral, cognitive, affective, and demographic variables. Table 2 lists all the variables included in this dataset.

Table 2
Dataset attributes

  Attribute                    Description

  Behavior (during the session)
  Total.Time (TT)              Segment total time
  Stay.Pag.Relv (SR)           Dwell time in relevant pages
  Stay.Pag.NonRelv (SnR)       Dwell time in non-relevant pages
  Query.Time (QT)              Query writing time
  Count.Queries (CQ)           Number of queries
  Q.Mod (QM)                   Number of query modifications
  Q.Entropy (QE)               Average query entropy
  Total.Cover (TC)             Total coverage
  Usf.Cover (UC)               Useful coverage (dwell time ≥ 30 seconds)
  Relv.Coverage (RC)           Number of relevant pages visited
  Clicks.Relv (CR)             Number of clicks within relevant pages
  Clicks.NonRelv (CnR)         Number of clicks within non-relevant pages
  Mouse.Mov.Relv (MR)          Number of mouse movements within relevant pages
  Mouse.Mov.NonRelv (MnR)      Number of mouse movements within non-relevant pages
  Scroll.Mov.Relv (SMR)        Number of scrolls within relevant pages
  Scroll.Mov.NonRelv (SMnR)    Number of scrolls within non-relevant pages

  Demographic
  Sex                          Girl, Boy

  Affective (SAM-based scale [8])
  Pos                          Valence (positive-negative scale)
  Cal                          Activation (calm-excited scale)

  Cognitive (survey)
  Prior.Knowledge (PK)         Prior knowledge on task topic (1 to 5 scale)
  Perceived.Difficulty (PD)    Perceived task difficulty level (1 to 5 scale)

  class                        Pass (A), Fail (R)

2.2. Analysis procedure

Our general approach to evaluate the feasibility of predicting search performance focuses on four moments within students' search sessions: early (25%), middle (50%), late (75%), and close-to-end (90%). Based on this nominal division, we aim to compare different models in the task of classifying whether students will fail or succeed in the overall search task (i.e., binary classification).

To determine whether a student failed or succeeded in the search tasks, we relied on the search score, a process-based measure defined in [7]. This measure accounts for both the success in finding relevant documents and the mistakes made during the search process. Since search scores range from 0 to 5, we defined a threshold of 3.3, set so as to keep a slightly balanced dataset of pass/fail cases. Thus, students with a score of 3.3 or higher were labeled as Pass (46%), whereas those below this threshold were labeled as Fail (54%).

Next, we normalized search sessions, which lasted a maximum of 8 minutes. Normalization was necessary to express all sessions on a common duration scale, from 0% to 100%. We then generated four additional subsets of sessions based on the four moments stated above. As a result, the first subset contains session data of each student from 0% to 25%, the second subset comprises data from 0% to 50%, and so forth. Each subset contains the Pass or Fail label computed at 100% of each search session (see Figure 1).

Figure 1: Subset generation based on normalized search sessions.
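To make the subset-generation step concrete, the sketch below shows how a session could be expressed on the common 0% to 100% scale, truncated at one of the four checkpoints, and labeled from its final search score. It is illustrative only: the preprocessing in this study was carried out with Weka and R, and the event structure (SearchEvent, timestampMs) is a hypothetical stand-in for the NEURONE logs.

    import java.util.List;
    import java.util.stream.Collectors;

    // Illustrative sketch only: the event model (SearchEvent, timestampMs, type) is
    // hypothetical; the paper's preprocessing was done with Weka and R.
    public class SessionSlicer {

        // One logged interaction event within a search session (assumed structure).
        record SearchEvent(String studentId, long timestampMs, String type) {}

        static final long MAX_SESSION_MS = 8 * 60 * 1000; // sessions lasted at most 8 minutes

        // Position of an event on the common 0-100% scale of its session.
        static double progressPercent(SearchEvent e, long sessionStartMs) {
            return 100.0 * (e.timestampMs() - sessionStartMs) / MAX_SESSION_MS;
        }

        // Keep only the events observed up to a checkpoint (25, 50, 75 or 90 percent).
        static List<SearchEvent> truncateAt(List<SearchEvent> session, long sessionStartMs,
                                            double checkpointPercent) {
            return session.stream()
                    .filter(e -> progressPercent(e, sessionStartMs) <= checkpointPercent)
                    .collect(Collectors.toList());
        }

        // Label the whole session from its final search score (0 to 5); the 3.3 threshold
        // keeps the Pass/Fail classes roughly balanced (46% vs. 54%).
        static String labelFromSearchScore(double searchScore) {
            return searchScore >= 3.3 ? "Pass" : "Fail";
        }
    }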
We followed the Knowledge Discovery in Databases (KDD) process with each dataset; thus, we performed data selection, preprocessing, transformation, data mining, and evaluation/interpretation to derive knowledge. To implement these stages, we used both Weka and R.

After preprocessing the data, we ended up with a total of 660 full search sessions. For the purpose of this study, we discarded incomplete sessions and those with corrupted data; these problems were mainly caused by connection issues or incompatibility of browsers with NEURONE.
Once features were selected, preprocessed, and transformed, we created vectors of features containing session data (mostly behavioral) aggregated up to the corresponding interval (i.e., 25%, 50%, 75%, 90%). In addition, these vectors contained prior-session features from demographic, cognitive, and affective variables. Finally, the Pass/Fail labels (i.e., the class) were added. Overall, our vectors contained 21 features plus the class.
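A minimal sketch of this aggregation step is given below, assuming a hypothetical PageVisit record and covering only a subset of the Table 2 attributes; the actual vectors were produced with the Weka/R pipeline described above.

    import java.util.List;

    // Hypothetical sketch of collapsing a truncated session into one feature vector;
    // attribute names follow Table 2, but the PageVisit record is an assumption and
    // only a subset of the attributes is shown.
    public class FeatureAggregator {

        record PageVisit(boolean relevant, long dwellMs, int mouseMoves, int scrolls) {}

        record FeatureVector(long stayPagRelv, long stayPagNonRelv,
                             int mouseMovRelv, int mouseMovNonRelv,
                             int scrollMovRelv, int scrollMovNonRelv,
                             // prior-session features captured before searching starts
                             String sex, int priorKnowledge, int perceivedDifficulty, int pos,
                             String label) {}

        static FeatureVector aggregate(List<PageVisit> visitsUpToCheckpoint,
                                       String sex, int pk, int pd, int pos, String label) {
            long sr = 0, snr = 0;
            int mr = 0, mnr = 0, smr = 0, smnr = 0;
            for (PageVisit v : visitsUpToCheckpoint) {
                if (v.relevant()) {
                    sr += v.dwellMs(); mr += v.mouseMoves(); smr += v.scrolls();
                } else {
                    snr += v.dwellMs(); mnr += v.mouseMoves(); smnr += v.scrolls();
                }
            }
            return new FeatureVector(sr, snr, mr, mnr, smr, smnr, sex, pk, pd, pos, label);
        }
    }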
With these vectors, we proceeded to identify prominent features and build binary classifiers through different algorithms and approaches. The results achieved by these classifiers in the task of determining the Pass/Fail labels are presented in the following section.


3. Results

After building vectors for each subset, we ran automatic attribute evaluation in order to determine which features could contribute the most to the classification task. This procedure was conducted using two Weka algorithms, namely CFSSubsetEval and InfoGainAttributeEval. As a result of this procedure, eight groups of features were identified, two per subset, as shown in Table 3. Additionally, we performed attribute scanning, which led us to discard or include other features in all four subsets. On the one hand, we discarded the variables related to clicks in relevant and non-relevant pages, since they neither improved nor worsened classification performance; in other words, their presence unnecessarily increased problem dimensionality in terms of features. On the other hand, we included the cognitive measures (i.e., prior knowledge and perceived task difficulty) and an affective measure (Pos) as input variables to the search process [9].

Table 3
Automatic attribute evaluation.

        CFSSubsetEval                   InfoGainAttributeEval
  25%   TT, SnR, QE, TC, MR             TT, SnR, MR, QE, TC, Sex
  50%   TT, SR, SnR, QE, TC, MR, MnR    TT, SnR, MR, SR, MnR, QE, TC, RC, Sex, UC
  75%   TT, SR, SnR, RC, MR             RC, MR, SnR, TT, SR, TC, SMR, MnR, Sex
  90%   TT, SR, SnR, RC, MR             RC, SR, MR, SnR, TT, TC, SMR, MnR, Sex
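For reference, the sketch below shows how these two evaluators can be invoked through Weka's Java API; the ARFF file name is a placeholder and the search methods (BestFirst, Ranker) are Weka defaults rather than necessarily the exact configuration used in this study.

    import java.util.Arrays;

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // Sketch of the two Weka attribute evaluators named in the text; "subset_25.arff"
    // is a placeholder for one of the four checkpoint datasets.
    public class AttributeEvaluation {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("subset_25.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1); // class = Pass/Fail

            // Correlation-based feature subset selection (CFSSubsetEval).
            AttributeSelection cfs = new AttributeSelection();
            cfs.setEvaluator(new CfsSubsetEval());
            cfs.setSearch(new BestFirst());
            cfs.SelectAttributes(data);
            System.out.println("CFS subset: " + Arrays.toString(cfs.selectedAttributes()));

            // Information-gain ranking of individual attributes (InfoGainAttributeEval).
            AttributeSelection ig = new AttributeSelection();
            ig.setEvaluator(new InfoGainAttributeEval());
            ig.setSearch(new Ranker());
            ig.SelectAttributes(data);
            System.out.println("InfoGain ranking: " + Arrays.toString(ig.selectedAttributes()));
        }
    }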
Next, by combining the selected features (those in Table 3 plus the positivity score, Pos) following a brute-force approach, we built classifiers through linear regression, logistic regression, Naïve Bayes, JRip, J48, random forest, multilayer perceptron, SMO with an RBF kernel, and SMO with a polynomial kernel. All models were trained and tested through 10-fold cross-validation. The classes in all cases were linked to the Pass/Fail labels computed at 100%; hence, our classifiers were actually prediction models attempting to determine the overall search performance of students. Results were compared in terms of precision, F-measure, number of attributes, and area under the ROC curve (AUC). A summary of the best results achieved at each time point (in terms of AUC) is presented in Table 4.

Table 4
Support metrics of the best models obtained (class = Pass/Fail).

                         25%                 50%                 75%             90%
  Model                  Classification      Classification      Random          Logistic
                         via Regression      via Regression      Forest          Regression
  # Features             11                  10                  6               10
  Features               TT, SR, TC, RC,     TT, SR, TC, RC,     TT, SnR, TC,    SnR, TC, RC, UC,
                         UC, QM, QE, SMR,    UC, CQ, QM, Sex,    RC, MR, Sex     QT, CQ, QE, MR,
                         MnR, Sex, Pos       PK, PD                              SMR, SMnR
  Area under ROC curve   0.736               0.770               0.827           0.866
  Error (%)              30.00               27.28               23.64           19.55
  Precision              0.690               0.734               0.760           0.792
  F-Measure              0.669               0.691               0.783           0.790
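A sketch of this evaluation loop using Weka's Java API is given below; the ARFF file name is a placeholder, only three of the candidate algorithms are included, and hyper-parameters are Weka defaults rather than the settings explored in the brute-force search.

    import java.util.Random;

    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.Logistic;
    import weka.classifiers.meta.ClassificationViaRegression;
    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // Sketch of the 10-fold cross-validation comparison; "subset_75.arff" is a
    // placeholder for one of the four checkpoint datasets.
    public class ModelComparison {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("subset_75.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1); // class = Pass/Fail

            Classifier[] candidates = {
                new ClassificationViaRegression(), new Logistic(), new RandomForest()
            };
            for (Classifier c : candidates) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(c, data, 10, new Random(1));
                System.out.printf("%-28s AUC=%.3f error=%.2f%% precision=%.3f F=%.3f%n",
                        c.getClass().getSimpleName(),
                        eval.weightedAreaUnderROC(),
                        eval.pctIncorrect(),
                        eval.weightedPrecision(),
                        eval.weightedFMeasure());
            }
        }
    }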
4. Discussion

As illustrated in Table 4, different models with different sets of features achieved the highest AUC at different time points. At an early stage of students' search processes (i.e., 25%), our best model is based on linear regression over 11 features, with an AUC of 0.736 and an error of 30%. Then, at 50% of the search sessions, the best model is also based on linear regression; however, the set of features is slightly different and performance increases by 4.6% in terms of AUC. Later on, at 75% of search progress, the best model is based on random forest over six features. In this case, performance in terms of AUC shows an increment of 12.36% with respect to our early-stage best model, and a reduction in error of almost 7% is noted. Finally, very late in students' search sessions (i.e., 90%), the best model is based on logistic regression over 10 features. In this case, the AUC is 0.866, whereas the error is reduced to 19.55%.

In this group there are features involving time spent in relevant and non-relevant pages, query-related features, document coverage, and mouse movements, to name a few. In addition, we highlight that sex (i.e., a demographic feature) appears as a prominent feature used by our best performing models at 25%, 50%, and 75%. Additionally, an affective feature (Pos, which expresses valence on a negative-positive scale) was present in the best performing model at 25%. Likewise, prior knowledge on the topic (PK) and perceived task difficulty (PD) are used in the best performing model at 50%. We note that these particular input features, which are captured before search sessions start, seem to play some role in the way search processes are carried out. On the one hand, the fact that sex appears in three out of four models (Table 4) indicates that girls and boys may exhibit particular search patterns that could be linked to search performance. On the other hand, the presence of an affective feature (i.e., Pos) also supports the idea that searchers' initial affective states may shape their search behaviors and their relevance assessments (e.g., participants in negative states being more systematic than those in positive states) [10, 9].

As expected, the earlier in the search process, the higher the level of uncertainty when predicting the overall search performance; conversely, the later in the search process, the higher the level of certainty in determining whether students will succeed or fail once search sessions are completed. Despite the lower performance of the classification models at 25%, these results show that, to some extent, it is possible to predict students' search performance in a timely manner. More interestingly, our best model is rather simple and relies on variables that can be captured easily in both controlled and open environments (e.g., mouse actions, query formulation features, some demographic data).

As for the limitations of our prediction approach, the fact that it is based on data aggregated at different moments of students' search leads to information loss. Indeed, the history of students' actions while searching for information (e.g., query formulation, page visits, scrolling actions, query reformulation, bookmarking, etc.) is compressed into single measures (e.g., means, sums, counts). Such chains of actions could be crucial to anticipate how students will perform in the short and long term. In this sense, our future work will concentrate on studying prediction approaches that take into account the dynamics of search behaviors. Among these approaches we consider Markovian models and SVMs with string-based kernels.

Acknowledgments

The work described in this article was partially supported by the TUTELAGE project funded by the National Agency for Research and Development (ANID) (FONDECYT Regular, grant no. 1201610); the Vicerrectoría de Postgrado of the Universidad de Santiago de Chile; and the iFuCo project funded by the Academy of Finland (grant no. 294186) and ANID (grant no. AKA/EDU-03).

References

[1] F. Baji, Z. Bigdeli, A. Parsa, C. Haeusler, Developing information literacy skills of the 6th grade students using the Big 6 model, Malaysian Journal of Library & Information Science 23 (2018) 1–15.
[2] S. Majid, S. Foo, Y. Chang, Appraising information literacy skills of students in Singapore, Aslib Journal of Information Management (2020).
[3] H. Zhang, C. Zhu, A study of digital media literacy of the 5th and 6th grade primary students in Beijing, The Asia-Pacific Education Researcher 25 (2016) 579–592.
[4] C. Romero, S. Ventura, Educational data mining: a review of the state of the art, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40 (2010) 601–618.
[5] M. Mikkila-Erdmann, E. Sormunen, T. Mikkonen, N. Erdmann, C. Kiili, M. Quintanilla, R. González-Ibáñez, P. Leppanen, M. Vauras, A comparative study on learning and teaching online inquiry skills in Finland and Chile, in: European Conference on Information Literacy (ECIL), volume 18, 2017, p. 2017.
[6] R. González-Ibáñez, D. Gacitúa, E. Sormunen, C. Kiili, Neurone: online inquiry experimentation system, Proceedings of the Association for Information Science and Technology 54 (2017) 687–689.
[7] E. Sormunen, R. González-Ibáñez, C. Kiili, P. H. Leppänen, M. Mikkilä-Erdmann, N. Erdmann, M. Escobar-Macaya, A performance-based test for assessing students' online inquiry competences in schools, in: European Conference on Information Literacy, Springer, 2017, pp. 673–682.
[8] M. Bradley, P. Lang, Measuring emotion: the self-assessment manikin and the semantic differential, Journal of Behavior Therapy and Experimental Psychiatry 25 (1994) 49–59.
[9] R. González-Ibáñez, C. Shah, Performance effects of positive and negative affective states in a collaborative information seeking task, in: CYTED-RITOS International Workshop on Groupware, Springer, 2014, pp. 153–168.
[10] R. Sinclair, M. Mark, The effects of mood state on judgemental accuracy: Processing strategy as a mechanism, Cognition & Emotion 9 (1995) 417–438.