                            Predicting early dropout students is a matter of checking
                            completed quizzes: the case of an online statistics module

                              Josep Figueroa-Cañas [0000-0002-6790-9142] and Teresa Sancho-Vinuesa [0000-0002-0642-2912]

                                                       Universitat Oberta de Catalunya, Barcelona, Spain
                                                           [jfigueroa, tsancho]@uoc.edu



Abstract. Higher education students who either do not complete the subjects they enrolled in or interrupt their studies indefinitely without obtaining a certificate, the so-called college dropout problem, remain a major concern for practitioners and researchers. Within a subject, early prediction of dropout students has helped teachers focus their interventions in order to reduce dropout rates. Several machine-learning techniques have been used to classify/predict dropout students, including tree-based methods, which are not the best performers but, in their favour, are easily interpretable. This study presents a procedure to identify dropout-prone students at an early stage in an online statistics module, based on decision tree models. Although the attributes initially considered in the creation of the trees were mainly related to quiz completion, participation in the forum and access to the bulletin board, the final models show that the first is the only attribute with significant discriminatory power. We have evaluated the classification performance by means of a validation set. Accuracy shows values above 90%, whereas recall and precision are slightly under 90%.

                                    Keywords: Dropout prediction, decision trees, quiz completion, online educa-
                                    tion.


                          1         Introduction

Among education practitioners and researchers, students who do not complete a single module/subject or who indefinitely interrupt their studies without having achieved the certificate have been a matter of considerable concern for a long time. These students are usually called dropout students. In online courses, the high dropout rates justify the abundant research on this particular topic, as shown in the extensive review of [1], where 159 studies published between 1999 and 2009 were analysed. More recently, in the European framework, reducing the dropout rate in higher education is considered a key strategy to attain the ambitious objective of at least 40% of people in their thirties having completed higher education studies by 2020 [2]. Concerned as teachers and guided by this European strategy, the authors have decided to carry out research on dropout students in the statistics module at the Universitat Oberta de Catalunya.




Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.




In a higher education context, two levels of dropout can be differentiated: (a) micro-level dropout, and (b) macro-level dropout. In the former, dropout takes place inside the module or subject [3], where teachers can intervene, provided they have the relevant information at an early stage, in order to reduce it. In line with that, Burgos [4] shows a reduction of 14% in student dropout rates by means of a tutoring action plan applied after the dropout-prone students have been identified early. In macro-level dropout, withdrawal from studies occurs, in general, outside the subjects, so that interventions are the responsibility of staff other than the teachers of the subject.
The main purpose of the present study is to design a procedure to identify as many dropout-prone students as possible in an online statistics module, as soon as possible. This procedure is based on the prediction/classification provided by binary conditional decision trees generated at several points in time throughout the module, from data related mainly to quiz completion and to participation in both the online forum and the bulletin board.


                          2         Literature review

According to [1], there is no consensus on the definition of either micro-level or macro-level dropout. With regard to the latter, even online and face-to-face universities do not share the same dropout definition [5]. Grau-Valldosera [5] claims that the period a student may spend without any enrolled subjects before being considered a dropout has to be longer in an online university than in a face-to-face one, because of the students' characteristics.
As illustrations of micro-level dropout definitions, we have chosen the three that follow. First, Liu [6] straightforwardly associates subject dropout with subject failure: dropout students are those who do not attain A, B, or C, that is, those who fail the subject. Second, Levy [7] defines dropout students as those who do not complete the subject and whose tuition fees have not been refunded. And third, Dupin-Bryant [8] considers dropout students as those who are non-completers, understood in a broad sense.
The studies about dropout students by Cohen [3], Burgos [4], Costa [9], Santana [10], Lykourentzou [11], Lara [12] and Kotsiantis [13] are focused on the micro level (university subjects), all in an online environment except [3], which is blended. In addition, all of them are concerned with early prediction and show considerably high values of several evaluation measures of classification performance, such as accuracy, recall, precision or F1-measure. Cohen [3] reports a maximum precision of 80%, Burgos [4] a recall of 96.73%, Costa [9] a maximum F1-measure of 82%, Santana [10] a maximum accuracy of 86%, Lykourentzou [11] a maximum recall of 95% and Kotsiantis [13] a maximum accuracy of 83.89%. Lara [12] found an accuracy above 90%, a figure that is "a very acceptable percentage for the problem domain" [12, p. 31]. In the following four paragraphs, we present a comparative review of [3-4, 9-14] regarding dropout definition, single/multiple prediction instants of time, attributes selected as predictors, and the classification method used to carry out the prediction.








                              The dropout definition from the failure perspective [6] is the one used in the studies
                          of Cohen [3], Costa [9] and Santana [10]. The definition of Levy [7] is explicitly men-
                          tioned in Lykourentzou [11], who adds another requirement: that the dropout student
                          has to access the e-learning platform at least once throughout the subject duration. That
                          means the student has to leave a trace in the information system before leaving the
                          subject in order to be considered a dropout student. For Burgos [4] and Lara [12] stu-
                          dents who do not sit the final exam are those defined as dropout students. And finally,
                          Kotsiantis [13] does not precisely define the non-completer students.
Predicting at a single instant of time is the option chosen by Santana [10] and Kotsiantis [13]. The latter argues that the prediction has to be released before the subject is half over because otherwise it would not be useful for the teachers to intervene in time. Santana [10] predicts dropouts after the first exam, which also coincides with half of the subject duration. In contrast, multiple instants of time, albeit not the same ones, are contained in the proposals of [3-4, 9, 11-12]. Lykourentzou [11] releases predictions in each of the 7 sections into which the subject is divided. Similarly, Burgos [4] predicts in each of the 12 assessment activities. The proposals of [3, 9, 12], based mainly on regular time intervals, are slightly different: Cohen [3] predicts dropouts monthly in a one-semester course, Lara [12] weekly in 15-20 week courses, and finally Costa [9] also weekly in a 10-week course and after releasing the mid-course exam marks.
All the attributes employed in [3-4, 9-13] can be grouped into three main categories: demographics, usage of educational tools, and performance in assessment activities or exams. The first category is formed by time-invariant data available at the beginning of the course, whereas the other two categories include time-varying data which are incrementally collected throughout the course. Demographic attributes such as gender and professional information are used by [9-11, 13]. Some studies also consider other specific demographic attributes, like English language literacy [13]. The usage of educational tools in general, and participation in the forum in particular, is included in the set of attributes that form the models of Cohen [3], Costa [9], Santana [10], Lykourentzou [11] and Lara [12]. Finally, the marks attained in assessment activities or exams are analysed in the studies of Burgos [4], Costa [9], Santana [10], Lykourentzou [11] and Kotsiantis [13].
Regarding classification methods, apart from Cohen [3], who uses a unique method based on comparing changes in a student's attribute values with respect to the mean attribute values of the whole group of students, the studies of [4, 9-13] use a great variety of machine-learning techniques. Algorithms based on neural networks and support vector machines are common to [4, 9-13], whereas naive Bayes and decision tree classifiers are only employed by Costa [9], Santana [10] and Kotsiantis [13]. Finally, logistic regression is also included in the set of classifiers of Burgos [4], Lara [12] and Kotsiantis [13].
Although the study of Romero [14] does not explicitly mention the dropout problem, as it aims to predict the final performance of students by classing them as passed or failed, it could be deemed a dropout problem according to Liu's definition [6]. Moreover, like some of the references previously reviewed, an early prediction is released, and the usage of the forum is the source of information to feed the attributes. The study stands out for its comparison of the performance of 14 classification algorithms and reaches the conclusion that the sequential minimal optimization (SMO) algorithm, related to support vector machines, is the best performer. It is worth recalling that the studies of [3-4, 9-13] all included that machine-learning technique.
High dropout rates are also a major source of concern in Massive Open Online Courses [15] and, in order to reduce them, several studies have dealt with their early prediction [15-17]. These studies differ both in the variables included in their models and in the machine-learning methods used. First, whereas the studies by Ruiperez-Valiente [15] and Sharma [16] include the scores awarded after assignment submission, the study by Yang [17] only takes into account behaviour in the discussion forum. And second, prediction algorithms based on artificial neural networks are the ones chosen by Sharma [16], while Ruiperez-Valiente [15] implemented random forests, generalised boosted regression modelling, K-nearest neighbours and logistic regression, and Yang [17] used a survival model. Sharma [16] finds a relationship between students failing assignments and dropping out of the course.


                          3         Methodology

                          3.1       Participants and learning context

The participants in this study were the 197 students enrolled, in the first semester of the 2018/19 academic year, in a fully asynchronous online one-semester statistics module, which formed part of the Computer Engineering degree at the Universitat Oberta de Catalunya.
The teaching plan for this statistics module allowed students to complete optional quizzes (Quizzes) and constructed-response questions (R.Questions) that had to be solved using the statistical program R. Six different pairs (Quiz, R.Question), named continuous assessment tests, were scheduled throughout the semester. Quizzes were corrected and marked immediately, providing automated feedback, whereas R.Questions required manual teacher correction and feedback was delayed. The scores attained formed part of the continuous assessment mark, which could be included in the final mark. The module included two assessment instruments: (a) a compulsory in-person final exam, and (b) non-compulsory online continuous assessment throughout the semester. The final mark for the module was mainly based on the final exam mark, which could be modified slightly by the continuous assessment mark. In addition, during the first week teachers assigned an initial test to ascertain students' prior knowledge of secondary-education statistics. In order to encourage participation, students who voluntarily completed and submitted the test obtained a bonus, which also formed part of the continuous assessment mark.
An e-learning platform provided students enrolled in the statistics module of the Universitat Oberta de Catalunya with a communication tool (a forum) and an information tool (a bulletin board). The latter was used by teachers to upload course information, which was mostly accessible by students only via that bulletin board. The former allowed students and teachers to interact with each other, in general asynchronously. The e-learning platform also included direct access to the teaching plan, which contained precise information about the assessment system. All reading accesses to the bulletin board, forum and teaching plan, as well as writing accesses to the forum, were recorded by the information system of the Universitat Oberta de Catalunya.


                          3.2       Measure and data collection

The data were collected at four points in time, which coincide with the first four continuous assessment test submission deadlines, the only ones in the first half of the course. The separation between submission deadlines is variable, ranging from 1 to 3 weeks. We define four periods of time (Period.1, ..., Period.4) as follows: Period.1 is the interval of time between the first day of the semester and the first submission deadline, Period.2 is the interval of time between the first and second submission deadlines, and so on for Period.3 and Period.4.

During the first period (Period.1), we gathered students' registration data, such as the number of courses enrolled on in the semester and whether or not they were repeating students. This data, contained in the information system of the Universitat Oberta de Catalunya and anonymously delivered to us, filled the instances of the attributes Repeating and Enrolled_Courses (see Table.1). The Moodle activity log was the source of information to determine whether the student had submitted the initial test or not, and likewise the first continuous assessment test. With that data, the instances of the attributes Initial_Test, Quiz_Till_Period.1 and R.Question_Till_Period.1 were filled (see Table.1). The e-learning platform activity log provided the date and time of all accesses to the platform which, after being pre-processed, filled the instances of the attributes BBoard_Till_Period.1, Forum_Wr_Till_Period.1, Forum_Re_Till_Period.1 and Teaching_Plan_Viewed_Till_Period.1 (see Table.1). All the previous data, transferred to the second period (Period.2) and incremented with the specific information collected in Period.2, filled the attributes ending in _Till_Period.2. This procedure was repeated for Period.3 and Period.4 (see Table.1).

Table 1. Attributes for the Period.i, with i=1,..., 4

Name                                 Description                                            Type and Values
Repeating                            Indicates whether the student is repeating             Type: Boolean.
                                     the subject or not                                     Values: I.RP, N.RP
Enrolled_Courses                     Indicates the total number of courses enrolled         Type: Integer.
                                     on in the semester                                     Values: {1, ...}
Initial_Test                         Indicates whether the student has or has not           Type: Boolean.
                                     completed and submitted the initial test               Values: H.IT, N.IT
Teaching_Plan_Viewed_Till_Period.i   Indicates whether the student has or has not           Type: Boolean.
                                     viewed the teaching plan until the last day            Values: H.TPV, N.TPV
                                     of the Period.i
Quiz_Till_Period.i                   Indicates the number of quizzes completed and          Type: Integer.
                                     submitted until the last day of the Period.i           Values: {0, 1, ..., i}
R.Question_Till_Period.i             Indicates the number of R.Questions completed          Type: Integer.
                                     and submitted until the last day of the Period.i       Values: {0, 1, ..., i}
BBoard_Till_Period.i                 Indicates the number of periods in which the           Type: Integer.
                                     student has accessed the bulletin board until          Values: {0, 1, ..., i}
                                     the last day of the Period.i
Forum_Wr_Till_Period.i               Indicates the number of periods in which the           Type: Integer.
                                     student has written messages on the forum until        Values: {0, 1, ..., i}
                                     the last day of the Period.i
Forum_Re_Till_Period.i               Indicates the number of periods in which the           Type: Integer.
                                     student has read messages on the forum until           Values: {0, 1, ..., i}
                                     the last day of the Period.i
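As an illustration of the pre-processing described above (turning the platform access log into the per-period counters of Table.1), a minimal R sketch could look as follows; the deadline dates, the log layout and the variable names are invented for illustration, not taken from the study's information system.

  # Hypothetical sketch: from a raw access log to a Table.1-style counter
  # (here BBoard_Till_Period.3); all dates and identifiers are invented.
  deadlines <- as.Date(c("2018-10-10", "2018-10-24", "2018-11-14", "2018-11-28"))

  access_log <- data.frame(
    student_id = c("s1", "s1", "s2"),
    tool       = c("bboard", "bboard", "forum_read"),
    date       = as.Date(c("2018-09-25", "2018-11-02", "2018-10-01"))
  )

  # Period of each access: 1 + number of submission deadlines already passed.
  access_log$period <- sapply(access_log$date, function(d) 1 + sum(d > deadlines))

  # Number of distinct periods (up to Period.3) with bulletin board access, per student.
  bboard <- subset(access_log, tool == "bboard")
  BBoard_Till_Period.3 <- tapply(bboard$period, bboard$student_id,
                                 function(p) length(unique(p[p <= 3])))
  BBoard_Till_Period.3   # s1 = 2 (board accessed in Period.1 and Period.3)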
The attribute selection in our study is based on the references reviewed in Section 2 [4, 9]. Nevertheless, we have not considered the scores of the assessment activities as attributes, as [4] does. Instead, we have opted for the completion or non-completion of Quizzes and R.Questions. There are two main reasons for this decision. The first reason is that Quiz and R.Question submission data are available faster than definitive marks, since R.Questions are marked manually (as mentioned in section 3.1) and students may apply for marking reviews. The second reason is the likely high correlation between the completion of an assessment activity and its mark, as shown in a calculus module of the same degree and in a very similar educational context [18].

In the present study, we have defined, based on [7], a dropout student as a student who attains a final mark of "Not Completed", which means the student has not taken the compulsory final exam. That approach is in line with that of [4, 12]. The boolean response variable Y=Dropout indicates whether the student complies (I.Dropout) or not (I.Completer) with the previous definition, that is, whether they belong to the dropout student or to the completer student class. To fill the instances of that variable, the information system of the Universitat Oberta de Catalunya anonymously delivered the final marks to us. By combining the attributes, as predictors, with the response variable Y, four data sets are available, each of them containing the instances of the attributes of one period and the instances of the Dropout variable.
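To make the structure of one of these data sets concrete, the toy R data frame below mimics the Period.3 data set; the column names follow Table.1 and the response definition above, but all values are invented and the real data set contains the 197 enrolled students.

  # Toy version of the Period.3 data set: Table.1 attributes plus the response Y = Dropout.
  period3_data <- data.frame(
    Repeating                          = factor(c("N.RP", "I.RP", "N.RP")),
    Enrolled_Courses                   = c(3L, 1L, 2L),
    Initial_Test                       = factor(c("H.IT", "N.IT", "H.IT")),
    Teaching_Plan_Viewed_Till_Period.3 = factor(c("H.TPV", "N.TPV", "H.TPV")),
    Quiz_Till_Period.3                 = c(3L, 1L, 3L),   # quizzes submitted so far (0..3)
    R.Question_Till_Period.3           = c(3L, 0L, 2L),   # R.Questions submitted so far (0..3)
    BBoard_Till_Period.3               = c(3L, 1L, 2L),   # periods with bulletin board access
    Forum_Wr_Till_Period.3             = c(1L, 0L, 0L),   # periods with forum posts
    Forum_Re_Till_Period.3             = c(3L, 0L, 2L),   # periods with forum reads
    Dropout                            = factor(c("I.Completer", "I.Dropout", "I.Completer"))
  )
  str(period3_data)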


                          3.3       Classification method

We pose a classification problem, the result of which will be a binary classification model, or binary classifier, used to predict whether a student will be classed as a dropout student or a completer student at the end of the semester. In addition, we require the classifier to be easily interpretable, even at the expense of it not being the best performer in terms of the usual evaluation measures of classification performance, such as accuracy, precision or recall. Due to the fact that "tree-based methods are simple and useful for interpretation" [19, p. 303], we have decided to use those classification methods in our study. Basically, a binary decision tree is an oriented graph that starts in a node called the root, follows through arcs called branches, and ends in terminal nodes called leaves. Each nonterminal node, including the root, represents an attribute (a test on the attribute), and each leaf represents one of the two classes (dropout student or completer student) or the proportion of students that belong to each class. The branches that come out of a node represent the values of the attribute associated with the node (the answer to the test on the attribute) [20].
Given our attribute selection (see Table.1), we observe that not all the attributes have the same number of possible values. A widely identified issue in studies using decision tree models is a bias, when creating the nodes, towards attributes with a large number of possible values [21]. Conditional tree models mitigate that bias [21], and for that reason those models are the classification methods we have chosen. To grow our conditional trees, we have used the ctree() function provided by the statistical program R.
For each of the four data sets a classification model has been built (Model.1, Model.2, Model.3, Model.4). In order to evaluate the performance of the models, firstly the whole data set is split into two mutually exclusive sets: the training set and the validation set. Secondly, the classification model is fitted with the training set. And finally, the evaluation of the performance is carried out using the validation set [19]. In our study, we have conducted a random stratified split into a training set (80% of the whole set) and a validation set (20%), keeping the same class distribution as the whole set in each subset.
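The following R sketch illustrates this workflow on the toy period3_data frame from section 3.2; ctree() is taken here from the partykit package, and the seed, split code and formula are our own illustrative choices rather than the exact code used in the study.

  # Sketch of the fitting workflow: stratified 80/20 split, then a conditional
  # inference tree. With the toy 3-row frame this is only structural; the real
  # study used the 197-student data sets.
  library(partykit)   # provides ctree()

  set.seed(2019)

  # Stratified split: sample 80% of the rows within each Dropout class so that
  # training and validation keep the class distribution of the whole set.
  by_class  <- split(seq_len(nrow(period3_data)), period3_data$Dropout)
  train_idx <- unlist(lapply(by_class, function(rows)
    rows[sample.int(length(rows), size = round(0.8 * length(rows)))]))

  training   <- period3_data[train_idx, ]
  validation <- period3_data[-train_idx, ]

  # Binary conditional inference tree on all attributes of the period.
  model3 <- ctree(Dropout ~ ., data = training)
  print(model3)   # in the study, each fitted tree reduces to a single split on Quiz_Till_Period.i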
                              Taking into account that our main purpose was to identify dropout-prone students,
                          we have considered students predicted as dropouts, that is, those whose predicted class
                          is I.Dropout, as “Positive” cases, and the others, those whose predicted class is I.Com-
                          pleter, as “Negative” cases. Moreover, as usual, we differentiate between “True” or
                          “False” depending on whether the predicted class coincides with the observed class or
                          not, respectively. Table.2 depicts the four possible pairs when applying the validation
                          set to the model fitted with the training set.

                                              Table 2. Possible pairs in terms of predicted and observed classes
                                                                      Predicted class I.Dropout              Predicted class I.Completer
                              Observed class I.Dropout                    True Positive (TP)                       False Negative (FN)
                              Observed class I.Completer                  False Positive (FP)                      True Negative (TN)
In our study we have decided to use three evaluation measures of the classification performance: Accuracy (1), Precision (2) and Recall (3), according to the following definitions [19]:

    Accuracy = (TP + TN) / (TP + TN + FN + FP)                                 (1)

    Precision = TP / (TP + FP)                                                 (2)

    Recall = TP / (TP + FN)                                                    (3)
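As a sketch of how Table.2 and the measures (1)-(3) can be obtained in R, assuming the hypothetical model3 and validation objects from the previous sketch:

  # Sketch: validation-set evaluation with I.Dropout as the "Positive" class.
  pred <- predict(model3, newdata = validation)   # predicted classes
  obs  <- validation$Dropout                      # observed classes

  TP <- sum(pred == "I.Dropout"   & obs == "I.Dropout")
  TN <- sum(pred == "I.Completer" & obs == "I.Completer")
  FP <- sum(pred == "I.Dropout"   & obs == "I.Completer")
  FN <- sum(pred == "I.Completer" & obs == "I.Dropout")

  accuracy  <- (TP + TN) / (TP + TN + FN + FP)   # Eq. (1)
  precision <- TP / (TP + FP)                    # Eq. (2)
  recall    <- TP / (TP + FN)                    # Eq. (3)
  round(c(Accuracy = accuracy, Precision = precision, Recall = recall), 3)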








                          4         Results

The four classifiers created from the training set are extremely simple: each of them contains a single node. Table.3 depicts each model as a decision rule. The only attribute appearing in the models, a result that reveals it is the one with the strongest association with the response Dropout [20], is the completion of Quizzes (Quiz_Till_Period.i). Using the first three models (Model.1, Model.2, Model.3), students are classified/predicted as dropout students (class I.Dropout) if they have not completed all the Quizzes scheduled until the end of the period associated with the model, in other words, if they have not completed one or more of those Quizzes. As an example, at the end of Period.3, students who have not completed all three Quizzes corresponding to the first three continuous assessment tests are classified/predicted as dropout students. Only students who have completed all three Quizzes are classified/predicted as completer students (class I.Completer). In Model.4, the last condition is softened, so that students are classified/predicted as completer students even if they have not completed all four Quizzes: they may have skipped at most one Quiz.

                                   Table 3. Decision rules for the four models: Model.1, Model.2, Model.3, Model.4
                              Model.1*:                                                    Model.2*:

                              IF Quiz_Till_Period.1=1                                      IF Quiz_Till_Period.2=2
                                  THEN I.Completer                                            THEN I.Completer
                                  ELSE I.Dropout                                              ELSE I.Dropout

                              Model.3*:                                                    Model.4*:

                              IF Quiz_Till_Period.3=3                                      IF Quiz_Till_Period.4 >2
                                 THEN I.Completer                                             THEN I.Completer
                                 ELSE I.Dropout                                               ELSE I.Dropout

                          * p-value < 0.001. H0: D(Dropout | Quiz_Till_Period.i) = D(Dropout), that is, H0: The response
                          Dropout is independent of the predictor Quiz_Till_Period.i
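Read as code, the rules of Table.3 reduce to simple threshold checks on the number of completed Quizzes. The helper below is a hypothetical transcription of those rules, not part of the study's code.

  # Hypothetical transcription of the Table.3 decision rules: for periods 1-3
  # the student must have completed every scheduled Quiz, while Model.4
  # tolerates one missing Quiz out of four (Quiz_Till_Period.4 > 2).
  classify_student <- function(quizzes_completed, period) {
    stopifnot(period %in% 1:4, quizzes_completed <= period)
    threshold <- if (period == 4) 3 else period
    if (quizzes_completed >= threshold) "I.Completer" else "I.Dropout"
  }

  classify_student(quizzes_completed = 2, period = 3)  # "I.Dropout"
  classify_student(quizzes_completed = 3, period = 4)  # "I.Completer"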
Using the validation set, three results stand out in the evaluation measures of the classification performance (see Table.4). First, Accuracy shows a gradual increase from the first model and, in Model.3, passes the figure of 90%, which is considered acceptable by [12]. Second, Precision grows from the second model onwards and reaches 100% in Model.4. And third, Recall also rises from the first model and reaches its highest value in Model.3, just when Accuracy attains the level of "acceptable".








                                                                   Table 4. Performance measures

                                                                             Recall           Precision             Accuracy
                                                         Model.1             53.8%             87.5%                 82.9%
                                                         Model.2             61.5%             80.0%                 82.9%
                                                         Model.3             84.6%             84.6%                 90.2%
                                                         Model.4             84.6%             100%                  95.1%


                          5           Discussion

From the very beginning we aimed to find a classification model that was easily interpretable, even at the expense of not finding the best performing classifier. The four classification models (see Table.3) entirely comply with that requirement. In the rest of the section, we discuss the following three statements: (a) the completion of evaluative quizzes is the only attribute that determines the classification process, (b) the simplicity of the models eases the creation of an overall classification procedure that includes all the models, and (c) when the models are applied separately, Model.3 is the best.
Above all, it is worth noting that only one attribute, Quiz_Till_Period.i, intervenes in the classification process, as the four models show (see Table.3). The Quiz_Till_Period.i attribute, directly related to the completion of Quizzes, is basically evaluative in character, which sets it apart from the attributes related to the usage of the e-learning platform, such as the forum. The dominance of evaluative attributes is in line with the study of Costa [9], who found that the most important attribute was the midterm marks. On the other hand, completion of R.Questions is likewise evaluative in character, but nonetheless does not intervene in the final models. The main difference between Quizzes and R.Questions lies in the fact that the latter require students to apply higher-level skills than the former. Consequently, it seems reasonable to argue that students who do not even complete the least demanding assessment assignments, such as Quizzes, are the most prone to becoming dropout students. And last but not least, the first three models separate dropout and completer students depending on whether or not they have completed all the Quizzes scheduled up to the moment the model is applied. Therefore, we can interpret that the continued "doing" of Quizzes is the relevant aspect in differentiating those who complete the course from those who do not.
The simplicity of the model reduces the volume of information actually being used to only that related to the completion of Quizzes, which in turn entails two beneficial consequences: (a) the obvious elimination of the time spent gathering and processing the rest of the attributes, and (b) the teachers themselves can collect the required data directly from the Moodle activity log. Using the first three models in cascade, after the first continuous assessment test submission deadline, the teacher can create a list of dropout-prone students by selecting those who have not completed the first Quiz. After the second submission deadline, the teacher can add new dropout-prone students to the previous list by selecting those who have not completed the second Quiz, and likewise regarding those who have not completed the third one. So, by following that simple procedure, the teacher adds step by step to the list of dropout-prone students, which can be useful when deciding on possible measures to change the predicted unsuccessful result.
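A minimal sketch of this cascading procedure, assuming the teacher has, for each deadline, the per-student counts Quiz_Till_Period.i, is given below; the data and names are invented for illustration.

  # Illustrative sketch of the cascading identification procedure described above.
  quiz_counts <- data.frame(
    student_id         = c("s1", "s2", "s3"),
    Quiz_Till_Period.1 = c(1, 0, 1),
    Quiz_Till_Period.2 = c(2, 0, 1),
    Quiz_Till_Period.3 = c(3, 1, 1)
  )

  update_dropout_prone <- function(current_list, quiz_counts, period) {
    col <- paste0("Quiz_Till_Period.", period)
    # Students who have not completed all Quizzes scheduled up to this period
    flagged <- quiz_counts$student_id[quiz_counts[[col]] < period]
    union(current_list, flagged)   # keep previously flagged students, add new ones
  }

  prone <- character(0)
  for (p in 1:3) prone <- update_dropout_prone(prone, quiz_counts, p)
  prone   # "s2" "s3": s2 missed the first Quiz, s3 missed the second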
The performance measures (see Table.4) indicate that, for Model.1, Precision is quite high but Recall is not, which can be interpreted as follows: a limited number of students classed as I.Dropout will eventually become completers, whereas a significant number of students classed as I.Completer will finally become dropouts. As a result, a limited number of students may be the target of unnecessary teacher intervention but, what is worse, a significant number of students will be outside the scope of teacher intervention, which would have been useful had they been correctly classified. Given that our purpose is to identify as many dropout-prone students as possible, Recall prevails over Precision. As a consequence, Model.1 turns out to be rather unsatisfactory. Model.2 is slightly more satisfactory than Model.1 because of its higher Recall, but Model.3 is the best option owing to its reasonably high values of Precision, Recall and also Accuracy (90.2%, which is therefore acceptable according to [12]). Moreover, Model.3 can be applied after the seventh week of the course, some way before the halfway point of the semester. And finally, because Model.4's Recall does not improve on that of Model.3, and given that our purpose included identification "as soon as possible", we can state that Model.3 is better than Model.4.


                          6         Conclusion and further research

The main contribution of the present study is to provide a simple and easy-to-use procedure, by means of several conditional tree-based classification models, to identify dropout-prone students before the halfway point of the semester. Firstly, it is simple because only a single attribute contributes to classifying students. That attribute is related to students' behaviour with respect to the completion of low-stakes assessment assignments, such as the quizzes posed by teachers, and not to the usage of the e-learning platform, like forum participation. And secondly, it is easy to use because simply knowing that a student has not completed one of the first three posed quizzes is enough to identify him/her directly as a dropout-prone student. Furthermore, because the information required is not only easily accessible by the teacher but also does not need to be processed, teachers can control the procedure by themselves and implement it once the first quiz has been submitted. If the performance measures are a serious concern for the teacher, the previous procedure can be modified while remaining simple and easy to use: it then consists of checking whether students have completed all of the first three quizzes and, if the answer is no, identifying the student as a dropout-prone student.
According to the methodology selected, the students belonging to the training set, with which the classification models have been fitted, and the students of the validation set, with which the performance has been evaluated, were all enrolled in the same academic year. This limitation could lead to further research. The studies of Lykourentzou [11], Lara [12] and Kotsiantis [13], which create the training set in one academic period and the test set in a different one, are references that it would be useful to bear in mind. A second aspect that could be included in further research is the extension of the identification procedure to fail-prone students [6], so that a richer approach to the dropout prediction problem could be achieved.


                          Acknowledgements

                          This paper has been partially supported by a Fundació IBADA grant. We would like to
                          thank Dr Laura Calvet and Mr Paul Garbutt for their valuable contributions in helping
                          to improve this study.


                          References
                            1. Lee, Y., Choi, J.: A review of online course dropout research: Implications for practice and
                               future research. Educ. Technol. Res. Dev. 59, 593–618 (2011).
                            2. Vossensteyn, H., Kottmann, A., Jongbloed, B., Kaiser, F., Cremonini, L., Stensaker, B.,
                               Hovdhaugen, E., Wollscheid, S.: Drop-Out and Completion in Higher Education in Europe
                               - Literature Review. (2015).
                            3. Cohen, A.: Analysis of student activity in web-supported courses as a tool for predicting
                               dropout. Educ. Technol. Res. Dev. 65, 1285–1304 (2017).
                            4. Burgos, C., Campanario, M.L., Peña, D. de la, Lara, J.A., Lizcano, D., Martínez, M.A.: Data
                               mining for modeling students’ performance: A tutoring action plan to prevent academic
                               dropout. Comput. Electr. Eng. 66, 541–556 (2018).
                            5. Grau-Valldosera, J., Minguillón, J.: Rethinking dropout in online higher education: The case
                               of the universitat oberta de catalunya. Int. Rev. Res. Open Distance Learn. 15, 290–308
                               (2014).
                            6. Liu, S., Gomez, J., Yen, C.-J.: Community College Online Course Retention and Final
                               Grade: Predictability of Social Presence. J. Interact. Online Learn. 8, 165–182 (2009).
                            7. Levy, Y.: Comparing dropouts and persistence in e-learning courses. Comput. Educ. 48,
                               185–204 (2007).
 8. Dupin-Bryant, P.A.: Pre-Entry Variables Related to Retention in Online Distance Education. Am. J. Distance Educ. 18, 199–206 (2011).
                            9. Costa, E.B., Fonseca, B., Santana, M.A., de Araújo, F.F., Rego, J.: Evaluating the effective-
                               ness of educational data mining techniques for early prediction of students’ academic failure
                               in introductory programming courses. Comput. Human Behav. 73, 247–256 (2017).
                           10. Santana, M.A., Costa, E.B., Neto, B.F.S., Silva, I.C.L., Rego, J.B.A.: A predictive model for
                               identifying students with dropout profiles in online courses. In: Workshop Proceedings of
                               the EDM 2015 International Conference on Educational Data Mining Vol 1446.
                           11. Lykourentzou, I., Giannoukos, I., Nikolopoulos, V., Mpardis, G., Loumos, V.: Dropout pre-
                               diction in e-learning courses through the combination of machine learning techniques. Com-
                               put. Educ. 53, 950–965 (2009).
                           12. Lara, J.A., Lizcano, D., Martínez, M.A., Pazos, J., Riera, T.: A system for knowledge dis-
                               covery in e-learning environments within the European Higher Education Area - Application
                               to student data from Open University of Madrid, UDIMA. Comput. Educ. 72, 23–36 (2014).
13. Kotsiantis, S.B., Pierrakeas, C.J., Pintelas, P.E.: Preventing Student Dropout in Distance Learning Using Machine Learning Techniques. In: Proceedings of the 7th International Conference on Knowledge-Based Intelligent Information and Engineering Systems, KES 2003, pp. 267–274, Oxford, UK (2003).








                           14. Romero, C., López, M.I., Luna, J.M., Ventura, S.: Predicting students’ final performance
                               from participation in on-line discussion forums. Comput. Educ. 68, 458–472 (2013).
15. Ruiperez-Valiente, J.A., Muñoz-Merino, P.J., Andújar, A., Delgado-Kloos, C.: Early Prediction and Variable Importance of Certificate Accomplishment in a MOOC. In: Proceedings of the European Conference on Massive Open Online Courses, pp. 263–272 (2017).
16. Sharma, K., Kidzinski, L., Jermann, P., Dillenbourg, P.: Towards Predicting Success in MOOCs: Programming Assignments. In: Proceedings of the European Stakeholder Summit on Experiences and Best Practices around MOOCs (EMOOCS), pp. 135–148 (2016).
17. Yang, D., Sinha, T., Adamson, D., Penstein Rose, C.: "Turn on, Tune in, Drop out": Anticipating Student Dropouts in Massive Open Online Courses. In: Proceedings of the 2013 NIPS Data-driven Education Workshop, pp. 1–8 (2013).
18. Figueroa-Cañas, J., Sancho-Vinuesa, T.: Investigating the relationship between optional quizzes and final exam performance in a fully asynchronous online calculus module. Interact. Learn. Environ. (2018).
                           19. James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning.
                               (2013).
20. Kotsiantis, S.B., Zaharakis, I.D., Pintelas, P.E.: Machine learning: A review of classification and combining techniques. Artif. Intell. Rev. 26, 159–190 (2006).
21. Hothorn, T., Hornik, K., Zeileis, A.: Unbiased recursive partitioning: A conditional inference framework. J. Comput. Graph. Stat. 15, 651–674 (2006).



