<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Predicting early dropout students is a matter of checking completed quizzes: the case of an online statistics module</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitat Oberta de Catalunya</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>100</fpage>
      <lpage>111</lpage>
      <abstract>
        <p>Higher education students who either do not complete the subjects they enrolled in or indefinitely interrupt their studies without certification, the so-called college dropout problem, continue to be a major concern for practitioners and researchers. Within the subjects, an early prediction of dropout students has helped teachers to focus their intervention in order to reduce dropout rates. Several machine-learning techniques have been used to classify/predict dropout students, including tree-based methods, which are not the best performers but, in their favour, are easily interpretable. This study presents a procedure to identify dropout-prone students at an early stage in an online statistics module, based on decision tree models. Although the attributes initially considered in the creation of the trees were mainly related to quiz completion, participation in the forum and access to the bulletin board, the final models show that the first of these is the only attribute with significant discriminatory power. We have evaluated the classification performance by means of a validation set. Accuracy shows values above 90%, whereas recall and precision are slightly under 90%.</p>
      </abstract>
      <kwd-group>
        <kwd>Dropout prediction</kwd>
        <kwd>decision trees</kwd>
        <kwd>quiz completion</kwd>
        <kwd>online education</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Among education practitioners and researchers, students who do not complete a single
module/subject or indefinitely interrupt their studies without having achieved the
certificate have been a matter of considerable concern for a long time. These students are
usually called dropout students. In online courses, the high dropout rates of students
justify the abundant research on this particular topic, as shown in the extensive review
of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], where 159 studies published between 1999 and 2009 were analysed. More
recently, in the European framework, reducing the dropout student rate in higher
education is considered a key strategy to attain the ambitious objective of not less than 40%
of people in their thirties who have completed higher education studies by 2020 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Concerned as teachers and guided by European strategy, the authors have decided to
carry out research on dropout students in the statistics module at the Universitat Oberta
de Catalunya.
      </p>
      <p>
        In a higher education context, two levels of dropout can be differentiated: (a) the
micro-level dropout, and (b) the macro-level one. In the former, dropout
takes place within the module or subject [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], where teachers can intervene to reduce it, provided they
have suitable information at an early stage. In line with that,
Burgos [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] shows a reduction of 14% in student dropout rates by means of a tutoring
action plan applied after the dropout-prone students have been identified early. In macro-level
dropout, withdrawal from studies occurs, in general, outside the subjects, so
interventions are the responsibility of staff other than the teachers of the
subject.
      </p>
      <p>The main purpose of the present study is to design a procedure to identify as many
dropout-prone students as possible in an online statistics module, as soon as possible.
This procedure is based on the prediction/classification provided by binary conditional
decision trees generated at several instants of time throughout the module, from data
related mainly to test completion and participation in both the online
forum and the bulletin board.</p>
    </sec>
    <sec id="sec-2">
      <title>Literature review</title>
      <p>
        According to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], there is an absence of consensus on the definition of both
micro-level dropout and macro-level dropout. With regard to the latter, even online and
face-to-face universities do not share the dropout definition [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Grau-Valldosera [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] claims
that the period a student may spend without any enrolled subjects before being deemed a
dropout has to be longer in an online university than in a face-to-face university because
of the students’ characteristics.
      </p>
      <p>
        As illustrations of the micro-level dropout definitions, we have chosen the three that
follow. First, Liu [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] straightforwardly associates subject dropout with subject failure.
Dropout students are those who do not attain A, B, or C, that is, those who fail the
subject. Second, Levy [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] defines dropout students as those who do not complete the
subject and whose tuition fees have not been refunded. And third, Dupin [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] considers
dropout students as those who are non-completers, understood in a broad sense.
      </p>
      <p>
        The studies about dropout students by Cohen [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Burgos [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Costa [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Santana
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], Lykourentzou [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], Lara [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and Kotsiantis [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] are focused on the micro level
(university subjects), all in an online environment except [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which is set in a blended one. In addition, all of
them are concerned with early prediction and show considerably high values of several
evaluation measures of classification performance, such as accuracy, recall, precision
or F1-measure. Cohen [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] reports a maximum precision of 80%, Burgos [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] a recall of
96.73%, Costa [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] a maximum F1-measure of 82%, Santana [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] a maximum accuracy
of 86%, Lykourentzou [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] a maximum recall of 95% and Kotsiantis [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] a maximum
accuracy of 83.89%. Lara [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] found an accuracy above 90%, a figure that is “a very
acceptable percentage for the problem domain” [12, p. 31]. In the following four
paragraphs, we present a comparative review of [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref3 ref4 ref9">3-4, 9-14</xref>
        ] regarding dropout
definition, single/multiple predicting instants of time, attributes selected as predictors and
classification method to carry out the prediction.
      </p>
      <p>
        The dropout definition from the failure perspective [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is the one used in the studies
of Cohen [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Costa [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Santana [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The definition of Levy [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is explicitly
mentioned in Lykourentzou [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], who adds another requirement: that the dropout student
has to access the e-learning platform at least once throughout the subject duration. That
means the student has to leave a trace in the information system before leaving the
subject in order to be considered a dropout student. For Burgos [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Lara [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
students who do not sit the final exam are defined as dropout students. And finally,
Kotsiantis [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] does not precisely define the non-completer students.
      </p>
      <p>
        Predicting in a single instant of time is the option chosen by Santana [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and
Kotsiantis [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The latter argues that prediction has to be released before the subject is half
over because otherwise it would not be useful for the teachers to intervene in time.
Santana [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] predicts dropouts after the first exam, which also coincides with half of
the subject duration. In contrast, multiple instants of time, albeit not the same ones, are
contained in the proposals of [
        <xref ref-type="bibr" rid="ref11 ref12 ref3 ref4 ref9">3-4, 9, 11-12</xref>
        ]. Lykourentzou [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] releases predictions
in each of the 7 sections into which the subject is divided. Similarly, Burgos [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] predicts
in each of the 12 assessment activities. The proposals of [
        <xref ref-type="bibr" rid="ref12 ref3 ref9">3,9,12</xref>
        ], based mainly on
regular time intervals, are slightly different: Cohen [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] predicts dropouts monthly, in a one
semester course, Lara [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] weekly in 15-20 week courses, and finally Costa [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] also
weekly in a 10-week course and after releasing the mid-course exam marks.
      </p>
      <p>
        All the attributes employed in [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref3 ref4 ref9">3-4, 9-13</xref>
        ] can be grouped into three main categories:
demographics, usage of educational tools, and assessment activities or exam
performance. The first category is formed by time-invariant data available at the beginning
of the course, whereas the other two categories include time-varying data which are
incrementally collected throughout the course. Demographic attributes such as gender
and professional information are used by [
        <xref ref-type="bibr" rid="ref10 ref11 ref13 ref9">9-11, 13</xref>
        ] alike. Some studies also consider
other specific demographic attributes, like English language literacy [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The usage of
educational tools in general, and particularly participation in the forum is included in
the set of attributes that form the models of Cohen [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Costa [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Santana [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
Lykourentzou [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and Lara [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Finally, the marks attained in assessment activities or exams
are analysed in the studies of Burgos [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Costa [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Santana [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], Lykourentzou [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
and Kotsiantis [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Regarding classification methods, apart from Cohen [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] who uses a unique method
based on comparing changes in attribute values of a student with respect to the mean of
attribute values of the whole group of students, the studies of [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref4 ref9">4, 9-13</xref>
        ] use a great
variety of machine-learning techniques. Algorithms based on neural networks and support
vector machines are common to [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref4 ref9">4, 9-13</xref>
        ], whereas naive Bayes and decision tree
classifiers are only employed by Costa [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Santana [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and Kotsiantis [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Finally,
logistic regression is also included in the set of classifiers of Burgos [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Lara [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and
Kotsiantis [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Although the study of Romero [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] does not explicitly mention the dropout problem,
as it aims to predict the final performance of students by classing them as passed or
failed, it could be deemed as a dropout problem according to Liu’s definition [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Moreover, like some of the references previously reviewed, an early prediction is released,
and the usage of the forum is the source of information to feed the attributes. The study
stands out for the comparative performance of 14 classification algorithms and reaches
the conclusion that the sequential minimal optimization (SMO) algorithm, related to
support vector machines, is the best performer. It is worth recalling that the studies of
[
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref3 ref4 ref9">3-4, 9-13</xref>
        ] all included that machine-learning technique.
      </p>
      <p>
        The high dropout rates are also a major source of concern in Massive Open Online
Courses [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and, in order to reduce them, several studies have dealt with their early
prediction [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15-17</xref>
        ]. These studies differ both in the type of dependent variables and the
machine learning methods used in their models. First, whereas the studies by
Ruiperez-Valiente [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and Sharma [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] include the scores awarded after assignment submission,
Yang [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] only takes into account the behaviour in the discussion forum. And second,
prediction algorithms based on artificial neural networks are the ones chosen by Sharma
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], while Ruiperez-Valiente [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] implemented random forests, generalised boosted
regression modelling, K-nearest neighbours and a logistic regression, and Yang [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
used a survival model. Sharma [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] finds a relationship between students failing in
assignments and dropping out of the course.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <sec id="sec-3-1">
        <title>Participants and learning context</title>
        <p>The participants in this study were the 197 students enrolled in the first semester of the
2018/19 fully asynchronous online one-semester statistics module, which formed part
of the Computer Engineering degree at the Universitat Oberta de Catalunya.</p>
        <p>The teaching plan for this statistics module allowed students to complete optional
quizzes (Quizzes) and constructed-response questions (R.Questions) that had to be
solved by using the statistical program R. Six different pairs (Quiz, R.Question), named
continuous assessment tests, were scheduled throughout the semester. Quizzes were
corrected and marked immediately, providing automated feedback. R.Questions
required manual teacher correction and feedback was delayed. The scores attained, which
formed part of the continuous assessment mark, could be included in the final mark.
The module included two assessment instruments: (a) a compulsory in-person final
exam, and (b) non-compulsory online continuous assessment throughout the semester.
The final mark for the module was mainly based on the final exam mark, which could
be modified slightly by the continuous assessment mark. In addition, during the first
week teachers assigned an initial test to ascertain students’ prior knowledge of
secondary-education statistics. In order to encourage participation, students who voluntarily
completed and submitted the test obtained a bonus, which also formed part of the
continuous assessment mark.</p>
        <p>An e-learning platform provides students enrolled in the statistics module of the
Universitat Oberta de Catalunya with a communication tool: a forum, and an information
tool: a bulletin board. The latter was used by teachers to upload course information
which was mostly only accessible by students via that bulletin board. The former
allowed students and teachers to interact with each other, in general, asynchronously. The
e-learning platform also included direct access to view the teaching plan, which
contained precise information about the assessment system. All reading access to the
bulletin board, forum and teaching plan, as well as writing access to the forum were
recorded by the information system of the Universitat Oberta de Catalunya.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Measure and data collection</title>
        <p>The data has been collected in four instants of time, which coincide with the first four
continuous assessment test submission deadlines, the only ones in the first half of the
course. The separation between submission deadlines is variable, ranging from 1 to 3
weeks. We define four periods of time (Period.1, ..., Period.4) from the previous
submission deadline as follows: Period.1 is the interval of time between the first day of the
semester and the first submission deadline, Period.2 is the interval of time between the
first and second submission deadlines, and so on for Period.3 and Period.4.</p>
        <p>During the first period (Period.1), we gathered students’ registration data such as the
number of courses enrolled on in the semester and whether or not they were repeater
students. This data, contained in the information system of the Universitat Oberta de
Catalunya and anonymously delivered to us, filled the instances of the attributes
Repeating and Enrolled_Courses (see Table.1). The Moodle activity log was the source of
information to determine whether the student had submitted the initial test or not, and
likewise the first continuous assessment test. With that data, the instances of the
attributes Initial_Test, Quiz_Till_Period.1 and R.Question_Till_Period.1 were filled (see
Table.1). The e-learning platform activity log provided the date and time of all access to
the platform which, after being pre-processed, filled the instances of the attributes
BBoard_Till_Period.1, Forum_Wr_Till_Period.1, Forum_Re_Till_Period.1 and
Teaching_Plan_Viewed_Till_Period.1 (see Table.1). All the previous data, transferred
to the second period (Period.2) and incremented with the specific information collected
in Period.2, filled the attributes ending in _Till_Period.2. This procedure was repeated
for Period.3 and Period.4 (see Table.1).</p>
        <table-wrap id="tab-1">
          <label>Table 1</label>
          <table>
            <thead>
              <tr>
                <th>Attribute</th>
                <th>Description</th>
                <th>Type and values</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>R.Question_Till_Period.i</td>
                <td>Indicates the number of R.Questions completed and submitted until the last day of the Period.i</td>
                <td>Type: Integer. Values: {0, 1, ..., i}</td>
              </tr>
              <tr>
                <td>BBoard_Till_Period.i</td>
                <td>Indicates the number of periods in which the student has accessed the board until the last day of the Period.i</td>
                <td>Type: Integer. Values: {0, 1, ..., i}</td>
              </tr>
              <tr>
                <td>Forum_Wr_Till_Period.i</td>
                <td>Indicates the number of periods in which the student has written messages on the forum until the last day of the Period.i</td>
                <td>Type: Integer. Values: {0, 1, ..., i}</td>
              </tr>
              <tr>
                <td>Forum_Re_Till_Period.i</td>
                <td>Indicates the number of periods in which the student has read messages on the forum until the last day of the Period.i</td>
                <td>Type: Integer. Values: {0, 1, ..., i}</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
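        <p>The cumulative "_Till_Period.i" attributes can be illustrated with a minimal Python sketch. It assumes that the pre-processed activity log has already been reduced to the list of period numbers in which a student performed a given action; the function name till_period and the sample data are hypothetical and do not come from the study.</p>
        <preformat><![CDATA[
```python
def till_period(active_periods, i):
    """Count the distinct periods 1..i in which the student was active.

    active_periods: period numbers (1-4) in which the action was logged;
    repeated accesses within the same period count only once.
    """
    return sum(1 for p in set(active_periods) if 1 <= p <= i)


# Hypothetical log: bulletin board accessed twice in Period.1,
# once in Period.3 and once in Period.4.
bboard_periods = [1, 1, 3, 4]

# Values of BBoard_Till_Period.i for i = 1, ..., 4.
bboard_till = {i: till_period(bboard_periods, i) for i in range(1, 5)}
```
]]></preformat>
        <p>By construction, each value lies in {0, 1, ..., i} and never decreases from one period to the next.</p>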
        <p>
          The attribute selection of our study is based on the references in section Literature
review [
          <xref ref-type="bibr" rid="ref4 ref9">4,9</xref>
          ]. Nevertheless, we have not considered the scores of assessment activities
as attributes as [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] does. Instead, we have opted for the completion or non-completion
of Quizzes and R.Questions. There are two main reasons for this decision. The first
reason is that Quiz and R.Question submission data are available sooner than definitive
marks, since R.Questions are marked manually (as mentioned in section 3.1) and
students may apply for marking reviews. The second reason is the likely high
correlation between completion of an assessment activity and its mark, as shown in a calculus
module of the same degree and in a very similar educational context [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
        <p>
          In the present study, we have defined, based on [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], a dropout student as the student
who attains a final mark of “Not Completed”, which means the student has not taken
the compulsory final exam. That approach is in line with that of [
          <xref ref-type="bibr" rid="ref12 ref4">4, 12</xref>
          ]. The boolean
response variable Y=Dropout indicates whether the student complies (I.Dropout) or not
(I.Completer) with the previous definition, that is, whether they belong to the dropout
student class or to the completer student class. To fill the instances of that variable, the
information system of the Universitat Oberta de Catalunya has anonymously delivered
the final marks to us. By combining the attributes, as predictors, with the response variable Y,
four data sets are available, each of which contains the instances of the attributes of
one period and the instances of the Dropout variable.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Classification method</title>
        <p>
          We pose a classification problem, the result of which will be a binary classification
model or binary classifier in order to predict whether a student will be classed as a
dropout student or completer student at the end of the semester. In addition, we require
the classifier to be easily interpretable, although at the expense of it not being the best
performer in terms of the usual evaluation measures of classification performance like
accuracy, precision or recall. Due to “tree-based methods being simple and useful for
interpretation” [19, p. 303], we have decided to use those methods of classification in
our study. Basically, a binary decision tree is a directed graph that starts in a node
called root, follows through arcs called branches, and ends in the terminal nodes, called
leaves. Each nonterminal node, including the root, represents an attribute (a test on the
attribute), and each leaf represents one of the two classes (dropout student or completer
student) or the proportion of students that belong to each class. The branches that come
out of a node represent the values of the attribute associated with the node (the answer
to the test on the attribute) [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
        <p>
          Given our attribute selection (see Table.1), we observe that not all the attributes have
the same number of possible values. A widely reported issue with
decision tree models is the bias, when creating the nodes, towards attributes with a large number
of possible values [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. Conditional tree models mitigate that bias [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], and for that
reason those models are the classification methods we have chosen. To grow our
conditional trees, we have used the ctree() function provided by the statistical program R.
        </p>
        <p>
          For each of the four data sets a classification model has been built (Model.1,
Model.2, Model.3, Model.4). In order to evaluate the performance of the models, firstly
the whole data set can be split into two mutually exclusive sets: the training set and the
validation one. Secondly, with the training set the classification model is fitted. And
finally, the evaluation of the performance is carried out using the validation set [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. In
our study, we have conducted a random stratified split into a training set (80% of the
whole set) and a validation set (20%), keeping the same class distribution of the whole
set in each subset.
        </p>
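        <p>The stratified 80/20 split described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the study itself was carried out in R, and the function name stratified_split, the random seed and the class counts below are hypothetical, not the study's data.</p>
        <preformat><![CDATA[
```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split indices into training and validation sets class by class,
    so that both subsets keep the class distribution of the whole set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # random draw within the class
        cut = round(train_fraction * len(indices))
        train_idx.extend(indices[:cut])
        valid_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(valid_idx)


# Hypothetical labels: 60 dropouts and 140 completers.
labels = ["I.Dropout"] * 60 + ["I.Completer"] * 140
train_idx, valid_idx = stratified_split(labels)
```
]]></preformat>
        <p>With these hypothetical counts, the dropout share is 30% in the whole set, in the training set and in the validation set alike.</p>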
        <p>Taking into account that our main purpose was to identify dropout-prone students,
we have considered students predicted as dropouts, that is, those whose predicted class
is I.Dropout, as “Positive” cases, and the others, those whose predicted class is
I.Completer, as “Negative” cases. Moreover, as usual, we differentiate between “True” or
“False” depending on whether the predicted class coincides with the observed class or
not, respectively. Table.2 depicts the four possible pairs when applying the validation
set to the model fitted with the training set.</p>
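        <p>With dropouts as the “Positive” class, the usual evaluation measures follow directly from the four counts of true/false positives and negatives. The Python sketch below uses the standard definitions of accuracy, precision and recall; the function name and the counts are hypothetical and are not the values obtained in the study.</p>
        <preformat><![CDATA[
```python
def classification_measures(tp, fp, tn, fn):
    """Standard measures computed from confusion-matrix counts.

    tp: predicted I.Dropout, observed I.Dropout
    fp: predicted I.Dropout, observed I.Completer
    tn: predicted I.Completer, observed I.Completer
    fn: predicted I.Completer, observed I.Dropout
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Share of predicted dropouts that really drop out.
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        # Share of real dropouts that the model detects.
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
    }


# Hypothetical validation-set counts (40 students).
measures = classification_measures(tp=9, fp=1, tn=27, fn=3)
```
]]></preformat>
        <p>In this hypothetical case accuracy and precision are both 0.9, while recall is 0.75: three of the twelve real dropouts would go undetected.</p>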
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>
        The four classifiers created from the training set are extremely simple: each of them
contains a single node. Table.3 depicts each model as a decision rule. The only
attribute shown in the models, a result that reveals that it is the one with the strongest
association with the response Dropout [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], is the completion of Quizzes
(Quiz_Till_Period.i). Using the first three models (Model.1, Model.2, Model.3),
students are classified/predicted as dropout students (class I.Dropout) if they have not
completed all the Quizzes scheduled until the end of the period associated with the
model, in other words, if they have not completed one or more of those Quizzes. As an
example, at the end of Period.3, students who have not completed all three Quizzes
corresponding to the first three continuous assessment tests are classified/predicted as
dropout students. Only students who have completed all three Quizzes are
classified/predicted as completer students (class I.Completer). In Model.4, the last condition
is softened, so that students are classified/predicted as completer students even if they
have not completed all four Quizzes: they may have skipped one Quiz at most.
      </p>
      <p>
        * p-value &lt; 0.001. H0: D(Dropout | Quiz_Till_Period.i) = D(Dropout), that is, H0: The response
Dropout is independent of the predictor Quiz_Till_Period.i.
      </p>
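      <p>Read as decision rules, the four models reduce to a threshold on the number of completed Quizzes. The Python sketch below encodes the rules as described in the text; the names MIN_QUIZZES and predict are hypothetical, and the thresholds are taken from the prose description rather than from the fitted trees themselves.</p>
      <preformat><![CDATA[
```python
# Minimum number of completed Quizzes needed to be predicted a completer.
# Model.1-Model.3 require all Quizzes so far; Model.4 tolerates one skipped Quiz.
MIN_QUIZZES = {1: 1, 2: 2, 3: 3, 4: 3}


def predict(model, quizzes_completed):
    """Predicted class for a student given Quiz_Till_Period.i (i = model)."""
    if not 0 <= quizzes_completed <= model:
        raise ValueError("Quiz_Till_Period.i must lie in {0, 1, ..., i}")
    if quizzes_completed >= MIN_QUIZZES[model]:
        return "I.Completer"
    return "I.Dropout"


# A student with two of the first three Quizzes completed.
prediction_model_3 = predict(3, 2)
# A student with three of the four Quizzes completed.
prediction_model_4 = predict(4, 3)
```
]]></preformat>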
      <p>
        Using the validation set, three results stand out in the evaluation measures of the
classification performance (see Table.4). First, Accuracy shows a gradual increase from
the first model and, in Model.3, exceeds 90%, a level considered acceptable
by [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Second, Precision also grows from the second model and reaches the value
100% in Model.4. And third, Recall also rises from the first model and reaches its
highest value in Model.3, just when Accuracy attains the level of “acceptable”.
      </p>
      <p>
        From the very beginning we aimed to find a classification model that was easily
interpretable, even at the expense of not finding the best-performing classifier. The four
classification models (see Table.3) entirely comply with the previous requirement. In the
rest of the section, we discuss the following three statements: (a) completing evaluative
quizzes is the only attribute that determines the classification process, (b) the simplicity
of the models eases the creation of an overall classification procedure that includes all
the models, and (c) applying the models separately, Model.3 is the best.
      </p>
      <p>
        Above all, it is worth noting that only one attribute, Quiz_Till_Period.i,
intervenes in the classification process, as the four models show (see Table.3). The
Quiz_Till_Period.i attribute, directly related to the completion of Quizzes, has
a basically evaluative character, which sets it apart from the attributes related to the usage
of the e-learning platform, such as the forum. The dominance of evaluative attributes is
in line with the study of Costa [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], who found that the most important attribute was the
midterm marks. On the other hand, completion of R.Questions likewise has an evaluative
character, but nonetheless does not intervene in the final models. The main difference
between Quizzes and R.Questions is that the latter require students to apply higher-level
skills than the former. Consequently, it seems reasonable to argue that students
who do not even complete the least demanding assessment assignments, such as
Quizzes, are the most prone to becoming dropout students. And last but not least, the
first three models separate dropout and completer students depending on whether or not
they have completed all the Quizzes scheduled until the moment the model is applied.
Therefore, we can interpret that the continued “doing” of Quizzes is the relevant aspect
in differentiating those who complete the course from those who do not.
      </p>
      <p>The simplicity of the model reduces the information actually used
to that related to the completion of Quizzes, which in turn has two beneficial
consequences: (a) no time is spent gathering and processing the remaining
attributes, and (b) teachers can collect the required data themselves, directly from
the Moodle activity log. Using the first three models in cascade, once the submission
deadline of the first continuous assessment test has passed, the teacher can create a list of
dropout-prone students by selecting those who have not completed the first Quiz. After the
second submission deadline, the teacher can add new dropout-prone students to the
list by selecting those who have not completed the second Quiz, and likewise, after the
third deadline, those who have not completed the third one. By following this simple
procedure, the teacher extends the list of dropout-prone students step by step, which can be
useful when deciding on possible measures to change the predicted unsuccessful
outcome.</p>
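      <p>The cascade described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the study: the function names, the student identifiers and the in-memory completion sets are hypothetical, and in practice the completion records would be read from the Moodle activity log.</p>
      <preformat>
```python
from typing import Iterable, List, Set


def update_dropout_prone(flagged: Set[str],
                         enrolled: Iterable[str],
                         completed: Set[str]) -> Set[str]:
    """Return the flagged set extended with students who missed the current Quiz."""
    missed = {student for student in enrolled if student not in completed}
    return flagged | missed


def cascade(enrolled: Iterable[str],
            completions_by_quiz: List[Set[str]]) -> List[List[str]]:
    """Apply the rule after each deadline and keep the cumulative list per step."""
    enrolled = list(enrolled)
    flagged: Set[str] = set()
    history = []
    for completed in completions_by_quiz:
        flagged = update_dropout_prone(flagged, enrolled, set(completed))
        history.append(sorted(flagged))
    return history


# Hypothetical identifiers and completion records, for illustration only.
enrolled = ["s01", "s02", "s03", "s04"]
completions_by_quiz = [
    {"s01", "s02", "s03"},  # Quiz 1: s04 did not complete it
    {"s01", "s03", "s04"},  # Quiz 2: s02 did not complete it
    {"s01", "s02", "s03"},  # Quiz 3: s04 did not complete it (already listed)
]

history = cascade(enrolled, completions_by_quiz)
for step, names in enumerate(history, start=1):
    print(f"After deadline {step}: {names}")
```
      </preformat>
      <p>In this sketch a student, once listed, stays on the list, which mirrors the step-by-step accumulation described above.</p>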
      <p>
        The performance measures (see Table 4) indicate that, for Model.1, Precision is quite
high but Recall is not. This can be interpreted as follows: a limited number of
students classed as I.Dropout will eventually become completers, whereas a
significant number of students classed as I.Completer will eventually become dropouts. As a
result, a limited number of students may be the target of unnecessary teacher intervention
but, what is worse, a significant number of students will fall outside the scope of teacher
intervention that would have been useful had they been correctly classified. Because
our purpose is to identify as many dropout-prone students as possible,
Recall takes priority over Precision. Consequently, Model.1 turns out to be rather
unsatisfactory. Model.2 is slightly more satisfactory than Model.1 because of
its higher Recall, but Model.3 is the best option owing to its reasonably high values of
Precision, Recall, and Accuracy (90.2%, which is therefore acceptable according
to [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]). Moreover, Model.3 can be applied after the seventh week of the course, some
way before the halfway point of the semester. Finally, because Model.4’s Recall
does not improve on that of Model.3, and given that our purpose included identification
“as soon as possible”, we can state that Model.3 is preferable to Model.4.
      </p>
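      <p>To make the trade-off between Precision and Recall concrete, the following minimal Python sketch computes the three measures from the four cells of a confusion matrix, taking dropout as the positive class. The counts are hypothetical and are not the values reported in the study; they were chosen only to reproduce the Model.1 pattern of high Precision with low Recall. The function name is likewise an assumption.</p>
      <preformat>
```python
def classification_measures(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, Precision and Recall with 'dropout' as the positive class.

    tp: predicted dropout, actual dropout
    fp: predicted dropout, actual completer (unnecessary intervention)
    fn: predicted completer, actual dropout (missed intervention)
    tn: predicted completer, actual completer
    """
    total = tp + fp + fn + tn
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / total if total else nan,
        "precision": tp / (tp + fp) if (tp + fp) else nan,
        "recall": tp / (tp + fn) if (tp + fn) else nan,
    }


# Hypothetical counts for illustration only (not taken from the study).
measures = classification_measures(tp=30, fp=5, fn=20, tn=45)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.750
# precision: 0.857
# recall: 0.600
```
      </preformat>
      <p>With these illustrative counts, few of the students flagged as dropouts are flagged wrongly (5 of 35), yet 20 of the 50 actual dropouts are missed, which is the situation that makes a model unsatisfactory when Recall is the priority.</p>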
    </sec>
    <sec id="sec-5">
      <title>Conclusion and further research</title>
      <p>The main contribution of the present study is a simple and easy-to-use
procedure, based on several conditional tree-based classification models, to identify
dropout-prone students before the halfway point of the semester. Firstly, it is simple
because only a single attribute contributes to classifying students. That
attribute relates to students’ behaviour regarding the completion of low-stakes
assessment assignments, such as quizzes posed by teachers, and not to usage of the
e-learning platform, such as forum participation. Secondly, it is easy to use because
knowing that a student has not completed one of the first three
quizzes is enough to identify him/her directly as a dropout-prone student. Furthermore,
because the information required is not only easily accessible to the teacher but also
needs no processing, teachers can run the procedure by themselves and
implement it once the first quiz has been submitted. If the performance measures are a
serious concern for the teacher, the procedure has to be modified,
although it remains simple and easy to use: it then consists of checking whether
students have completed all of the first three quizzes. If the answer is no, the student is
identified as a dropout-prone student.</p>
      <p>
        Under the methodology selected, the students in the training set,
on whom the classification models were fitted, and the students in the validation
set, on whom the models’ performance was evaluated, were all enrolled in the same
academic year. This limitation could be addressed in further research. The studies of Lykourentzou
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], Lara [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and Kotsiantis [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], which create the training set in one academic period
and the test set in a different one, are useful references in this respect.
      </p>
      <p>
        A second aspect that could be addressed in further research is the extension of the
identification procedure to fail-prone students [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], so that a richer approach to the
dropout prediction problem could be achieved.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This paper has been partially supported by a Fundació IBADA grant. We would like to
thank Dr Laura Calvet and Mr Paul Garbutt for their valuable contributions in helping
to improve this study.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A review of online course dropout research: Implications for practice and future research</article-title>
          .
          <source>Educ. Technol. Res. Dev</source>
          .
          <volume>59</volume>
          ,
          <fpage>593</fpage>
          -
          <lpage>618</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Vossensteyn</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kottmann</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jongbloed</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cremonini</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stensaker</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovdhaugen</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wollscheid</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Drop-Out and Completion in Higher Education in Europe - Literature Review</article-title>
          . (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Analysis of student activity in web-supported courses as a tool for predicting dropout</article-title>
          .
          <source>Educ. Technol. Res. Dev</source>
          .
          <volume>65</volume>
          ,
          <fpage>1285</fpage>
          -
          <lpage>1304</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Burgos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campanario</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de la Peña</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lara</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lizcano</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martínez</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>Data mining for modeling students' performance: A tutoring action plan to prevent academic dropout</article-title>
          .
          <source>Comput. Electr. Eng</source>
          .
          <volume>66</volume>
          ,
          <fpage>541</fpage>
          -
          <lpage>556</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Grau-Valldosera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Minguillón</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Rethinking dropout in online higher education: The case of the universitat oberta de catalunya</article-title>
          .
          <source>Int. Rev. Res. Open Distance Learn</source>
          .
          <volume>15</volume>
          ,
          <fpage>290</fpage>
          -
          <lpage>308</lpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yen</surname>
            ,
            <given-names>C.-J.</given-names>
          </string-name>
          :
          <article-title>Community College Online Course Retention and Final Grade: Predictability of Social Presence</article-title>
          .
          <source>J. Interact. Online Learn</source>
          .
          <volume>8</volume>
          ,
          <fpage>165</fpage>
          -
          <lpage>182</lpage>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Comparing dropouts and persistence in e-learning courses</article-title>
          .
          <source>Comput. Educ</source>
          .
          <volume>48</volume>
          ,
          <fpage>185</fpage>
          -
          <lpage>204</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Dupin-Bryant</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          :
          <article-title>Pre-Entry Variables Related to Retention in Online Distance Education</article-title>
          .
          <source>Am. J. Distance Educ</source>
          .
          <volume>18</volume>
          ,
          <fpage>199</fpage>
          -
          <lpage>206</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Costa</surname>
            ,
            <given-names>E.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fonseca</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santana</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Araújo</surname>
            ,
            <given-names>F.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rego</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Evaluating the effectiveness of educational data mining techniques for early prediction of students' academic failure in introductory programming courses</article-title>
          .
          <source>Comput. Human Behav</source>
          .
          <volume>73</volume>
          ,
          <fpage>247</fpage>
          -
          <lpage>256</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Santana</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Costa</surname>
            ,
            <given-names>E.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neto</surname>
            ,
            <given-names>B.F.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>I.C.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rego</surname>
            ,
            <given-names>J.B.A.</given-names>
          </string-name>
          :
          <article-title>A predictive model for identifying students with dropout profiles in online courses</article-title>
          .
          <source>In: Workshop Proceedings of the EDM 2015 International Conference on Educational Data Mining</source>
          Vol
          <volume>1446</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lykourentzou</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giannoukos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikolopoulos</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mpardis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loumos</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Dropout prediction in e-learning courses through the combination of machine learning techniques</article-title>
          .
          <source>Comput. Educ</source>
          .
          <volume>53</volume>
          ,
          <fpage>950</fpage>
          -
          <lpage>965</lpage>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lara</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lizcano</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martínez</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pazos</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riera</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>A system for knowledge discovery in e-learning environments within the European Higher Education Area - Application to student data from Open University of Madrid, UDIMA</article-title>
          .
          <source>Comput. Educ</source>
          .
          <volume>72</volume>
          ,
          <fpage>23</fpage>
          -
          <lpage>36</lpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kotsiantis</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pierrakeas</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pintelas</surname>
            ,
            <given-names>P.E.</given-names>
          </string-name>
          :
          <article-title>Preventing Student Dropout in Distance Learning Using Machine Learning Techniques</article-title>
          .
          <source>In: Proceedings of the 7th International Conference on Knowledge-Based Intelligent Information and Engineering Systems, KES 2003</source>
          , pp.
          <fpage>267</fpage>
          -
          <lpage>274</lpage>
          , Oxford, UK (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Romero</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>López</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luna</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ventura</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Predicting students' final performance from participation in on-line discussion forums</article-title>
          .
          <source>Comput. Educ</source>
          .
          <volume>68</volume>
          ,
          <fpage>458</fpage>
          -
          <lpage>472</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Ruiperez-Valiente</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muñoz-Merino</surname>
            ,
            <given-names>P.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andújar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Delgado-Kloos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Early Prediction and Variable Importance of Certificate Accomplishment in a MOOC</article-title>
          .
          <source>Proceedings of the European Conference on Massive Open Online Courses</source>
          ,
          <fpage>263</fpage>
          -
          <lpage>272</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kidzinski</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jermann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dillenbourg</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Towards Predicting Success in MOOCs: Programming Assignments</article-title>
          .
          <source>Proceedings of the European Stakeholder Summit on Experiences and Best Practices Around MOOCs (EMOOCS)</source>
          ,
          <fpage>135</fpage>
          -
          <lpage>148</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sinha</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adamson</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Penstein Rose</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>“Turn on, Tune in, Drop out”: Anticipating Student Dropouts in Massive Open Online Courses</article-title>
          .
          <source>Proceedings of the 2013 NIPS Data-driven Education Workshop</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Figueroa-Cañas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sancho-Vinuesa</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Investigating the relationship between optional quizzes and final exam performance in a fully asynchronous online calculus module</article-title>
          .
          <source>Interact. Learn. Environ</source>
          . (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>James</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tibshirani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>An Introduction to Statistical Learning</article-title>
          . (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Kotsiantis</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaharakis</surname>
            ,
            <given-names>I.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pintelas</surname>
            ,
            <given-names>P.E.</given-names>
          </string-name>
          :
          <article-title>Machine learning: A review of classification and combining techniques</article-title>
          .
          <source>Artif. Intell. Rev</source>
          .
          <volume>26</volume>
          ,
          <fpage>159</fpage>
          -
          <lpage>190</lpage>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Hothorn</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hornik</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeileis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Unbiased recursive partitioning: A conditional inference framework</article-title>
          .
          <source>Research Report Series 8</source>
          , Department of Statistics and Mathematics, WU Wien,
          <year>2004</year>
          .
          <source>J. Comput. Graph. Stat</source>
          .
          <volume>15</volume>
          ,
          <fpage>651</fpage>
          -
          <lpage>674</lpage>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>