=Paper=
{{Paper
|id=Vol-3051/UGR_1
|storemode=property
|title=Using Course Evaluations and Student Data to Predict Computer Science Student Success
|pdfUrl=https://ceur-ws.org/Vol-3051/UGR_1.pdf
|volume=Vol-3051
|authors=Anlan Du,Alexandra Plukis,Huzefa Rangwala
|dblpUrl=https://dblp.org/rec/conf/edm/DuPR21
}}
==Using Course Evaluations and Student Data to Predict Computer Science Student Success==
Anlan Du∗ (University of Virginia, Charlottesville, VA, United States; amd5wf@virginia.edu), Alexandra Plukis∗ (Arizona State University, Tempe, AZ, United States; aplukis@asu.edu), Huzefa Rangwala (George Mason University, Fairfax, VA, United States; hrangwal@gmu.edu)

∗ Both authors contributed equally to this research.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

As the field of computer science has grown, the question of how to improve retention in computer science, especially for females and minorities, has grown increasingly important. Previous research has looked into attitudes among those who leave CS, as well as the impact of taking specific courses; we build on this body of research using large-scale analysis of course evaluations and students' academic history. Our goal is to understand their potential connection to a student's performance and retention within the CS major. We process course-specific data, faculty evaluations, and student demographic data through various machine learning-based classifiers to understand the predictive power of each feature. We find that our algorithms perform significantly better for higher-performing students than for lower-performing ones, but we do not find that evaluations significantly improve predictions of students doing well in courses and staying in the major.

KEYWORDS

educational data mining, course evaluations, computer science, retention, grade prediction, algorithm fairness

1 INTRODUCTION

Among the most important aspects of a college education are the classes a student takes. Often, college students use introductory courses to decide what they would like to study and pursue. A bad experience in an introductory course might detract from a student's first impression of a field, while a good experience might improve his or her opinion, even boosting retention and improving skills upon graduation [13]. It is therefore key that administrators and professors alike understand which course characteristics maintain interest and improve student outcomes. Such information can inform administrative decisions, such as who is assigned to teach particular courses and the recommended sequence of courses.

The digitization of student records and course evaluations offers a unique opportunity to apply big-data modeling techniques to the study of retention. George Mason University (GMU), the data source for this work, keeps anonymized records of students' academic records in high school, demographic data, and their course loads and grades at the university. It also administers standardized course evaluations across all courses. Various data mining and modeling techniques, such as decision trees and support vector machines, can be applied to these datasets and their results compared. Using this data, one can more easily find patterns that reveal how different traits affect student retention.

George Mason also offers a unique opportunity to analyze the impact of professor gender on student success. George Mason's engineering faculty is 26.8% female, more than 1.5 times the national average of 15.7% [10][16]. A larger female faculty means that analyses of the impact of instructor gender are less likely to be swayed by a single professor and are therefore more statistically meaningful.

2 RELATED WORK

Our work builds upon previous research regarding both college retention and achievement in courses, both in general and across demographic groups [7]. Demographic disparities are particularly evident in the number of degrees awarded. For instance, during George Mason University's 2017-2018 school year only 15.8% of the 196 computer science (CS) degrees awarded went to women. This lack of representation is even more pronounced for minority students: only 6 CS degrees were awarded to African American students and 16 to their Hispanic counterparts [9]. These disparities have led to a large body of research into retention for minorities in STEM generally and in CS specifically [8][1][15]. Bettinger and Long researched the impact of female faculty on female retention in majors or repeated interest in classes and found mixed results: some disciplines, such as statistics and mathematics, benefited from an early introduction to a female professor, while others saw a decrease in female retention. The authors pointed out that it was difficult to gauge the exact impact of female professors in fields with low proportions of women among the faculty, such as engineering and physics. We hope to improve upon this because women make up 26.8% of the full-time academic faculty in George Mason's School of Engineering, far surpassing the national average of 15.7% [16][10].

The issue of student performance and retention extends beyond under-represented minorities. Cucuringu et al. used fifteen years of student data to find classes that optimized a student's likelihood of successfully completing a course of study with high grades [5]. They also took the step of segmenting a student population into sub-groups based on various characteristics, so as to understand the nuances that different types of students might experience. Morsy and Karypis used a similarly broad approach to predict student performance based on previous classes taken [14].

Research specific to CS retention has also been conducted: Biggers et al. incorporated interviews of students who left CS, seeking to find the qualitative sentiments that affected both female and male students' decisions [2]. We combine these two approaches by using data on students' individual demographics, grades, and course history to understand how each factor may contribute to both student performance and choice of major. Additionally, we incorporate student evaluations for the courses they take to understand the role that these qualitative elements may play in these outcomes, as suggested by Biggers et al. [2].

Research on course evaluations suggests they may prove informative with regard to a student's academic experience. Much research has studied the relationship between the ease of a course, often represented by the grade a student receives, and the rating of the faculty. One well-known meta-analysis by Cohen argued that students are fairly accurate in their assessments of instructional efficacy [6]. Centra's study built upon this notion and further emphasized that students do not give higher evaluations to professors in a quid pro quo for higher grades: both extremely easy and extremely difficult courses suffered in student evaluations, while courses with appropriate difficulty received the best evaluations [3]. Feldman analyzed the contributory power of various teacher characteristics to a teacher's overall rating and student achievement, finding that preparation, organization, clarity, and students' feelings of engagement contributed most strongly to overall performance [11]. He also highlighted some myths about student evaluations, citing research that suggests that they can, in fact, be informative. We incorporate evaluations in order to expand on these questions of student evaluation efficacy and to understand what they say about students' experiences and choices.
3 PROBLEM DESCRIPTION

The objective of this study is to investigate a few questions relating course quality—defined using faculty traits such as gender and instructional evaluations—to student retention in computer science. Specifically, we address the following questions:

(1) Which course features, if any, in lower-division CS courses improve graduation retention for students?
(2) Which features, if any, of instructors in introductory CS courses can predict student success in future CS courses?
(3) Do non-CS courses that are required of CS majors, like calculus, have an impact on major retention? If so, which courses and features have the largest impact?

4 MATERIALS

4.1 Dataset

Our dataset consisted of records containing first-time freshman student enrollment and course evaluation data for 20,825 George Mason students over the span of eight years, from Summer 2009 to Fall 2018. All student data were collected and anonymized in accordance with GMU's Institutional Review Board policies. The student data contained demographic data such as age, sex, and race; admissions data such as high school, SAT score, and high school GPA; and course data such as declared major, graduation year, courses taken, and grades received. Students who transferred into GMU were not included in this dataset because they likely had completed introductory courses at their previous institutions, rendering that first-year data inaccessible to us.

We also collected course evaluation data on 87,629 GMU courses from Summer 2009 to Spring 2019, 8,243 of which were computer science or computer science-adjacent courses. The evaluations are averages of all of the student evaluations for a specific course and section, so there is one evaluation available for each unique course GMU offers. This data was collected from the GMU evaluation site (https://irr2.gmu.edu/), which is publicly available while on campus. As these are publicly available documents on campus and the identifying features were anonymized, they are exempt research under GMU's IRB policy (https://rdia.gmu.edu/topics-of-interest/human-or-animal-subjects/human-subjects/exempt-research/). To collect data on professor gender, we reviewed pronoun usage in departmental documents and consulted faculty members when documentation was insufficient.

The courses we describe as CS-adjacent are courses taught by or in conjunction with the Department of Computer Science at GMU. These CS-adjacent courses include Information Technology, Computer Game Design, Software Engineering, Electrical and Computer Engineering, and Information Systems. After discarding course records with no grades or with grades that do not translate to the A-F scale, and applying our course filters, we had records for 57,627 student-course enrollments.
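To make the filtering step above concrete, the following is a minimal sketch (not the authors' code) of restricting a raw enrollment table to CS and CS-adjacent records with A-F grades; the DataFrame layout and the subject-code abbreviations are hypothetical assumptions.

    # Illustrative filtering of raw enrollment records, assuming hypothetical
    # "subject" and "grade" columns; the subject codes below are guesses at the
    # CS-adjacent departments named in the text, not GMU's actual codes.
    import pandas as pd

    CS_ADJACENT_SUBJECTS = {"CS", "IT", "GAME", "SWE", "ECE", "IST"}
    AF_GRADES = {"A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F"}

    def filter_enrollments(enrollments: pd.DataFrame) -> pd.DataFrame:
        # Keep only in-scope courses whose grades map onto the A-F scale.
        in_scope = enrollments["subject"].isin(CS_ADJACENT_SUBJECTS)
        graded = enrollments["grade"].isin(AF_GRADES)
        return enrollments[in_scope & graded]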
4.2 Definitions

We frequently discuss "student success" within the computer science major. In this paper, our definition of "success" is divided into three categories:

Completion of a computer science degree: A student is defined as graduating with a computer science degree if he or she graduated with a major in either computer science or applied computer science. A student is defined as not graduating with a CS degree if he or she graduated, but not with a CS major. Because we are focused on retention, not graduation, we only included in our data students who had had enough time to graduate. By not including students who transfer or drop out of GMU, or who simply have not graduated yet, we reduced the number of confounding variables that are not directly related to students' experiences in CS.

Fulfilment of a student's potential in a course: A student's "potential" in CS211 is defined as the term GPA of the semester in which CS112—the direct prerequisite—was taken. Our interest in this stems from its potential in combination with predictions of passing a course. Students who perform below their "potential" in CS211, despite passing and receiving credit for the course, might still benefit from administrator involvement. Alternatively, the characteristics of students performing above their potential may highlight positive factors that should continue to be promoted at an institutional level.

Passing a course for credit: A student is defined as passing a course for credit if he or she receives a C grade or above. At GMU, computer science BS students "must earn a C or better in any course intended to satisfy a prerequisite for a computer science course ... [s]tudents may attempt an undergraduate course taught by the Volgenau School of Engineering twice" (https://catalog.gmu.edu/colleges-schools/engineering/computer-science/computer-science-bs/#admissionspoliciestext). In our research, we specifically target student success in CS112 and CS211 because they are required courses for CS/ACS majors and prerequisites for all other programming courses. Figure 1 visualizes the contrast in pass rates for first and second attempts at CS211: within our dataset, only 19.8% of students attempting CS211 for the first time did not receive credit, versus 63.3% of students on their second attempt.

[Figure 1: Receiving Credit for CS211.]
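The three success definitions above can be expressed as simple binary labels. The sketch below is illustrative only: the per-student DataFrame and its column names (final_major, cs211_grade_points, cs112_term_gpa) are assumptions, and grade points are taken to be on the usual 4.0 scale where a C is 2.0.

    import pandas as pd

    def make_labels(students: pd.DataFrame) -> pd.DataFrame:
        labels = pd.DataFrame(index=students.index)
        # Completion: graduated with a CS or Applied CS (ACS) major.
        labels["graduated_cs"] = students["final_major"].isin(["CS", "ACS"]).astype(int)
        # Passing CS211 for credit: a C (2.0 grade points) or better.
        labels["passed_cs211"] = (students["cs211_grade_points"] >= 2.0).astype(int)
        # Potential: the term GPA of the semester in which CS112 was taken is the
        # forecast; a CS211 grade below that forecast marks the student "at risk".
        labels["below_potential"] = (
            students["cs211_grade_points"] < students["cs112_term_gpa"]
        ).astype(int)
        return labels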
5 METHODS

For this work, we compared the performance of predictive models trained on three different sets of data, which are fully described in Appendix A:

(1) Baseline predictions based on high school performance and student demographic data, as well as basic course information such as the term in which a course was taken and a student's GPA in that term.
(2) Baseline features in addition to instructor gender and course evaluations for the classes—either CS only, or math and CS—taken by each student.
(3) Baseline features, plus course numbers as unique identifiers that are distinct for each section and semester of a class, but common to all of the students who took that section.

We chose to use machine learning classifiers because they can often pick up on more intricate patterns and correlations than linear and other basic statistical models can. We tested these distinct data sets because they each highlight a component of students' courses that may be significant to their performance and ultimate retention. The full list of features used in each experiment is given in Appendix A.

We used seven classifiers from the Python scikit-learn library: Random Forest, Gradient Boosting, AdaBoost, SVC, Decision Tree, Neural Net, and Naive Bayes. For each of these models, we performed 5-fold cross-validation, recording the resulting averages and standard deviations. In order to account for imbalances in our dataset, we decided upon area under the ROC curve (ROC AUC) and F1 score as our main metrics, because they take precision and recall into account in addition to overall accuracy.
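A minimal sketch of this evaluation loop using scikit-learn is shown below; it assumes a feature matrix X and binary label vector y have already been assembled, and it uses default hyperparameters rather than the authors' exact settings.

    import numpy as np
    from sklearn.model_selection import cross_validate, StratifiedKFold
    from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                                  AdaBoostClassifier)
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.naive_bayes import GaussianNB

    CLASSIFIERS = {
        "Random Forest": RandomForestClassifier(random_state=0),
        "Gradient Boosting": GradientBoostingClassifier(random_state=0),
        "AdaBoost": AdaBoostClassifier(random_state=0),
        "SVC": SVC(probability=True, random_state=0),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "Neural Net": MLPClassifier(max_iter=1000, random_state=0),
        "Naive Bayes": GaussianNB(),
    }

    def evaluate(X, y, seed=0):
        """5-fold CV with a deterministic seed; mean and std of F1, ROC AUC, accuracy."""
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        results = {}
        for name, clf in CLASSIFIERS.items():
            scores = cross_validate(clf, X, y, cv=cv,
                                    scoring=["f1", "roc_auc", "accuracy"])
            results[name] = {m: (np.mean(scores[f"test_{m}"]), np.std(scores[f"test_{m}"]))
                             for m in ["f1", "roc_auc", "accuracy"]}
        return results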
5.1 Pre-Processing

We consolidated student data for all students who took at least one CS class, of whom there were 15,552. To better incorporate summer student data, we moved summer courses to the following fall term. Then, we calculated percentile values for students' SAT scores and high school GPAs, enabling us to compare these metrics along a standard scale of 0 to 1. Next, for models predicting retention, we removed all students who had not yet graduated, leaving us with 7,602 students. Lastly, we dropped all students with empty values for any of the columns used in training. This left a dataset of 1,476 students who took at least one CS or CS-adjacent class before graduating. Of those, 330 (22.35%) graduated with a CS or ACS major. This left us with an imbalanced dataset, leading to our decision to use F1 score and ROC AUC to characterize our models.

For the grade prediction portion, students who received no grade—meaning they audited or did not complete the class—were not included in the data. This left 1,728 students who took both CS112 and CS211 at GMU at least once. In cases where students took these courses multiple times, only the initial attempt was used, so as to capture only their original experience in the class. Predicting grades for only first attempts at CS211 also offers an earlier flagging system for at-risk students.

We wanted to understand the impact that not only general instructor qualities, but also "exemplary" instructors, had on student grades. To that end, each grade prediction model was run with the course evaluations processed in one of two ways: percentiles or flags. Percentiles, which capture the general quality of an instructor, transformed each evaluation entry into a percentile relative to the other courses. Flags, which served to identify exemplary instructors, transformed each entry into a binary feature based on whether it was in the top 10% of evaluation scores in that category.

Although evaluations offer more data than can usually be gleaned from student records, we also tried to capture the elements of a course that cannot be captured in evaluations or records. We did so by creating unique course IDs for each course, so as to highlight especially good courses, good times of day for students, and good connections between students in courses—all of which are not explicitly quantified in our data.

5.2 Experiments

As mentioned previously, we had three main groups of datasets. The second group, which includes the course evaluation data, was run on three different subsets: first, it was trained with just the "overall teaching" and "overall course" evaluation scores for the first CS and math courses; then with the overall evaluations from the first two courses; and then with all available course evaluation metrics for the first two courses in each area. For graduation prediction, both math and CS courses were included in the evaluation data in order to capture a full snapshot of introductory courses. For grade prediction, only CS courses were included, so as not to diminish the dataset with students outside CS or STEM, who often do not have the same rigorous math requirements.

Our rationale in deploying some tests with just two course evaluation features per course was that the added dimensionality of running the models on all of the features (many of which were positively correlated) might hinder performance. The baseline was meant to serve as a control for the predictive capability of only basic course features and student demographic information, so that subsequent tests might reveal how much predictive power the additional data added. The full list of features used for each of these experiments is given in Appendix A.

All of our experiments deal with binary classification, and as such require binary flagging of the classes of interest. In grade prediction experiments, those who are at risk—of either not receiving credit or not fulfilling their potential in a course—are flagged with a 1. In the graduation predictions, students who graduate with a computer science or applied computer science degree are flagged with a 1.

For each experiment, we ran 5-fold cross-validation on our models, using a deterministic seed to generate our training-testing splits so that we could directly compare splits before and after the models were trained. We performed Student's t-tests on our results to understand the significance of any differences in performance.
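The two evaluation treatments described in the pre-processing above (percentiles versus top-10% flags) amount to simple column-wise transformations. A hedged sketch follows; the DataFrame of per-section evaluation averages and its layout are hypothetical.

    import pandas as pd

    def to_percentiles(evals: pd.DataFrame) -> pd.DataFrame:
        # Rank each evaluation item against all other course sections,
        # yielding values between 0 and 1.
        return evals.rank(pct=True)

    def to_flags(evals: pd.DataFrame, top_fraction: float = 0.10) -> pd.DataFrame:
        # Flag a section as "exemplary" on an item if its average score falls
        # in the top 10% of all sections for that item.
        thresholds = evals.quantile(1.0 - top_fraction)
        return (evals >= thresholds).astype(int)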
To test the Naive 0.488 0.783 0.719 0.029 0.683 0.383 fairness implications, 5-fold splits were trained on all students and Bayes ±0.217 ±0.035 ±0.024 ±0.026 ±0.034 ±0.009 then tested only on certain quartiles. This way, we could clearly Table 1: Predicting CS211 success—passing the class or see any disparity in performance for all students versus those in achieving one’s "potential" grade—from only student demo- separate groups of students. graphics and basic course features. Graduating Classifier F1 AUC Acc Gradient 0.533 0.855 0.824 Boosting ±0.037 ±0.007 ±0.011 Random 0.447 0.837 0.825 Forest ±0.041 ±0.009 ±0.011 AdaBoost 0.563 0.839 0.823 ±0.047 ±0.028 ±0.023 Decision 0.473 0.662 0.759 Figure 2: High School GPA versus SAT Total score for all non- Tree ±0.018 ±0.012 ±0.015 transfer students who took both CS112 and CS211 at GMU. Neural 0.486 0.762 0.735 ±0.036 ±0.018 ±0.043 SVC 0.0 0.770 0.776 We used these quartiles to test for fairness by training each of ±0.0 ±0.030 ±0.000 our models on the full datasets, splitting up the testing sets based Naive 0.460 0.785 0.515 on the quartiles, and calculating the metrics based on these results. Bayes ±0.009 ±0.029 ±0.018 We then compared these quartile results with the results for all Table 2: Predicting a CS211 success measure—graduating students to determine if there was a significant difference between with a CS degree—from only student demographics and ba- them, and therefore a disparity in fairness for differing groups. sic course features. 6 RESULTS Our results are divided into three sections: (1) Performance metrics (F1 Score, ROC AUC, Accuracy) for our 6.2 Effect of Including Evaluation Data baseline models; Tables 3, 4, and 5 assess the difference in performance between the (2) Comparison between baseline models and models that in- baseline models and those that incorporated evaluation and course clude course evaluation and other instructor data; data. The smallest p-values are in bold. 
6 RESULTS

Our results are divided into three sections:

(1) Performance metrics (F1 score, ROC AUC, accuracy) for our baseline models;
(2) Comparison between baseline models and models that include course evaluation and other instructor data;
(3) Fairness: comparison between prediction for each academic quartile versus prediction for all students.

6.1 Baseline Performance

Tables 1 and 2 show the baseline ability of each machine learning model to predict student success without any course evaluation data. These models were trained and tested on only basic course features, such as the term taken and the number of students in the class, and student demographics.

Table 1: Predicting CS211 success—passing the class or achieving one's "potential" grade—from only student demographics and basic course features.

Classifier        | Passing F1    | Passing AUC   | Passing Acc   | Potential F1  | Potential AUC | Potential Acc
Gradient Boosting | 0.666 ±0.053  | 0.869 ±0.034  | 0.814 ±0.029  | 0.790 ±0.014  | 0.770 ±0.012  | 0.721 ±0.014
Random Forest     | 0.640 ±0.046  | 0.865 ±0.030  | 0.808 ±0.019  | 0.800 ±0.012  | 0.776 ±0.016  | 0.730 ±0.015
AdaBoost          | 0.641 ±0.051  | 0.849 ±0.035  | 0.803 ±0.021  | 0.777 ±0.024  | 0.746 ±0.025  | 0.710 ±0.022
Decision Tree     | 0.637 ±0.050  | 0.825 ±0.034  | 0.798 ±0.020  | 0.787 ±0.021  | 0.737 ±0.024  | 0.710 ±0.024
Neural Net        | 0.623 ±0.064  | 0.853 ±0.032  | 0.795 ±0.032  | 0.779 ±0.007  | 0.764 ±0.027  | 0.712 ±0.010
SVC               | 0.598 ±0.049  | 0.835 ±0.037  | 0.794 ±0.021  | 0.782 ±0.009  | 0.743 ±0.032  | 0.697 ±0.014
Naive Bayes       | 0.488 ±0.217  | 0.783 ±0.035  | 0.719 ±0.024  | 0.029 ±0.026  | 0.683 ±0.034  | 0.383 ±0.009

Table 2: Predicting a CS211 success measure—graduating with a CS degree—from only student demographics and basic course features.

Classifier        | F1            | AUC           | Acc
Gradient Boosting | 0.533 ±0.037  | 0.855 ±0.007  | 0.824 ±0.011
Random Forest     | 0.447 ±0.041  | 0.837 ±0.009  | 0.825 ±0.011
AdaBoost          | 0.563 ±0.047  | 0.839 ±0.028  | 0.823 ±0.023
Decision Tree     | 0.473 ±0.018  | 0.662 ±0.012  | 0.759 ±0.015
Neural Net        | 0.486 ±0.036  | 0.762 ±0.018  | 0.735 ±0.043
SVC               | 0.0 ±0.0      | 0.770 ±0.030  | 0.776 ±0.000
Naive Bayes       | 0.460 ±0.009  | 0.785 ±0.029  | 0.515 ±0.018

6.2 Effect of Including Evaluation Data

Tables 3, 4, and 5 assess the difference in performance between the baseline models and those that incorporated evaluation and course data. The smallest p-values are in bold. The Percentiles columns indicate that evaluation scores were converted to percentiles; the Flags columns indicate that binary flags for the top 10% of scores were used. Note that for models using Discrete IDs, we do not use numerical evaluation data, so there is no distinction between the two treatments' results.

Table 3: Experimental models' performance in predicting whether students passed CS211, versus baseline models.

Group | Experiment      | Percentiles t | Percentiles p | Flags t  | Flags p
All   | 1 Overall Eval  | -0.0319 | 0.9754 |  0.6862 | 0.5120
All   | 2 Overall Evals |  0.6796 | 0.5176 |  0.5370 | 0.6059
All   | 2 Full Evals    |  0.1160 | 0.9105 |  0.1773 | 0.8637
All   | Discrete IDs    |  0.3738 | 0.7191 |  0.3738 | 0.7191
Q1    | 1 Overall Eval  |  0.1700 | 0.8696 | -0.1004 | 0.9227
Q1    | 2 Overall Evals | -0.3412 | 0.7425 |  0.4117 | 0.6930
Q1    | 2 Full Evals    | -0.0476 | 0.9632 |  0.0834 | 0.9357
Q1    | Discrete IDs    | -0.0594 | 0.9540 | -0.0594 | 0.9541
Q2    | 1 Overall Eval  | -0.2578 | 0.8034 | -0.2765 | 0.7894
Q2    | 2 Overall Evals |  0.01077 | 0.9917 | -0.3237 | 0.7549
Q2    | 2 Full Evals    | -0.1091 | 0.9161 | -0.2489 | 0.8097
Q2    | Discrete IDs    |  0.0749 | 0.9421 |  0.0749 | 0.9421
Q3    | 1 Overall Eval  | -0.3398 | 0.7430 | -0.1586 | 0.8784
Q3    | 2 Overall Evals |  0.0113 | 0.9913 | -0.0904 | 0.9302
Q3    | 2 Full Evals    |  0.0334 | 0.9742 | -0.0587 | 0.9551
Q3    | Discrete IDs    |  0.4269 | 0.6840 |  0.4269 | 0.6840
Q4    | 1 Overall Eval  | -0.6431 | 0.5382 | -1.1410 | 0.2892
Q4    | 2 Overall Evals |  0.6485 | 0.5379 |  0.5716 | 0.5864
Q4    | 2 Full Evals    |  0.1852 | 0.8585 |  0.5546 | 0.5943
Q4    | Discrete IDs    | -0.3446 | 0.7404 | -0.3446 | 0.7404

Table 4: Experimental models' performance in predicting whether students achieved their "potential" grade in CS211, versus baseline models.

Group | Experiment      | Percentiles t | Percentiles p | Flags t  | Flags p
All   | 1 Overall Eval  | -0.1432 | 0.8898 |  0.5000 | 0.6352
All   | 2 Overall Evals |  1.1452 | 0.2863 | -0.5726 | 0.5831
All   | 2 Full Evals    |  0.4472 | 0.6675 | -0.6202 | 0.5549
All   | Discrete IDs    |  1.5110 | 0.1695 |  1.5110 | 0.1695
Q1    | 1 Overall Eval  | -0.0160 | 0.9877 | -0.1531 | 0.8822
Q1    | 2 Overall Evals |  0.2992 | 0.7728 |  0.0360 | 0.9722
Q1    | 2 Full Evals    | -0.2371 | 0.8186 |  0.0559 | 0.9569
Q1    | Discrete IDs    |  0.3335 | 0.7475 |  0.3335 | 0.7475
Q2    | 1 Overall Eval  |  0.2996 | 0.7724 | -0.0688 | 0.9469
Q2    | 2 Overall Evals |  0.7417 | 0.4804 | -0.1990 | 0.8473
Q2    | 2 Full Evals    |  0.1707 | 0.8688 |  0.2541 | 0.8061
Q2    | Discrete IDs    |  1.1216 | 0.2947 |  1.1216 | 0.2947
Q3    | 1 Overall Eval  |  0.1039 | 0.9199 |  0.1922 | 0.8526
Q3    | 2 Overall Evals |  0.5626 | 0.5894 |  0.1778 | 0.8637
Q3    | 2 Full Evals    |  0.7600 | 0.4716 | -0.2082 | 0.8403
Q3    | Discrete IDs    |  1.1218 | 0.2946 |  1.1218 | 0.2946
Q4    | 1 Overall Eval  | -0.3295 | 0.7502 | -0.4261 | 0.6815
Q4    | 2 Overall Evals |  0.2247 | 0.8279 | -0.1384 | 0.8936
Q4    | 2 Full Evals    |  0.3297 | 0.7512 | -0.5162 | 0.6203
Q4    | Discrete IDs    |  0.2808 | 0.7862 |  0.2808 | 0.7862

Table 5: Experimental models' performance in predicting retention in the CS major, versus baseline models.

Group | Experiment      | t Statistic | p-value
All   | 1 Overall Eval  | -0.6515 | 0.5334
All   | 2 Overall Evals | -0.7630 | 0.4781
All   | 2 Full Evals    |  0.2207 | 0.8318
All   | Discrete IDs    | -0.5975 | 0.5669
Q1    | 1 Overall Eval  |  0.2620 | 0.8008
Q1    | 2 Overall Evals |  0.2671 | 0.7964
Q1    | 2 Full Evals    |  0.4025 | 0.6982
Q1    | Discrete IDs    |  0.3470 | 0.7385
Q2    | 1 Overall Eval  | -0.5466 | 0.5996
Q2    | 2 Overall Evals | -0.7459 | 0.4787
Q2    | 2 Full Evals    | -0.3600 | 0.7284
Q2    | Discrete IDs    | -0.1730 | 0.8670
Q3    | 1 Overall Eval  | -0.7931 | 0.4557
Q3    | 2 Overall Evals |  0.0139 | 0.9893
Q3    | 2 Full Evals    | -0.1678 | 0.8714
Q3    | Discrete IDs    |  0.2165 | 0.8352
Q4    | 1 Overall Eval  | -0.2212 | 0.8311
Q4    | 2 Overall Evals |  0.4312 | 0.6778
Q4    | 2 Full Evals    |  0.5631 | 0.5889
Q4    | Discrete IDs    | -0.5019 | 0.6293

In all of these t-tests, our null hypothesis was that evaluations and the specific courses taken by a student do not improve student-success predictions. If this were true, results from the baseline set of data would be the same as results that included course information, because the course information would add no predictive power. None of our experiments showed a significant improvement over our baseline, so we fail to reject our null hypothesis and do not find that evaluations improve predictions of student success.
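The baseline-versus-experiment comparison above boils down to a Student's t-test over fold-wise scores. The sketch below uses placeholder F1 values, not the paper's results.

    from scipy import stats

    baseline_f1 = [0.66, 0.61, 0.65, 0.63, 0.64]      # hypothetical 5-fold F1 scores
    with_evals_f1 = [0.67, 0.62, 0.64, 0.65, 0.63]    # hypothetical 5-fold F1 scores

    t_stat, p_value = stats.ttest_ind(with_evals_f1, baseline_f1)
    print(f"t = {t_stat:.4f}, p = {p_value:.4f}")
    # A positive t with p < 0.05 would indicate that the evaluation features
    # significantly improved F1 over the baseline; the paper reports no such case.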
6.3 Fairness Across Student Quartiles

Tables 6, 7, and 8 show the fairness t-tests. These are tests of whether the performance of each experimental model is better or worse at predicting results for a specific quartile, versus predicting results for all students. They capture the statistical significance of discrepancies in performance when run on different groups of students. The null hypothesis in these tests is that there is no difference between the F1 scores for all students and those of each quartile. In other words: the null hypothesis is that the predictions are fair. The lowest p-values we found are in bold or, if they are statistically significant, are highlighted.

Table 6 shows the models' fairness in predicting whether students passed CS211.

Table 6: Fairness in experimental models' predictions of whether students passed CS211.

Group | Model           | Percentiles t | Percentiles p | Flags t  | Flags p
Q1    | Baseline        |  1.1476 | 0.2884 |  1.1476 | 0.2884
Q1    | 1 Overall Eval  |  0.9143 | 0.3899 |  0.6463 | 0.5443
Q1    | 2 Overall Evals |  1.3646 | 0.2097 |  1.2095 | 0.2610
Q1    | 2 Full Evals    |  1.1995 | 0.2669 |  1.4597 | 0.1896
Q1    | Discrete IDs    |  1.0130 | 0.3471 |  1.0130 | 0.3471
Q2    | Baseline        | -0.2019 | 0.8475 | -0.2019 | 0.8475
Q2    | 1 Overall Eval  | -0.4798 | 0.6519 | -0.2687 | 0.7957
Q2    | 2 Overall Evals | -0.3005 | 0.7746 | -0.3801 | 0.7179
Q2    | 2 Full Evals    | -0.1890 | 0.8568 | -0.4047 | 0.6965
Q2    | Discrete IDs    |  0.3129 | 0.7640 |  0.3129 | 0.7640
Q3    | Baseline        |  0.4792 | 0.6448 |  0.4792 | 0.6448
Q3    | 1 Overall Eval  |  0.3567 | 0.7316 |  0.3720 | 0.7197
Q3    | 2 Overall Evals |  0.2940 | 0.7772 | -0.0392 | 0.9700
Q3    | 2 Full Evals    |  0.5200 | 0.6172 |  0.2183 | 0.8327
Q3    | Discrete IDs    |  0.5267 | 0.6133 |  0.5267 | 0.6133
Q4    | Baseline        | -0.6228 | 0.5560 | -0.6228 | 0.5560
Q4    | 1 Overall Eval  | -0.3376 | 0.7453 | -0.6068 | 0.5612
Q4    | 2 Overall Evals | -0.3570 | 0.730423 | -1.3819 | 0.2078
Q4    | 2 Full Evals    | -0.6432 | 0.5467 | -0.5803 | 0.5793
Q4    | Discrete IDs    | -0.7653 | 0.4773 | -0.7653 | 0.4773

Table 7 shows fairness in predicting whether students achieved their potential grades. This table differs much from Table 6 in that many of the p-values listed here are significant at the 0.05 level. All of the significant results are clustered within the first and second quartiles, which are the bottom two quartiles in our groupings.
Table 7: Fairness in experimental models' predictions of whether students achieved their "potential" grade in CS211.

Group | Model           | Percentiles t | Percentiles p | Flags t  | Flags p
Q1    | Baseline        |  2.5055 | 0.0557 |  2.5055 | 0.0557
Q1    | 1 Overall Eval  |  1.9566 | 0.1005 |  1.7817 | 0.1338
Q1    | 2 Overall Evals |  1.9229 | 0.1069 |  2.9422 | 0.0278
Q1    | 2 Full Evals    |  2.0311 | 0.0960 |  2.540979 | 0.0462
Q1    | Discrete IDs    |  2.0484 | 0.0858 |  2.0484 | 0.0858
Q2    | Baseline        |  3.1270 | 0.0267 |  3.1270 | 0.0267
Q2    | 1 Overall Eval  |  2.6275 | 0.0445 |  2.8256 | 0.0371
Q2    | 2 Overall Evals |  3.2574 | 0.0162 |  4.4184 | 0.0045
Q2    | 2 Full Evals    |  4.5984 | 0.0019 |  3.2834 | 0.0196
Q2    | Discrete IDs    |  3.5205 | 0.0134 |  3.5205 | 0.0135
Q3    | Baseline        |  0.0994 | 0.9246 |  0.0994 | 0.9246
Q3    | 1 Overall Eval  |  0.4097 | 0.6943 |  0.3179 | 0.7602
Q3    | 2 Overall Evals |  0.3143 | 0.7647 | -0.1453 | 0.8895
Q3    | 2 Full Evals    |  0.2490 | 0.8111 |  0.6658 | 0.5346
Q3    | Discrete IDs    |  0.3603 | 0.7305 |  0.3603 | 0.7305
Q4    | Baseline        | -2.0964 | 0.1004 | -2.0964 | 0.1004
Q4    | 1 Overall Eval  | -2.4168 | 0.0650 | -2.4568 | 0.0627
Q4    | 2 Overall Evals | -2.5259 | 0.0556 | -2.3934 | 0.0695
Q4    | 2 Full Evals    | -2.1430 | 0.0907 | -2.3262 | 0.0736
Q4    | Discrete IDs    | -2.5112 | 0.0585 | -2.5112 | 0.0585

Table 8 shows the models' fairness in predicting whether students graduate with a CS major. While the significant t statistics in Table 7 were positive—indicating that the models perform best on the first and second quartiles—here the performance for the lower two quartiles is negative. Additionally, the t statistics are significantly better for students in the top quartile. This suggests significant fairness disparities in these prediction models: the F1 scores of prediction across the entire student body overlap strongly with the F1 scores for the third quartile, but vary widely from those of both stronger and poorer overall performers. This is in spite of the quartiles being represented in the overall dataset in equal numbers of data points.

Table 8: Fairness in experimental models' predictions of whether students graduated with a CS major.

Group | Model           | t Statistic | p-value
Q1    | Baseline        | -5.7063 | 0.0019
Q1    | 1 Overall Eval  | -3.4333 | 0.0195
Q1    | 2 Overall Evals | -4.4204 | 0.0106
Q1    | 2 Full Evals    | -6.6634 | 0.0013
Q1    | Discrete IDs    | -3.6290 | 0.0178
Q2    | Baseline        | -2.5044 | 0.0528
Q2    | 1 Overall Eval  | -2.7323 | 0.0380
Q2    | 2 Overall Evals | -4.3013 | 0.0106
Q2    | 2 Full Evals    | -3.6916 | 0.0161
Q2    | Discrete IDs    | -2.7675 | 0.0381
Q3    | Baseline        | -0.6357 | 0.5498
Q3    | 1 Overall Eval  | -1.4485 | 0.1859
Q3    | 2 Overall Evals | -0.4967 | 0.6411
Q3    | 2 Full Evals    | -1.3347 | 0.2360
Q3    | Discrete IDs    | -0.1085 | 0.9166
Q4    | Baseline        |  2.4488 | 0.0421
Q4    | 1 Overall Eval  |  3.1471 | 0.0144
Q4    | 2 Overall Evals |  4.4240 | 0.0072
Q4    | 2 Full Evals    |  3.1356 | 0.0225
Q4    | Discrete IDs    |  2.4960 | 0.0409

7 DISCUSSION

7.1 Improvements Upon the Baseline

Overall, our results show that adding evaluations to predictions does not significantly improve performance over the baseline of student demographics and basic course features. The addition of instructor gender, too, was not significant. Even when gender was included for all courses used in the predictions, the results did not improve drastically. However, there were many semesters in which no female instructors were available to teach a course at all, so there could be no direct comparisons between students with male instructors and those with female instructors. Although there are slight improvements over the baseline for some experiments, notably those involving the discrete and continuous unique IDs, they do not reach a significance level of 0.05. This suggests that any impact of student evaluation data and instructor gender on student performance is not immediately visible.

Generally, our experiments performed better when predicting whether students would achieve their "potential" grades than when predicting whether they would receive credit. When comparing p-values between passing and potential for all experiments, as in Tables 4, 5, and 6, predicting student potential seems to improve upon predicting passing even when compared to their respective baselines. We attribute this to the fluid nature of a student's forecast grade: if a student's forecast grade is a C, then there are three possible grades that the student could receive and still perform at or above his or her potential. Similarly, for students whose predicted grades are A's, there is only one possible grade, an A, with which they can achieve at or above their potential. This imbalance on both sides of the forecast grade means the models can more easily predict achieving below a potential grade, because there are generally more options on the lower end of the grade scale than on the higher end.
In addition, because the evaluations we have access to are only averages for all students in a course and do not reflect each student's personal evaluation of the professor, each student in a section of a course would have the same evaluations. This large amount of overlap between students, who then experienced different outcomes, seemed to negatively impact the models in experiments where full sets of evaluations were used. This issue was slightly assuaged with the use of unique IDs, but not at a significant level for most quartiles (see Tables 5, 6, and 7). For this reason, evaluation sets in which evaluations are unique to each student would provide an interesting contrast to this work—individualized evaluations might provide high-quality features for prediction.

7.2 Fairness

Our fairness results reach significance levels especially often in the third and fourth quartiles—see Tables 8, 9, and 10—which are the lower academic quartiles. These quartile models underperform against the models for all students, frequently significantly. This is a cause for concern: the students for which our models predict well, quartiles 1 and 2, are the quartiles in which students often already perform well. There are a few reasons that might contribute to this underperformance on the lower quartiles of students. One is that our quartiles are artificially created—although high school GPA and SAT scores are indicators of academic success in high school, they do not necessarily represent the same success in college. In addition, we split students into quartiles depending on the percentile of their averaged HS GPA and SAT scores, not on any visible clusters within the data. These artificial clusters might not represent true student groups.

8 CONCLUSION

Our data suggests there is a pressing need to understand how students of different academic calibers experience the same curricula, given the disparities in their ultimate outcomes. It also suggests that evaluations of courses, at least as they are structured in our data, do not offer significant insight into how a student will perform or whether he or she will remain in computer science. Lastly, we find that the starting course number or code may have some predictive power, suggesting that different courses may significantly impact the outcomes of students. The question now becomes one of identifying how to measure the different features of these courses.

There are several possible expansions of our methodology. As previously mentioned, our data is imbalanced, and using techniques such as cost-aware training, oversampling, undersampling, or the Synthetic Minority Oversampling Technique (SMOTE) [4] might enable a more balanced weighting of results and greater accuracy in identifying points in the minority classes. Another proposed fairness-oriented metric is the Absolute Between-ROC Area metric, which measures the absolute area between two ROC curves; in doing so, it measures disparities in prediction across every possible decision threshold, as opposed to just one [12]. Lastly, we would like to grid-search for hyperparameters that optimize F1 score and ROC AUC, rather than accuracy.
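As one example of the imbalance-handling expansions suggested above, the following sketch applies SMOTE from the third-party imbalanced-learn package to a synthetic stand-in for the roughly 22%-positive retention dataset; it is illustrative only and not part of the study.

    from imblearn.over_sampling import SMOTE            # third-party: imbalanced-learn
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    # Synthetic stand-in for a dataset with ~22% positive labels.
    X, y = make_classification(n_samples=1000, weights=[0.78, 0.22], random_state=0)

    # Oversample the minority class on the training data only, then fit a model;
    # test folds would keep their original balance so F1/ROC AUC stay comparable.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    clf = GradientBoostingClassifier(random_state=0).fit(X_res, y_res)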
In addition to course evaluations for computer science classes, we also scraped course evaluations for other classes. In the future, we hope to use this dataset to apply such retention analysis to all majors. Given that GMU has a unique student body, with many transfer students and non-traditional graduates, we would also like to include these students in a future analysis to track differences in their progressions through their majors. This also begs the question of whether our results would be different at a school with more four-year students. Although our data does not indicate that evaluations can improve predictions of student success, we are interested in the outcomes of research into this avenue at schools with differing evaluation styles, to see if these results can be improved upon. In addition, the fairness concerns raised in this paper—differing performance for students of varying academic standing—remain to be addressed. We would like to see the improvement of grade prediction techniques both for all students and for each quartile or minority demographic.

ACKNOWLEDGMENTS

This work was funded by the National Science Foundation as part of GMU's Computer Science Research Experience for Undergraduates, and graciously supported by Dr. Karen Lee and GMU's OSCAR office. We would also like to thank Drs. Mark Snyder and Huzefa Rangwala for their guidance and knowledge. Finally, we thank GMU for the use of its facilities and for this opportunity.

REFERENCES

[1] E. P. Bettinger and B. T. Long. 2005. Do Faculty Serve as Role Models? The Impact of Instructor Gender on Female Students. The American Economic Review 95, 2 (2005), 152–157. http://www.jstor.org/stable/4132808
[2] Maureen Biggers, Anne Brauer, and Tuba Yilmaz. 2008. Student Perceptions of Computer Science: A Retention Study Comparing Graduating Seniors with CS Leavers. SIGCSE Bull. 40, 1 (March 2008), 402–406. https://doi.org/10.1145/1352322.1352274
[3] John A. Centra. 2003. Will Teachers Receive Higher Student Evaluations by Giving Higher Grades and Less Course Work? Research in Higher Education 44, 5 (2003), 495–518. https://doi.org/10.1023/A:1025492407752
[4] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16 (2002), 321–357.
[5] Chantal Cherifi, Hocine Cherifi, Márton Karsai, and Mirco Musolesi (Eds.). 2018. Complex Networks & Their Applications VI: Proceedings of Complex Networks 2017 (The Sixth International Conference on Complex Networks and Their Applications). Studies in Computational Intelligence, Vol. 689. Springer International Publishing, Cham. https://doi.org/10.1007/978-3-319-72150-7
[6] Peter A. Cohen. 1981. Student Ratings of Instruction and Student Achievement: A Meta-Analysis of Multisection Validity Studies. Review of Educational Research 51, 3 (1981), 281–309. https://doi.org/10.2307/1170209
[7] Paulo Cortez and Alice Silva. 2008. Using Data Mining to Predict Secondary School Student Performance. EUROSIS (January 2008).
[8] Benjamin J. Drury, John Oliver Siy, and Sapna Cheryan. 2011. When Do Female Role Models Benefit Women? The Importance of Differentiating Recruitment From Retention in STEM. Psychological Inquiry 22, 4 (2011), 265–269. https://doi.org/10.1080/1047840X.2011.620935
[9] Office of Institutional Effectiveness and Planning. 2019. Degrees Conferred By Degree And Demographic - Year 2017-18, All Terms. http://irr2.gmu.edu/New/N_Degree/DegDegreeDetail.cfm
[10] Office of Institutional Effectiveness and Planning. 2019. Full-Time Academic Faculty Demographic Profiles Two-Year Comparisons. http://irr2.gmu.edu/New/N_Faculty/FullTimeFacComp.cfm
[11] Kenneth A. Feldman. 1996. Identifying Exemplary Teaching: Using Data from Course and Teacher Evaluations. New Directions for Teaching and Learning 1996, 65 (March 1996), 41–50. https://doi.org/10.1002/tl.37219966509
[12] Josh Gardner, Christopher Brooks, and Ryan Baker. 2019. Evaluating the Fairness of Predictive Student Models Through Slicing Analysis. In Proceedings of the 9th International Conference on Learning Analytics & Knowledge (LAK19). ACM, New York, NY, USA, 225–234. https://doi.org/10.1145/3303772.3303791
[13] Gregory Warren Bucks, Kathleen A. Ossman, Jeff Kastner, and F. James Boerio. 2015. First-year Engineering Courses' Effect on Retention and Workplace Performance. In 2015 ASEE Annual Conference & Exposition. ASEE Conferences, Seattle, Washington. https://peer.asee.org/24114
[14] Sara Morsy and George Karypis. 2017. Cumulative Knowledge-based Regression Models for Next-term Grade Prediction. In Proceedings of the 17th SIAM International Conference on Data Mining (SDM 2017), 552–560. http://www.scopus.com/inward/record.url?scp=85027876583&partnerID=8YFLogxK
[15] Laird Townsend. 1994. How Universities Successfully Retain and Graduate Black Students. The Journal of Blacks in Higher Education 4 (1994), 85–89. http://www.jstor.org/stable/2963380
[16] Brian L. Yoder. 2017. Engineering by the Numbers. Engineering College Profiles & Statistics Book (2017). http://www.asee.org/papers-and-publications/publications/college-profiles/15EngineeringbytheNumbersPart1.pdf

A PREDICTION FEATURES

Table 9 shows the features used for each of the predictions, categorized by the type of experiment being described.

Table 9: The features used for each experiment.

Feature             | Meaning                                                                                              | Baseline | Overall SET | All SET | IDs
Race                | Categorical variable; includes an option for no race listed.                                         | ✕ | ✕ | ✕ | ✕
Sex                 | Categorical variable: male, female, and no gender listed.                                            | ✕ | ✕ | ✕ | ✕
High School GPA     | Continuous variable.                                                                                 | ✕ | ✕ | ✕ | ✕
SAT Total Score     | Continuous variable, out of 1600. Empty cells are filled in with the median of the total SAT scores. | ✕ | ✕ | ✕ | ✕
SAT Verbal Score    | Continuous variable, out of 800. Empty cells are filled in with the median of the SAT verbal scores. | ✕ | ✕ | ✕ | ✕
SAT Math Score      | Continuous variable, out of 800. Empty cells are filled in with the median of the SAT math scores.   | ✕ | ✕ | ✕ | ✕
Average Percentile  | Continuous variable, between 0 and 1. Average of the HS GPA and SAT Total percentiles for each student. | ✕ | ✕ | ✕ | ✕
Class Term Taken    | Continuous variable; indicates the term in which the student took the course used for prediction and the course being predicted. | ✕ | ✕ | ✕ | ✕
Term GPA            | Continuous variable; the non-cumulative GPA for the term in which the student took the course used for prediction and the course being predicted. | ✕ | ✕ | ✕ | ✕
Instructor Gender   | Binary variable, split between male and female.                                                      |   | ✕ | ✕ | ✕
Grade Points        | Continuous variable; the grade received in the course used for predicting the second course.         | ✕ | ✕ | ✕ | ✕
Overall Evaluations | Continuous or binary, depending on the treatment of the specific test (flagging or percentiles). Defined as SET questions 15 and 16 (see Appendix B). |   | ✕ | ✕ |
All Evaluations     | Continuous or binary, depending on the treatment of the specific test (flagging or percentiles). Defined as SET questions 1 through 14 (see Appendix B). |   |   | ✕ |
Course ID           | Binary; represents the unique course taken by a student. The ID is the discipline, course number, section number, term taken, and a binary digit indicating a summer term. Students in the same course and section will all have a 1. |   |   |   | ✕
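As an illustration of how the baseline features in Table 9 might be assembled (median imputation for missing SAT scores, averaged percentiles, explicit "not listed" categories), here is a hedged sketch; the DataFrame schema and its column names are hypothetical, not the study's actual data layout.

    import pandas as pd

    def build_baseline_features(students: pd.DataFrame) -> pd.DataFrame:
        df = students.copy()
        # Empty SAT cells are filled with the median score, as described in Table 9.
        for col in ["sat_total", "sat_verbal", "sat_math"]:
            df[col] = df[col].fillna(df[col].median())
        # HS GPA and SAT Total percentiles, averaged into a single 0-1 feature.
        df["avg_percentile"] = df[["hs_gpa", "sat_total"]].rank(pct=True).mean(axis=1)
        # Categorical variables keep an explicit "not listed" category rather than
        # dropping missing values, then become one-hot columns.
        df[["race", "sex"]] = df[["race", "sex"]].fillna("not listed")
        return pd.get_dummies(df, columns=["race", "sex"])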
B GMU'S STUDENT EVALUATION OF TEACHING (SET)

Each of these items was rated on a scale of 1 to 5, with an N/A option available. Questions 15 and 16 are the "overall" evaluations used in certain experiments.

(1) Course requirements and expectations were clear.
(2) The course was well organized.
(3) The instructor helped me to better understand the course material.
(4) Feedback (written comments and suggestions on papers, solutions provided, class discussion, etc.) was helpful.
(5) The instructor showed respect for the students.
(6) The instructor was accessible either in person or electronically.
(7) The course grading policy was clear.
(8) Graded work reflected what was covered in the course.
(9) The assignments (projects, papers, presentations, etc.) helped me learn the material.
(10) The textbook and/or assigned readings helped me understand the material.
(11) Assignments and exams were returned in a reasonable amount of time.
(12) The instructor covered the important aspects of the course as outlined in the syllabus.
(13) The instructor made the class intellectually stimulating.
(14) The instructor encouraged the students to be actively involved in the material through discussion, assignments, and other activities.
(15) My overall rating of the teaching.
(16) My overall rating of this course.
C EXTENDED GRADE PREDICTION RESULTS

Table 10: Highest performing models from each of the experiments, evaluation treatments, and grade prediction styles. The best performers in each grade prediction style block are highlighted.

Type      | Experiment | Model             | Percentiles F1 | Percentiles AUC | Percentiles Acc | Top 10% Flags F1 | Top 10% Flags AUC | Top 10% Flags Acc
Passing   | 1 overall  | Gradient Boosting | 0.665 ±0.046   | 0.872 ±0.034    | 0.809 ±0.029    | 0.689 ±0.053     | 0.875 ±0.032      | 0.826 ±0.032
Passing   | 2 overall  | Gradient Boosting | 0.686 ±0.039   | 0.880 ±0.032    | 0.822 ±0.022    | 0.684 ±0.053     | 0.874 ±0.033      | 0.821 ±0.030
Passing   | 2 full     | Gradient Boosting | 0.670 ±0.056   | 0.876 ±0.034    | 0.812 ±0.027    | 0.672 ±0.054     | 0.875 ±0.032      | 0.815 ±0.029
Potential | 1 overall  | Random Forest     | 0.799 ±0.010   | 0.770 ±0.018    | 0.728 ±0.013    | 0.803 ±0.006     | 0.773 ±0.017      | 0.734 ±0.008
Potential | 2 overall  | Random Forest     | 0.808 ±0.010   | 0.787 ±0.018    | 0.741 ±0.013    | 0.796 ±0.010     | 0.773 ±0.017      | 0.725 ±0.014
Potential | 2 full     | Random Forest     | 0.804 ±0.016   | 0.788 ±0.030    | 0.739 ±0.019    | 0.796 ±0.008     | 0.771 ±0.013      | 0.722 ±0.010

The experiment that most improved upon the baseline power of prediction utilizes unique course IDs to represent individual courses taken. The results of this type of experiment are displayed in Table 11, which contains the results of the ID experiments predicting student grades. The top performing models in Table 11 outperform the baseline predictive powers in F1, AUC, and accuracy measures, and the significance of these experiments is explored in Table 4.

Table 11: Predicting passing CS211 from unique IDs for both courses and using either continuous or discrete

Classifier        | Passing F1    | Passing AUC   | Passing Acc   | Potential F1  | Potential AUC | Potential Acc
Gradient Boosting | 0.677 ±0.039  | 0.876 ±0.029  | 0.821 ±0.015  | 0.811 ±0.013  | 0.801 ±0.015  | 0.748 ±0.017
AdaBoost          | 0.666 ±0.039  | 0.858 ±0.024  | 0.810 ±0.018  | 0.794 ±0.015  | 0.788 ±0.023  | 0.736 ±0.013
Neural Net        | 0.664 ±0.044  | 0.850 ±0.031  | 0.812 ±0.016  | 0.763 ±0.025  | 0.761 ±0.031  | 0.697 ±0.033
Random Forest     | 0.639 ±0.059  | 0.878 ±0.022  | 0.814 ±0.023  | 0.811 ±0.011  | 0.804 ±0.014  | 0.745 ±0.014
SVC               | 0.633 ±0.060  | 0.849 ±0.029  | 0.815 ±0.021  | 0.772 ±0.016  | 0.749 ±0.022  | 0.697 ±0.017
Decision Tree     | 0.605 ±0.068  | 0.802 ±0.047  | 0.800 ±0.021  | 0.789 ±0.019  | 0.739 ±0.015  | 0.719 ±0.025
Naive Bayes       | 0.485 ±0.005  | 0.550 ±0.010  | 0.377 ±0.014  | 0.284 ±0.051  | 0.673 ±0.032  | 0.466 ±0.021