=Paper= {{Paper |id=Vol-1183/ncfpal_paper05 |storemode=property |title=Mining the Web to Leverage Collective Intelligence and Learn Student Preferences |pdfUrl=https://ceur-ws.org/Vol-1183/ncfpal_paper05.pdf |volume=Vol-1183 |dblpUrl=https://dblp.org/rec/conf/edm/MorettiGM14 }} ==Mining the Web to Leverage Collective Intelligence and Learn Student Preferences== https://ceur-ws.org/Vol-1183/ncfpal_paper05.pdf

Mining the Web to Leverage Collective Intelligence and
Learn Student Preferences

Antonio Moretti†, José P. González-Brenes‡, Katherine McKnight†
† Center for Educator Learning & Effectiveness
‡ Center for Digital Data, Analytics & Adaptive Learning
Research & Innovation Network, Pearson
{antonio.moretti, jose.gonzalez-brenes, kathy.mcknight}@pearson.com

ABSTRACT
University professors of conventional offline classes are often experts in their research fields, but have little training in the educational sciences. Current educational data mining techniques offer little support to them. In this paper we propose a novel algorithm, Analyzing CurrIculum Decisions (ACID), that leverages collective intelligence to model student opinions to help instructors of traditional classes. ACID mines publicly available educational websites, such as student ratings of professors and course information, and learns student opinions within a statistical framework. We demonstrate that ACID can discover patterns in learner feedback and factors that affect Computer Science instruction. Specifically, we investigate the choice of a programming language for introductory courses, the grading criteria and the posting of a publicly available online syllabus.

Keywords
offline teacher support, collective intelligence, web mining

1. INTRODUCTION
There are thousands of undergraduates in computer science programs throughout the US, roughly 24% of whom will switch majors to non-computing fields [7]. An essential component of retaining students is the quality of instruction that students receive in introductory courses [7]. While clear instruction and good pedagogy are widely acknowledged as fundamental to retention, support for instructors to improve their educational practice is often based on old data; the languages used in computer science courses evolve quickly, and old surveys lose their usefulness. In this paper, we develop a data mining technique that will help provide insight into learner feedback which can be translated into changes that affect course quality. In general, our approach is similar to large scale surveys that attempt to be representative of student populations. The benefits of our approach are that it is rapid and inexpensive due to its use of publicly available information on the Web.

The field of educational data mining has been cultivating a strong interest in creating technologies to mine data collected from sophisticated online systems such as intelligent tutoring systems, virtual learning environments, and recently from Massive Open Online Courses (MOOCs). The merits of these complex online systems have been demonstrated empirically [2, 8] with controlled studies. MOOCs are a powerful resource that allows educators to study student behavior and social learning in a controlled environment; however, the scope of the impact of such technologies is limited. For example, a recent survey of active MOOC users in 200 countries and territories revealed that an overwhelming majority of students on these courses correspond to the most educated elite of their respective countries [3]. It is clear that improving basic education worldwide is necessary before MOOCs can deliver their promise. Moreover, because most education still happens offline, it is important to provide educational technologies that can utilize the power of the internet to understand student behavior and to deliver these technologies to traditional offline classes. It is not clear how existing educational data mining technologies can help bridge this divide.

We discuss the Analyzing CurrIculum Decisions (ACID) [11] methodology, which has previously been presented and applied briefly. In this paper we elaborate on both our methodology and statistical model and expand upon our results. ACID is an algorithm that leverages collective intelligence within a statistical framework. ACID supports the decisions of instructors of traditional offline courses by extracting teaching syllabi data from the web, and using crowd-sourcing to pair it up with students' course ratings, comments and sentiment to analyze the relationship between the two.

This paper reports a case study of using the ACID methodology to explore three questions that instructors of computer science courses face when designing their courses. In addition we discuss ACID's heuristic value within a larger educational framework. We address the following questions:

1. What course activities and grading rubric correlate with clear instruction? The question of how to design a grading rubric and weight course activities determines what students focus on within a course. It is important for instructors to optimize course activities and grading criteria with respect to the student experience.

2. For introductory classes, which programming language(s) correlate with clear instruction? Academics and industry professionals disagree as to the programming language that is best suited for beginners [16]. For example, some argue that introductory courses should use interpreted languages that allow for a faster understanding of the applications of programming rather than compiled languages that rely heavily on language-specific syntax. Others believe that developing skill with compiled languages is necessary for future work in computer science. The choice of a first programming language likely affects students' decision to continue education within the field of computer science.

3. Are students more interested in courses with publicly available online syllabi? The choice to make a syllabus publicly available adds to the information available to prospective students on the Web. We hypothesize that the posting of an online syllabus can be used as a proxy for factors including instructor organization and motivation, and that students will both be more interested in and prefer these courses.

The rest of this paper is organized as follows. § 2 explains the ACID methodology; § 3 describes three case studies of evaluating teaching decisions using ACID; § 4 relates to prior work; § 5 concludes.

2. ANALYZING CURRICULUM DECISIONS
Pseudocode for the ACID methodology is presented in Algorithm 1. For a given number of reviews, we sample n universities, remove the non-English speaking universities, scrape and parse the relevant reviews from a ratings website and retain ratings rated by more than a given number of students. We then extract information from these courses using crowd-sourcing, and analyze the data. We describe the process in detail below.

Algorithm 1 ACID pseudocode
Input: n universities to analyze, z reviews to analyze
procedure ACID
    while |R| < z do
        s ← sample of n universities
        s ← Remove non-English speaking universities
        R ← Search The Web For Reviews(s)
        R ← ratings rated by more than a given number of students
    Q ← CrowdSource Questionnaire(R)
    Analyze Data(Q)

To evaluate the relative impact of different course features, we mine the web for data that reflect:

• Curriculum decisions. University professors often upload information about their classes. This information is targeted towards prospective or enrolled students. It includes syllabi with detailed descriptions of course material such as textbooks, projects, homework and exams. We make use of this data to infer teaching strategies.

• Student perceptions of the course. We make use of self-selected student evaluations collected from a third-party website. The validity and usefulness of self-selected online rating systems have been assessed in the literature [1, 12]. For example, evidence suggests that online ratings do not lead to substantially more biased ratings than those done in a traditional classroom setting [1] and that online ratings are a proxy to measure student learning [12]: student learning can often be modeled as a latent variable that causes patterns of observed faculty ratings. Researchers hypothesize a non-linear or concave relationship between student learning and the perceived difficulty level of a course [12]; students learn most when a course is not too difficult or too easy. Our work relies on self-selected ratings as a metric to study learner opinion.

We use publicly available self-selected ratings of professors from a third-party website, Rate My Professor¹ (RMP). This site allows students to rate the professors of the courses they have taken. The database contains data from over 13 million ratings for 1.5 million professors. The site collects ratings on a 1 to 5 scale (1 being the lowest possible score and 5 the highest) under the categories of "easiness", "helpfulness" and "clarity." Additionally, students may fill out an "interest" field in which they indicate how appealing the class was before enrolling, and a 350 character summary of their class experience. We focus on perceived clarity because of the direct link between clarity and quality of instruction.

¹ ratemyprofessor.com

For the purposes of this paper, we focus on Computer Science courses due to our familiarity with the content. Since we do not have access to the ratings database, we develop a process to sample data from the website. For this, we first select a random sample of 50 international universities that teach Computer Science from the Academic Ranking of World Universities² [14]. From this sample we only consider the 41 universities that are English speaking.

² Academic Ranking of World Universities is also known as Shanghai Ranking, shanghairanking.com

We find, scrape and parse the reviews of the ratings data-set for all professors within the computer science departments of the universities in our sample. We remove the ratings from faculty that were rated by fewer than 30 students. More than one professor can teach the same course. For our analysis, we treat one course listing taught by two different professors as two separate courses. Table 1 shows the mean, standard deviation and median of the ratings in our sample. Figure 1 shows two sample ratings for one professor from our sample. The professor name and course names are removed for privacy.

Figure 1: Two Examples from the Ratings Sample

Table 1: Statistics for the Ratings Sample
            Easiness  Helpfulness  Clarity  Interest
Mean        2.84      3.30         3.24     3.35
Std. Dev.   1.33      1.62         1.59     4.00
Median      3.00      4.00         4.00     1.38

We use Amazon Mechanical Turk, a crowdsourcing platform, to find course features for each of the courses in our ratings sample. We do this by asking respondents to fill out a survey. The survey asks respondents to provide the URL of the online syllabus that corresponds to the course and professor for which we have ratings, choosing the syllabus closest in date to the student review. Then, using the syllabus, respondents are asked to provide the programming language(s) used, the textbook(s) used, the percentage of the grade that was determined by homework, projects, quizzes and exams, and whether the course was taught online or in a blended format (both face-to-face and online). However, when we reviewed the responses to the blended format question, it appeared that most syllabi did not provide enough information to support an accurate response.

From our original sample of 1,112 courses taught by a unique professor, respondents find an online syllabus matching the professor for 342 courses (∼31%). We hypothesize three explanations for the missing syllabi: (i) the syllabi may be accessible only with a password through a course management system, such as Blackboard, (ii) the syllabi may not be available online, or (iii) the respondents are not able to find the syllabi.

3. DATA ANALYSIS: WHAT MAKES A BETTER CLASS?
We report our results of applying the ACID methodology to evaluate teaching decisions. In § 3.1 we assess the quality of the data collected by the crowd-sourcing platform. In § 3.2 we discuss the statistical model we use. In § 3.3 we report the results of using ACID.

3.1 Data Quality
We now report how we attempt to collect high-quality data through the use of crowd-sourcing and how we assess the quality of our data.

Mechanical Turk provides a "master" qualification level to respondents that are more reliable. Masters-level respondents require higher compensation for crowd-sourcing tasks than non-Masters level respondents, although their "acceptance rate," or proportion of approved tasks, is much higher. We ran a preliminary experiment to decide whether respondents with the master-level qualification provide better quality data for our purposes. We ask respondents to find the syllabus corresponding to a random sample of 30 courses and to answer a set of questions. Table 2 shows the accuracy and interrater agreement of Masters and non-Masters level respondents.

Table 2: Respondent Validation
              Accuracy  Interrater Agreement
Masters       100%      96.67%
non-Masters   85.56%    6.07%

In the pretest we used a screening question to evaluate the accuracy of respondents' data on each task. We asked respondents to find the URL of the website of a randomly selected faculty member at Carnegie Mellon University from a set of 8, for which we knew the answer. We compared the URL they provided with the correct URL to assess accuracy. Of the 13 responses of non-Masters workers that did not provide an exact URL match, five responses left the validation question blank. We found that respondents with the master-level qualification were significantly more accurate (i.e. answered the validation item correctly) than the non-Masters level respondents (p-value = 0.0002).

Additionally, we tested interrater agreement by asking 3 respondents to carry out the same task, i.e. finding the same URL (for a total of 3x30 or 90 tasks). We used a dummy variable to code whether the three respondents provided the same URL for the course syllabus. Our measure of agreement is calculated by taking the proportion of total responses in which all three respondents provide the same URL. Masters-level respondents agreed (i.e. all three provided the same URL) 100% of the time, whereas the non-Masters level respondents performed much worse: only 6% agreed. As a result of these comparisons, we decided to hire only Masters-level respondents to complete the crowdsourcing experiment.

After collecting the data using Masters-level respondents, we performed a post-hoc analysis by examining the responses to the screening question. From the final group of 342 responses that provided a link to an online syllabus, 325 responses (95.03%) provided the correct URL for the faculty website. It should be noted that 13 of the 17 responses that did not provide an exact URL match provided the website for a different faculty member from the set of 8, suggesting that they copied and pasted their previous response without checking to see that the prompt had changed for the new response. Two of the 17 responses provided a link to the directory website for the faculty member rather than the faculty member's personal website. One response provided the correct faculty member's website within the department of Statistics rather than the department of Computer Science (the faculty member is in both departments).

3.2 Model
We describe our general linear mixed model. We provide descriptive statistics and model selection criteria.

We explore the relationship between student reviews and features collected from online syllabus data using general linear mixed modeling. Student reviews are organized at three levels: by university, professor and course. It is important to note the non-independence of the student reviews due to the hierarchical or clustered nature of the data. We suspect that student ratings within each course, professor and perhaps university are correlated. We begin by estimating the amount of variance attributed to each of these three levels. The simplest multilevel model does not yet include explanatory variables:

$$y_{i,j} = \beta_0 + u_{0,j} + \epsilon_{i,j} \tag{1}$$

The dependent variable $y_{i,j}$ is the clarity rating that student $i$ gave to level $j$. The term $\beta_0$ represents the intercept or mean student clarity rating across all observations. The term $u_{0,j}$ represents the random effect for level $j$, i.e. how far its mean clarity rating departs from the overall mean. The term $\epsilon_{i,j}$ represents the error attributed to student rating $i$ at level $j$. For comparison we fit a null or single-level model:

$$y_{i,j} = \beta_0 + \epsilon_{i,j} \tag{2}$$

We calculate the percentage of variation in the data set that is separately attributed to each of the three levels of the data. Conventionally the variance partition coefficient (VPC) and intraclass correlation coefficient (ICC) can be interpreted similarly to an R-squared term and are reported in Table 3.

$$\rho = 1 - \frac{\sigma_e^2}{\sigma_e^2 + \sigma_u^2} \tag{3}$$

Table 3: VPC and ICC Statistics
       University  Professor  Course
VPC    0.0646      0.3365     0.2355
ICC    0.0728      0.3425     0.1982

The VPC and ICC are denoted by $\rho$, the residual variance is denoted by $\sigma_e^2$ and the variance of the effect is denoted by $\sigma_u^2$. The ICC is a statistic that is similar to the VPC. However, since the parameter values of the within and between level variance are estimated using sample data, there may be bias due to sampling variation, particularly when there are fewer observations within a given level. The ICC as described by Bartko [1] corrects for this bias by making a small computational adjustment.³ Observe that the ICC term appears to give slightly less weight to the course effect. It is clear from both statistics that the main effect is the professor effect.

We examine the professor-level residuals and their associated standard errors to look for variation in clarity ratings across professors. The caterpillar plot in Figure 2 displays the professor residuals in rank order together with 95% confidence intervals. Wider intervals occur for professors with fewer student reviews. Observe that the majority of the intervals do not overlap and thus there are significant differences between professors. The blue circles on the far left represent professors who are rated two standard deviations below the mean clarity rating, whereas those on the far right are 1.5 standard deviations higher than the mean clarity rating. The red horizontal line refers to the "average" professor.

Figure 2: 95% CI for Professor Residual Error (caterpillar plot titled "Professor Residual Standard Errors"; x-axis: professor rank; y-axis: conditional modes of residual error)

We calculate a Chi-squared likelihood ratio statistic by taking twice the difference between the log-likelihood values of two successive models. We begin by comparing the null model and the course level model to assess the significance of including the course effect. We continue by adding each of the additional effects. We do not report the values of the test statistic, although all additional levels of complexity are statistically significant. We consider the Bayesian information criterion (BIC) and Akaike information criterion (AIC) as model selection tools to avoid over-fitting the data. The BIC and AIC penalize the log-likelihood of a model for the inclusion of extra parameters. The parameters are estimated using restricted maximum likelihood estimation (REML).

We choose the model with the minimum BIC. A two-level mixed model including course effect and professor effect provides the optimal Bayesian information criterion value. Two and three way interaction effects were considered, although they did not decrease the AIC or BIC of any of the models. While the log-likelihood value is maximized by including the university effect, a simpler model is preferable because it involves fewer parameter estimates and is more likely to generalize. The model can be written in matrix form:

$$Y = X\beta + Z\nu + \epsilon \tag{4}$$

$Y$ denotes the response variable observations (student ratings). $\beta$ is a vector of fixed-effects parameters with design matrix $X$. $Z$ is a design matrix of indicator variables denoting group membership across random-effect levels and $\nu$ is a vector containing random-effect parameters. $\epsilon$ is a vector of error terms.
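
The quantities used for model comparison in this section can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the function names are hypothetical, the likelihood ratio statistic and the AIC and BIC follow their standard textbook definitions (the paper does not spell out its exact computation), and any variance components or log-likelihoods passed in would have to come from a fitted mixed model.

```python
import math


def variance_partition(sigma_u2: float, sigma_e2: float) -> float:
    """Share of variance attributed to a grouping level, as in Equation 3."""
    return 1.0 - sigma_e2 / (sigma_e2 + sigma_u2)


def likelihood_ratio(loglik_simple: float, loglik_complex: float) -> float:
    """Chi-squared likelihood ratio statistic for two nested models."""
    return 2.0 * (loglik_complex - loglik_simple)


def aic(loglik: float, n_params: int) -> float:
    """Akaike information criterion: penalizes each extra parameter by 2."""
    return 2.0 * n_params - 2.0 * loglik


def bic(loglik: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: penalty grows with log(sample size)."""
    return n_params * math.log(n_obs) - 2.0 * loglik
```

Because the BIC penalty per parameter is log(n) rather than 2, it favors smaller models than the AIC once n exceeds about 8 observations, which is consistent with choosing the minimum-BIC model when a simpler model is preferred.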

³ For a description of the computation of the ICC, see the documentation and source code for the R library lme.

3.3 Case Studies
We show the results of using the ACID methodology to answer three course design questions.

3.3.1 For introductory classes, which programming language do students associate with clear instruction?
Professors teaching introductory level courses in computer science choose between a number of programming languages and textbooks. We make use of the data collected to provide insights into which programming languages beginning students associate with clear instruction. We filter the data to only include introductory level courses (those which do not require any prerequisite coursework in computer science). Our restricted sample includes 1,024 reviews; 34.58% of all reviews with syllabus data are of introductory courses. We explore the relationship between clarity ratings and programming language with random professor and course effects. Programming languages with fewer than 30 student reviews are not reported⁴. Table 4 gives the estimates for student ratings of clarity by programming language and their associated p-values. An intercept is not modeled in order to make the results easily interpretable. The mean clarity rating for introductory courses is 3.599.

⁴ SQL is a special purpose programming language used only for relational databases and is not reported.

Table 4: Programming Language Statistics
          Value  Std.Err  t-value  Pr(>|t|)  n
C         3.38   0.32     10.58    0.0000    109
C++       3.30   0.31     10.65    0.0000    214
Java      3.62   0.19     19.33    0.0000    353
Python    3.70   0.26     14.50    0.0000    133
Scheme    4.06   0.47     8.61     0.0000    32
Scratch   3.91   0.84     4.67     0.0000    49

We found C and C++ had the lowest coefficients (i.e. compiled languages had the lowest perceived clarity ratings). Scheme and Scratch have the highest clarity ratings followed by Python and Java. We note that the standard errors are largest for Scheme and Scratch and smallest for Java and Python. This suggests that results for Java and Python are stronger. Students in our sample associate clearer instruction with interpreted languages rather than compiled languages. Also, both Python and Java are associated with clearer instruction than C or C++.

3.3.2 What mix of course activities (exams, quizzes, homework and projects) do students associate with clear instruction?
To assess students' course ratings of clarity based on the percentage of the grade due to exams, quizzes, homework and projects, we created a factor made up of four clusters representing four ways of weighting homework, projects, exams, quizzes and miscellaneous (such as extra credit) for the students' grade.

Figure 3: Information Criterion (plot titled "Optimizing the Number of Clusters"; x-axis: number of clusters; y-axis: information criterion; series: Bayesian Information Criterion and Akaike Information Criterion)

Figure 4: Log Likelihood (plot titled "Optimizing the Number of Clusters"; x-axis: number of clusters; y-axis: log-likelihood)

Table 5: Cluster Statistics
           HW     Projects  Exams  Quizzes  Other
Cluster1   18.11  2.36      76.66  0.61     2.25
Cluster2   20.59  7.90      48.90  12.46    10.15
Cluster3   7.00   40.18     46.23  3.51     3.08
Cluster4   42.93  0.76      54.61  0.70     2.00

Table 6: Grading Criteria Statistics
             Clarity  Std.Err  t-value  Pr(>|t|)  n
Exam Heavy   3.23     0.12     26.91    0         726
Equal Mix    3.52     0.14     26.04    0         484
Exam Proj    3.65     0.13     27.76    0         610
Exam HW      3.12     0.13     23.53    0         415

We begin by filtering the data to include only observations in which the grading criteria (percentage of the grade determined by homework, projects, exams, quizzes and miscellaneous) are available and sum to 100. Of the 2,935 observations with syllabus data, there are 2,225 observations with full grading criteria. The difference in these numbers represents 710 ratings for which the respondents were not able to find a complete grade breakdown from the online syllabus.
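
The completeness filter on grading criteria described in Section 3.3.2 could look like the following minimal sketch. The field names, the tolerance and the sample rows are all hypothetical and chosen only for illustration; the paper does not describe how its data were stored or how rounding was handled.

```python
# Hypothetical field names for the five grade components.
CATEGORIES = ("homework", "projects", "exams", "quizzes", "other")


def has_full_grading_criteria(row: dict, tolerance: float = 1e-6) -> bool:
    """True when every component is present and the weights sum to 100."""
    weights = [row.get(name) for name in CATEGORIES]
    if any(w is None for w in weights):
        return False  # the respondent could not find a complete breakdown
    return abs(sum(weights) - 100.0) <= tolerance


# Made-up crowd-sourced rows: one complete, one with a missing component,
# and one whose weights do not sum to 100.
rows = [
    {"homework": 20, "projects": 30, "exams": 50, "quizzes": 0, "other": 0},
    {"homework": 20, "projects": None, "exams": 50, "quizzes": 10, "other": 0},
    {"homework": 25, "projects": 25, "exams": 40, "quizzes": 0, "other": 0},
]
complete = [r for r in rows if has_full_grading_criteria(r)]
```

Only rows that pass this check would be carried forward to the analysis of grading criteria.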

We use k-means clustering to partition the 2,225 observations with complete grading criteria information based on the five aforementioned variables. We optimize k, our number of clusters, by examining how the BIC and AIC of the mixture model change based on the number of clusters selected. Figure 3 displays the information criteria and Figure 4 displays the log-likelihood values for each number of clusters. A solution involving two clusters minimizes the BIC of the model, whereas a four cluster solution minimizes the AIC. The log-likelihood is optimized with the four cluster solution. We consider both two and four cluster models as optimal and we find that they lend themselves to similar interpretation. The cluster means for the four cluster solution are presented in Table 5.

The first cluster represents courses that are heavily weighted towards exams with a smaller weight towards homework. The second cluster represents a more even weighting of exams, homework, projects and quizzes. The third cluster represents an equal weighting towards exams and projects. The fourth cluster represents courses that are heavily weighted towards exams and homework. The cluster membership is treated as a predictor variable and modeled using Equation 4. Table 6 displays the estimated clarity ratings within each group for the four cluster solution.

The exams and projects cluster has the highest estimate of clarity. We find that weighting projects equally with exams is associated with a clearer course experience. The equal mix cluster is also associated with higher clarity estimates. The exam heavy cluster and the exam and homework heavy cluster are associated with lower student clarity ratings. We find that a rubric that weights exams and projects evenly has higher perceived clarity ratings than a rubric which is weighted heavily towards exams and homework. This result extends to both two and four cluster solutions.

3.3.3 Does the posting of a syllabus online translate into higher ratings?
We hypothesize that the posting of the syllabus online is a proxy for organization, and perhaps for motivation or drive, of the professor. We make use of all of the data collected to compare student reviews of professors who have a publicly available syllabus and of those who do not. Many professors may choose to only post a syllabus through course management systems that require a password. Potential students of these courses are unable to access the syllabus to determine whether the course would be a good fit. We treat the posting of an online syllabus as a factor and test for differences in clarity ratings between the two groups using our model.

We find statistically significant differences in clarity, helpfulness and interest ratings between the two groups and report the clarity estimates in Table 7. We note that the difference in easiness ratings is not statistically significant. We find evidence that students are more interested in professors and courses in which the syllabus is made publicly available. We note that the parameter estimates for the two groups are within one standard error of one another, which suggests that the conclusions are modest.

Table 7: Online Syllabi
            Clarity  Std.Err  t-value  Pr(>|t|)  n
Available   3.33     0.07     44.48    0         2953
Not Found   3.26     0.07     46.03    0         7702

4. RELATION TO PRIOR WORK
Research has recently focused on online faculty ratings with mixed conclusions. Felton et al. [4] found that online instructor ratings were associated with perceived easiness, and that a "halo effect" existed in which raters gave high scores to instructors perhaps because their courses were easier. We find that student ratings of clarity and easiness are correlated (ρ=0.45), although not as strongly as clarity and helpfulness, which are highly correlated (ρ=0.84). We chose to focus on clarity ratings as we assumed these were less susceptible to a "halo effect" and other bias relative to the overall ratings of a course or professor. Otto et al. [13] found issues related to bias in online ratings, stating that online ratings are characterized by selection bias as anyone can enter faculty ratings at any time. Carini et al. [1], Hardy [5], and McGhee and Lowell [6] had contradictory results, finding that an online format did not lead to more biased ratings. Otto et al. [12] hypothesized that instructor clarity and helpfulness as captured by Rate My Professor are more positively associated with student learning than easiness.

Several approaches have been proposed to synthesize responses using crowd-sourcing systems such as Amazon's Mechanical Turk. Majority voting is perhaps the simplest way to combine crowd responses, using equal weights irrespective of respondent experience. The results of our preliminary analysis in assessing the accuracy of non-Masters level respondents correspond to the steep drop in respondent accuracy noted by Karger [9] when low-quality respondents are present. Whitehill et al. [15] proposed a probabilistic model for combining crowd responses called Generative model of Labels, Abilities and Difficulties (GLAD). The GLAD methodology makes use of the EM algorithm to calculate parameter estimates of unobserved variables, including an approximation of the expertise of the rater. Khattak and Salleb-Aouissi compared the accuracy and percentage of bad responses using majority voting, probabilistic models, and their novel approach entitled Expert Label Injected Crowd Estimation (ELICE) [10]. ELICE makes use of a few "ground truth" responses and incorporates expertise of the labeler, difficulty of the instance and an aggregation of labels. Khattak and Salleb-Aouissi found that their approach was robust and outperformed GLAD and iterative methods even when bad labelers were present. Our simple approach was to use Masters-level respondents from Mechanical Turk, although GLAD and ELICE are alternative methods to reduce the number of expert level respondents required while also obtaining high quality data.

5. CONCLUSIONS, LIMITATIONS AND FUTURE WORK
We demonstrate how the Analyzing CurrIculum Decisions (ACID) methodology can be used to leverage collective intelligence and learn student preferences. In introductory
computer science courses, we find that students who are taught interpreted languages find their classes clearer. We also find that students who are given an even weighting of exams and projects find their classes clearer, and that interest in a course corresponds to the availability of an online syllabus. Our study does not necessarily suggest that teachers should change their programming language. Further research is needed before drawing causal inferences. We argue that ACID is a beneficial tool to discover patterns in student behavior. Syllabus data and course ratings data are becoming increasingly available on the Web. This data is used by millions of students and is worthy of further research.

This study can be expanded in several ways. Student evaluations often include free form text where students can describe their experience in the course. Sentiment analysis is a probabilistic approach for categorizing student comments as being either positive or negative. One extension is to regress text sentiment on course features. There is arguably a strong association between comment sentiment and student preference. ACID can also be applied to disciplines other than computer science, or used to discover patterns in syllabi across disciplines that can provide insight into learner experiences.

6. REFERENCES
[1] R. Carini, J. Hayek, G. Kuh, J. Kennedy, and J. Ouimet. College student responses to web and paper surveys: does mode matter? Research in Higher Education, 44(1):1–19, 2003.
[2] A. Corbett. Cognitive computer tutors: Solving the two-sigma problem. In M. Bauer, P. Gmytrasiewicz, and J. Vassileva, editors, User Modeling 2001, volume 2109 of Lecture Notes in Computer Science, pages 137–147. Springer Berlin Heidelberg, 2001.
[3] E. J. Emanuel. Online education: Moocs taken by educated few. Nature, 503(7476):342–342, 2013.
[4] J. Felton and J. Mitchell. Web based student evaluations of professors: the relations between perceived quality, easiness and sexiness. Assessment and Evaluation in Higher Education, 29(1):91–108, 2004.
[5] N. Hardy. Online ratings: fact and fiction. New Directions for Teaching and Learning, (96):31–38, 2003.
[6] D. E. McGhee and N. Lowell. Psychometric properties of student ratings of instruction in online and on-campus courses. New Directions for Teaching and Learning, 2003(96):39–48, 2003.
[7] M. Haungs, C. Clark, J. Clements, and D. Janzen. Improving first-year success and retention through internet-based cs0 courses. ACM SIGCSE, pages 549–594, 2012.
[8] S. Jaggars and T. Bailey. Effectiveness of fully online courses for college students: Response to a department of education meta-analysis. Teachers College: Community College Research Center, 2010.
[9] S. Karger, D. Oh and D. Shah. Budget-optimal task allocation for reliable crowdsourcing systems. CoRR, arXiv:1110.3564, 2011.
[10] F. Khattak and A. Salleb-Aouissi. Robust crowd labeling using little experience. Discovery Science, 8140:94–109, 2013.
[11] A. Moretti, J. Gonzalez-Brenes, and K. McKnight. Towards data-driven curriculum design: Mining the web to make better teaching decisions. EDM, 2014.
[12] J. Otto, D. A. Sanford Jr, and D. N. Ross. Does ratemyprofessor.com really rate my professor? Assessment & Evaluation in Higher Education, 33(4):355–368, 2008.
[13] J. Otto, D. A. Sanford Jr, and W. Wagner. Analysis of online student ratings of university faculty. Journal of College Teaching & Learning, 2(7):25–30, 2005.
[14] Shanghai. Academic ranking of world universities. Retrieved from http://www.shanghairanking.com/, Accessed at 2013 12 01.
[15] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. Neural Information Processing Systems, pages 2035–2043, 2009.
[16] J. Zelle. Python as a first language. Retrieved from http://mcsp.wartburg.edu/zelle/python/python-first.html/, Accessed at 2014 02 23.

APPENDIX
A. SAMPLE OF UNIVERSITIES SELECTED
University                         Country  n Professors  n Courses  n Reviews
Colorado State                     USA      1             9          32
Carnegie Mellon University         USA      3             21         102
North Carolina State               USA      2             10         63
Pennsylvania State                 USA      12            74         938
Rensselaer Polytechnic Institute   USA      3             22         131
Rutgers                            USA      8             30         468
Simon Fraser                       Canada   27            98         1873
SUNY Stony Brook                   USA      8             55         505
UC Davis                           USA      10            44         589
UNC Chapel Hill                    USA      1             4          49
University of Alberta              Canada   2             6          69
University of Arizona              USA      3             13         158
University of Delaware             USA      15            56         806
University of Florida Gainesville  USA      5             36         321
University of Illinois at Urbana   USA      5             14         339
University of Massachusetts        USA      6             39         405
University of Montreal             Canada   1             6          59
University of Toronto              Canada   14            66         775
University of Utah                 USA      2             17         66
University of Virginia             USA      3             19         131
University of Waterloo             Canada   46            125        2700
Vanderbilt University              USA      2             10         76