=Paper= {{Paper |id=Vol-1183/ncfpal_paper05 |storemode=property |title=Mining the Web to Leverage Collective Intelligence and Learn Student Preferences |pdfUrl=https://ceur-ws.org/Vol-1183/ncfpal_paper05.pdf |volume=Vol-1183 |dblpUrl=https://dblp.org/rec/conf/edm/MorettiGM14 }} ==Mining the Web to Leverage Collective Intelligence and Learn Student Preferences== https://ceur-ws.org/Vol-1183/ncfpal_paper05.pdf
     Mining the Web to Leverage Collective Intelligence and
                  Learn Student Preferences

Antonio Moretti†, José P. González-Brenes*, Katherine McKnight†
† Center for Educator Learning & Effectiveness
* Center for Digital Data, Analytics & Adaptive Learning
Research & Innovation Network, Pearson
{antonio.moretti, jose.gonzalez-brenes, kathy.mcknight}@pearson.com

ABSTRACT
University professors of conventional offline classes are often experts in their research fields, but have little training in the educational sciences. Current educational data mining techniques offer little support to them. In this paper we propose a novel algorithm, Analyzing CurrIculum Decisions (ACID), that leverages collective intelligence to model student opinions to help instructors of traditional classes. ACID mines publicly available educational websites, such as student ratings of professors and course information, and learns student opinions within a statistical framework. We use ACID to discover patterns in learner feedback and factors that affect Computer Science instruction. Specifically, we investigate the choice of a programming language for introductory courses, the grading criteria and the posting of a publicly available online syllabus.

Keywords
offline teacher support, collective intelligence, web mining

1. INTRODUCTION
There are thousands of undergraduates in computer science programs throughout the US, roughly 24% of whom will switch majors to non-computing fields [7]. An essential component of retaining students is the quality of instruction that students receive in introductory courses [7]. While clear instruction and good pedagogy are widely acknowledged as fundamental to retention, supports for instructors to improve their educational practice are often based on old data; the languages used in computer science courses evolve quickly, and old surveys lose their usefulness. In this paper, we develop a data mining technique that helps provide insight into learner feedback that can be translated into changes that affect course quality. In general, our approach is similar to large scale surveys that attempt to be representative of student populations. The benefits of our approach are that it is rapid and inexpensive due to its use of publicly available information on the Web.

The field of educational data mining has been cultivating a strong interest in creating technologies to mine data collected from sophisticated online systems such as intelligent tutoring systems, virtual learning environments, and, recently, Massive Open Online Courses (MOOCs). The merits of these complex online systems have been demonstrated empirically [2, 8] with controlled studies. MOOCs are a powerful resource that allows educators to study student behavior and social learning in a controlled environment; however, the scope of the impact of such technologies is limited. For example, a recent survey of active MOOC users in 200 countries and territories revealed that an overwhelming majority of students on these courses belong to the most educated elite of their respective countries [3]. It is clear that improving basic education worldwide is necessary before MOOCs can deliver on their promise. Moreover, because most education still happens offline, it is important to provide educational technologies that can use the power of the Internet to understand student behavior and to deliver these technologies to traditional offline classes. It is not clear how existing educational data mining technologies can help bridge this divide.

We discuss the Analyzing CurrIculum Decisions (ACID) methodology [11], which has previously been presented and applied only briefly. In this paper we elaborate on both our methodology and statistical model and expand upon our results. ACID is an algorithm that leverages collective intelligence within a statistical framework. ACID supports the decisions of instructors of traditional offline courses by extracting teaching syllabi data from the web and using crowd-sourcing to pair it with students’ course ratings, comments and sentiment, so that the relationship between the two can be analyzed.

This paper reports a case study of using the ACID methodology to explore three questions that instructors of computer science courses face when designing their courses. In addition we discuss ACID’s heuristic value within a larger educational framework. We address the following questions:

1. What course activities and grading rubric correlate with clear instruction? The question of how to design a grading rubric and weight course activities determines what students focus on within a course. It is important for instructors to optimize course activities and grading criteria with respect to the student experience.
Algorithm 1 ACID pseudocode
Input: n universities to analyze, z reviews to analyze
procedure ACID
    while |R| < z do
        s ← sample of n universities
        s ← Remove non-English speaking universities(s)
        R ← Search The Web For Reviews(s)
        R ← ratings in R of professors rated by more than a given number of students
    end while
    Q ← CrowdSource Questionnaire(R)
    Analyze Data(Q)
end procedure

Figure 1: Two Examples from the Ratings Sample
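The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the dictionary layout of a university record (`name`, `english_speaking`), the `min_raters` threshold parameter and the `max_rounds` guard are all hypothetical, and the web search, crowd-sourcing and analysis steps are passed in as callables because the paper does not specify them at this level.

```python
import random
from typing import Callable, Dict, List

Review = dict  # hypothetical record for one student rating, e.g. {"clarity": 4}


def collect_reviews(
    universities: List[dict],
    n: int,
    z: int,
    min_raters: int,
    search_reviews: Callable[[List[dict]], Dict[str, List[Review]]],
    rng: random.Random,
    max_rounds: int = 50,
) -> List[Review]:
    """Sampling loop of Algorithm 1: resample until at least z reviews are kept."""
    reviews: List[Review] = []
    for _ in range(max_rounds):  # guard added for this sketch; not in Algorithm 1
        if len(reviews) >= z:  # corresponds to "while |R| < z"
            break
        # s <- sample of n universities
        sample = rng.sample(universities, min(n, len(universities)))
        # s <- remove non-English speaking universities
        sample = [u for u in sample if u["english_speaking"]]
        # R <- search the web for reviews (assumed to return professor -> reviews)
        by_professor = search_reviews(sample)
        # R <- keep reviews of professors rated by more than min_raters students
        reviews = [
            review
            for professor_reviews in by_professor.values()
            if len(professor_reviews) > min_raters
            for review in professor_reviews
        ]
    return reviews


def acid(universities, n, z, min_raters, search_reviews, crowdsource, analyze, rng):
    """Full pipeline: collect reviews, crowd-source syllabus features, analyze."""
    reviews = collect_reviews(universities, n, z, min_raters, search_reviews, rng)
    questionnaire = crowdsource(reviews)  # Q <- CrowdSource Questionnaire(R)
    return analyze(questionnaire)  # Analyze Data(Q)
```

Passing the three external steps as callables keeps the sketch runnable without network access; in practice each would wrap a scraper, a crowd-sourcing survey and a statistical model respectively.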
2. For introductory classes, which programming language(s) correlate with clear instruction? Academics and industry professionals disagree as to the programming language that is best suited for beginners [16]. For example, some argue that introductory courses should use interpreted languages that allow for a faster understanding of the applications of programming rather than compiled languages that rely heavily on language-specific syntax. Others believe that developing skill with compiled languages is necessary for future work in computer science. The choice of a first programming language likely affects students’ decision to continue education within the field of computer science.

3. Are students more interested in courses with publicly available online syllabi? The choice to make a syllabus publicly available adds to the information available to prospective students on the Web. We hypothesize that the posting of an online syllabus can be used as a proxy for factors including instructor organization and motivation, and that students will both be more interested in and prefer these courses.

The rest of this paper is organized as follows. § 2 explains the ACID methodology; § 3 describes three case studies of evaluating teaching decisions using ACID; § 4 relates to prior work; § 5 concludes.

2. ANALYZING CURRICULUM DECISIONS
Pseudocode for the ACID methodology is presented in Algorithm 1. For a given number of reviews, we sample n universities, remove the non-English speaking universities, scrape and parse the relevant reviews from a ratings website and retain the ratings of professors rated by more than a given number of students. We then extract information from these courses using crowd-sourcing, and analyze the data. We describe the process in detail below.

To evaluate the relative impact of different course features, we mine the web for data that reflect:

• Curriculum decisions. University professors often upload information about their classes. This information is targeted towards prospective or enrolled students. It includes syllabi with detailed descriptions of course material such as textbooks, projects, homework and exams. We make use of this data to infer teaching strategies.

• Student perceptions of the course. We make use of self-selected student evaluations collected from a third-party website. The validity and usefulness of self-selected online rating systems have been assessed in the literature [1, 12]. For example, evidence suggests that online ratings do not lead to substantially more biased ratings than those done in a traditional classroom setting [1] and that online ratings are a proxy to measure student learning [12]: student learning can often be modeled as a latent variable that causes patterns of observed faculty ratings. Researchers hypothesize a non-linear or concave relationship between student learning and the perceived difficulty level of a course [12]; students learn most when a course is neither too difficult nor too easy. Our work relies on self-selected ratings as a metric to study learner opinion.

We use publicly available self-selected ratings of professors from a third-party website, Rate My Professor¹ (RMP). This site allows students to rate the professors of the courses they have taken. The database contains data from over 13 million ratings for 1.5 million professors. The site collects ratings on a 1–5 scale (1 being the lowest possible score and 5 the highest) under the categories of “easiness”, “helpfulness” and “clarity.” Additionally, students may fill out an “interest” field in which they indicate how appealing the class was before enrolling, and a 350-character summary of their class experience. We focus on perceived clarity because of the direct link between clarity and quality of instruction.

¹ ratemyprofessor.com

Table 1: Statistics for the Ratings Sample
            Easiness  Helpfulness  Clarity  Interest
Mean        2.84      3.30         3.24     3.35
Std. Dev.   1.33      1.62         1.59     4.00
Median      3.00      4.00         4.00     1.38

For the purposes of this paper, we focus on Computer Science courses due to our familiarity with the content. Since we do not have access to the ratings database, we develop a process to sample data from the website. For this, we first select a random sample of 50 international universities that teach Computer Science from the Academic Ranking of
World Universities² [14]. From this sample we only consider the 41 universities that are English speaking.

² The Academic Ranking of World Universities is also known as the Shanghai Ranking: shanghairanking.com

We find, scrape and parse the reviews of the ratings data-set for all professors within the computer science departments of the universities in our sample. We remove the ratings of faculty who were rated by fewer than 30 students. More than one professor can teach the same course. For our analysis, we treat one course listing taught by two different professors as two separate courses. Table 1 shows the mean, standard deviation and median of the ratings in our sample. Figure 1 shows two sample ratings for one professor from our sample. The professor name and course names are removed for privacy.

We use Amazon Mechanical Turk, a crowdsourcing platform, to find course features for each of the courses in our ratings sample. We do this by asking respondents to fill out a survey. The survey asks respondents to provide the URL of the online syllabus that corresponds to the course and professor for which we have ratings and that is closest in date to the online student review. Then, using the syllabus, respondents are asked to provide the programming language(s) used, the textbook(s) used, the percentage of the grade that was determined by homework, projects, quizzes and exams, and whether the course was taught online or in a blended format (both face-to-face and online). However, when we reviewed the responses to the blended format question, it appeared that most syllabi did not provide enough information to give an accurate response.

From our original sample of 1,112 courses taught by a unique professor, respondents find an online syllabus matching the professor for 342 courses (∼31%). We hypothesize three explanations for the missing syllabi: (i) the syllabi may be accessible only with a password through a course management system, such as Blackboard, (ii) the syllabi may not be available online, or (iii) the respondents are not able to find the syllabi.

3. DATA ANALYSIS: WHAT MAKES A BETTER CLASS?
We report our results of applying the ACID methodology to evaluate teaching decisions. In § 3.1 we assess the quality of the data collected by the crowd-sourcing platform. In § 3.2 we discuss the statistical model we use. In § 3.3 we report the results of using ACID.

3.1 Data Quality
We now report how we attempt to collect high-quality data through the use of crowd-sourcing and how we assess the quality of our data.

Mechanical Turk provides a “master” qualification level to respondents that are more reliable. Masters-level respondents require higher compensation for crowd-sourcing tasks than non-Masters level respondents, although their “acceptance rate,” or proportion of approved tasks, is much higher. We ran a preliminary experiment to decide whether respondents with the master-level qualification provide better quality data for our purposes. We ask respondents to find the syllabus corresponding to a random sample of 30 courses and to answer a set of questions. Table 2 shows the accuracy and interrater agreement of Masters and non-Masters level respondents.

Table 2: Respondent Validation
              Accuracy  Interrater Agreement
Masters       100%      96.67%
non-Masters   85.56%    6.07%

In the pretest we used a screening question to evaluate the accuracy of respondents’ data on each task. We asked respondents to find the URL of the website of a faculty member at Carnegie Mellon University, randomly selected from a set of 8 for whom we knew the answer. We compared the URL they provided with the correct URL to assess accuracy. Of the 13 responses of non-Masters workers that did not provide an exact URL match, five left the validation question blank. We found that respondents with the master-level qualification were significantly more accurate (i.e. answered the validation item correctly) than the non-Masters level respondents (p-value = 0.0002).

Additionally, we tested interrater agreement by asking 3 respondents to carry out the same task, i.e. finding the same URL (for a total of 3 × 30 = 90 tasks). We used a dummy variable to code whether the three respondents provided the same URL for the course syllabus. Our measure of agreement is the proportion of courses for which all three respondents provide the same URL. Masters-level respondents agreed (i.e. all three provided the same URL) 100% of the time, whereas the non-Masters level respondents performed much worse: only 6% agreed. As a result of these comparisons, we decided to hire only Masters-level respondents to complete the crowdsourcing experiment.

After collecting the data using Masters-level respondents, we performed a post-hoc analysis by examining the responses to the screening question. From the final group of 342 responses that provided a link to an online syllabus, 325 responses (95.03%) provided the correct URL for the faculty website. It should be noted that 13 of the 17 responses that did not provide an exact URL match provided the website for a different faculty member from the set of 8, suggesting that respondents copied and pasted their previous response without checking whether the prompt had changed for the new response. Two of the 17 responses provided a link to the directory website for the faculty member rather than the faculty member’s personal website. One response provided the correct faculty member’s website within the department of Statistics rather than the department of Computer Science (the faculty member is in both departments).

3.2 Model
We describe our general linear mixed model. We provide descriptive statistics and model selection criteria.
Table 3: VPC and ICC Statistics
       University  Professor  Course
VPC    0.0646      0.3365     0.2355
ICC    0.0728      0.3425     0.1982

We explore the relationship between student reviews and features collected from online syllabus data using general linear mixed modeling. Student reviews are organized at three levels: by university, professor and course. It is important to note the non-independence of the student reviews due to the hierarchical or clustered nature of the data. We suspect that student ratings within each course, professor and perhaps university are correlated. We begin by estimating the amount of variance attributed to each of these three levels. The simplest multilevel model does not yet include explanatory variables:

$$y_{i,j} = \beta_0 + u_{0,j} + \epsilon_{i,j} \tag{1}$$

The dependent variable $y_{i,j}$ is the clarity rating that student i gave to level j. The term $\beta_0$ represents the intercept, or mean student clarity rating across all observations. The term $u_{0,j}$ represents the effect of level j, i.e. the departure of the mean clarity rating for level j from $\beta_0$. The term $\epsilon_{i,j}$ represents the error attributed to student rating i at level j. For comparison we fit a null or single-level model:

$$y_{i,j} = \beta_0 + \epsilon_{i,j} \tag{2}$$

We calculate the percentage of variation in the data set that is separately attributed to each of the three levels of the data. Conventionally the variance partition coefficient (VPC) and intraclass correlation coefficient (ICC) can be interpreted similarly to an R-squared term; both are reported in Table 3.

$$\rho = 1 - \frac{\sigma_e^2}{\sigma_e^2 + \sigma_u^2} \tag{3}$$

The VPC and ICC are denoted by $\rho$, the residual variance is denoted by $\sigma_e^2$ and the variance of the effect is denoted by $\sigma_u^2$. The ICC is a statistic that is similar to the VPC. However, since the parameter values of the within and between level variance are estimated using sample data, there may be bias due to sampling variation, particularly when there are fewer observations within a given level. The ICC as described by Bartko [1] corrects for this bias by making a small computational adjustment.³ Observe that the ICC term appears to give slightly less weight to the course effect. It is clear from both statistics that the main effect is the professor effect.

We examine the professor-level residuals and their associated standard errors to look for variation in clarity ratings across professors. The caterpillar plot (Figure 2) displays the professor residuals in rank order together with 95% confidence intervals. Wider intervals occur for professors with fewer student reviews. Observe that the majority of the intervals do not overlap and thus there are significant differences between professors. The blue circles on the far left represent professors who are rated two standard deviations below the mean clarity rating, whereas those on the far right are 1.5 standard deviations higher than the mean clarity rating. The red horizontal line refers to the “average” professor.

[Figure 2: 95% CI for Professor Residual Error. Panel title: Professor Residual Standard Errors. x-axis: professor rank (0 to 150); y-axis: conditional modes of residual error (−2.0 to 1.5).]

We calculate a Chi-squared likelihood ratio statistic by taking twice the difference between the log-likelihood values of two successive models. We begin by comparing the null model and the course-level model to assess the significance of including the course effect. We continue by adding each of the additional effects. We do not report the values of the test statistic, although all additional levels of complexity are statistically significant. We consider the Bayesian information criterion (BIC) and Akaike information criterion (AIC) as model selection tools to avoid over-fitting the data. The BIC and AIC penalize the log-likelihood of a model for the inclusion of extra parameters. The parameters are estimated using restricted maximum likelihood estimation (REML).

We choose the model with the minimum BIC. A two-level mixed model including the course effect and the professor effect provides the optimal Bayesian information criterion value. Two- and three-way interaction effects were considered, although they did not decrease the AIC or BIC of any of the models. While the log-likelihood value is maximized by including the university effect, a simpler model is preferable because it involves fewer parameter estimates and is more likely to generalize. The model can be written in matrix form:

$$Y = X\beta + Z\nu + \epsilon \tag{4}$$

$Y$ denotes the response variable observations (student ratings). $\beta$ is a vector of fixed-effects parameters with design matrix $X$. $Z$ is a design matrix of indicator variables denoting group membership across random-effect levels and $\nu$ is a vector containing random-effect parameters. $\epsilon$ is a vector of error terms.
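The quantities used for variance partitioning and model selection in this section reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the inputs would come from a fitted mixed model (the paper does not report the variance components or log-likelihoods themselves), and the likelihood ratio statistic uses the conventional form of twice the log-likelihood difference.

```python
import math


def vpc(var_effect: float, var_residual: float) -> float:
    """Eq. (3): share of total variance attributed to one grouping level."""
    total = var_effect + var_residual
    if total <= 0:
        raise ValueError("total variance must be positive")
    return 1.0 - var_residual / total


def likelihood_ratio_statistic(loglik_simple: float, loglik_complex: float) -> float:
    """Chi-squared statistic comparing two nested models."""
    return 2.0 * (loglik_complex - loglik_simple)


def aic(loglik: float, n_params: int) -> float:
    """Akaike information criterion: penalty of 2 per estimated parameter."""
    return 2.0 * n_params - 2.0 * loglik


def bic(loglik: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: penalty of log(n_obs) per parameter."""
    return n_params * math.log(n_obs) - 2.0 * loglik
```

For example, with hypothetical variance components of 1.0 for the effect and 3.0 for the residual, `vpc(1.0, 3.0)` attributes a quarter of the variance to the grouping level. Because the BIC penalty grows with the number of observations, it tends to favor the simpler model more strongly than the AIC does on a sample of this size, which is consistent with choosing the minimum-BIC model to limit over-fitting.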
³ For a description of the computation of the ICC, see the documentation and source code for the R library lme.

3.3 Case Studies
We show the results of using the ACID methodology to answer three course design questions.
                                                                                                                      Optimizing the Number of Clusters
    Table 4: Programming Language Statistics




                                                                                               8000
           Value Std.Err t-value Pr<|t|  n                                                                  Bayesian Information Criterion
                                                                                                            Akaike Information Criterion
   C        3.38   0.32   10.58  0.0000   109
   C++      3.30   0.31   10.65  0.0000   214




                                                                                               7900
   Java     3.62   0.19   19.33  0.0000   353
   Python   3.70   0.26   14.50  0.0000   133




                                                                       Information Criterion

                                                                                               7800
   Scheme   4.06   0.47   8.61   0.0000   32
   Scratch  3.91   0.84   4.67   0.0000   49




[Figure 3: Information Criterion. Horizontal axis: Number of Clusters.]

[Figure 4: Log Likelihood. Plot title: Optimizing the Number of Clusters. Horizontal axis: Number of Clusters; vertical axis: Log-Likelihood.]

3.3.1    For introductory classes, which programming language do students associate with clear instruction?
Professors teaching introductory level courses in computer science choose between a number of programming languages and textbooks. We make use of the data collected to provide insights into which programming languages beginning students associate with clear instruction. We filter the data to only include introductory level courses (those which do not require any prerequisite coursework in computer science). Our restricted sample includes 1,024 reviews; 34.58% of all reviews with syllabus data are of introductory courses. We explore the relationship between clarity ratings and programming language with random professor and course effects. Programming languages with fewer than 30 student reviews are not reported (see footnote 4). Table 4 gives the estimates for student ratings of clarity by programming language and their associated p-values. An intercept is not modeled in order to make the results easily interpretable. The mean clarity rating for introductory courses is 3.599.

We found C and C++ had the lowest coefficients (i.e. compiled languages had the lowest perceived clarity ratings). Scheme and Scratch have the highest clarity ratings followed by Python and Java. We note that the standard errors are largest for Scheme and Scratch and smallest for Java and Python. This suggests that results for Java and Python are stronger. Students in our sample associate clearer instruction with interpreted languages rather than compiled languages. Also, both Python and Java are associated with clearer instruction than C or C++.
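To make the no-intercept parameterization concrete, the following Python sketch computes a per-language clarity estimate, standard error and t-value, and drops languages with fewer than 30 reviews. It is a simplified illustration and not the model fitted in this paper: it ignores the random professor and course effects, in which case the no-intercept coefficients reduce to plain group means. The function name clarity_by_language, the (language, clarity) input format and the use of each group's own sample variance for the standard error are assumptions made for the example.

```python
import math
from collections import defaultdict

MIN_REVIEWS = 30  # reporting threshold stated in the text


def clarity_by_language(reviews, min_reviews=MIN_REVIEWS):
    """Summarize clarity ratings per programming language.

    reviews: iterable of (language, clarity) pairs (hypothetical format).
    Returns {language: (estimate, std_err, t_value, n)} for every language
    with at least `min_reviews` reviews.

    Assumption: random professor and course effects are ignored, so the
    no-intercept estimate for a language is simply its mean rating.
    """
    groups = defaultdict(list)
    for language, clarity in reviews:
        groups[language].append(float(clarity))

    table = {}
    for language, ratings in groups.items():
        n = len(ratings)
        if n < min_reviews:
            continue  # too few reviews to report
        mean = sum(ratings) / n
        variance = sum((r - mean) ** 2 for r in ratings) / (n - 1)
        std_err = math.sqrt(variance / n)
        # t-value for the test that the estimate differs from zero
        t_value = mean / std_err if std_err > 0 else float("inf")
        table[language] = (mean, std_err, t_value, n)
    return table


if __name__ == "__main__":
    # Toy data only; these are not the ratings analyzed in the paper.
    toy = [("Python", 3 if i % 2 else 5) for i in range(30)]
    toy += [("Scheme", 4)] * 29  # below the threshold, so not reported
    for language, row in clarity_by_language(toy).items():
        print(language, row)
```

Because no intercept is modeled, each estimate is read directly as the expected clarity rating for that language rather than as an offset from a baseline language.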
Footnote 4: SQL is a special purpose programming language used only for relational databases and is not reported.

Table 5: Cluster Statistics
            HW      Projects   Exams   Quizzes   Other
Cluster1    18.11   2.36       76.66   0.61      2.25
Cluster2    20.59   7.90       48.90   12.46     10.15
Cluster3    7.00    40.18      46.23   3.51      3.08
Cluster4    42.93   0.76       54.61   0.70      2.00

Table 6: Grading Criteria Statistics
              Clarity   Std.Err   t-value   Pr<|t|   n
Exam Heavy    3.23      0.12      26.91     0        726
Equal Mix     3.52      0.14      26.04     0        484
Exam Proj     3.65      0.13      27.76     0        610
Exam HW       3.12      0.13      23.53     0        415

3.3.2    What mix of course activities – exams, quizzes, homework and projects – do students associate with clear instruction?
To assess students’ course ratings of clarity based on the percentage of the grade due to exams, quizzes, homework and projects, we created a factor made up of four clusters representing four ways of weighting homework, projects, exams, quizzes and miscellaneous (such as extra credit) for the students’ grade. We begin by filtering the data to only include observations in which the grading criteria (percentage of the grade determined by homework, projects, exams, quizzes and miscellaneous) are available and sum to 100. Of the 2,935 observations with syllabus data, there are 2,225 observations with full grading criteria. The difference in these numbers represents 710 ratings for which the respondents were not able to find a complete grade breakdown from the online syllabus.
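The completeness filter and the grouping of courses into grading-weight clusters can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in this study: the dictionary keys in COMPONENTS, the tolerance on the sum, the random initialization and the helper names complete_rows and kmeans are all hypothetical, and the clustering is a plain Lloyd-style k-means (the method named in the next paragraph).

```python
import random

# Hypothetical field names for the five grading components.
COMPONENTS = ("hw", "projects", "exams", "quizzes", "other")


def complete_rows(rows, total=100.0, tol=1e-6):
    """Keep rows whose five percentages are all present and sum to `total`."""
    kept = []
    for row in rows:
        values = [row.get(name) for name in COMPONENTS]
        if any(v is None for v in values):
            continue  # incomplete grade breakdown
        if abs(sum(values) - total) <= tol:
            kept.append(tuple(float(v) for v in values))
    return kept


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means: returns (centers, labels) for a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: initialize from k random points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_labels == labels and new_centers == centers:
            break  # converged
        labels, centers = new_labels, new_centers
    return centers, labels


if __name__ == "__main__":
    # Toy syllabi only; these are not observations from the study.
    toy = [
        {"hw": 20, "projects": 0, "exams": 80, "quizzes": 0, "other": 0},
        {"hw": 18, "projects": 2, "exams": 78, "quizzes": 0, "other": 2},
        {"hw": 5, "projects": 45, "exams": 45, "quizzes": 5, "other": 0},
        {"hw": 7, "projects": 40, "exams": 46, "quizzes": 4, "other": 3},
        {"hw": 20, "projects": 0, "exams": 70, "quizzes": 0, "other": 0},
    ]
    points = complete_rows(toy)
    centers, labels = kmeans(points, k=2)
    print(len(points), labels)
```

The resulting cluster label of each course can then be treated as a categorical predictor of clarity, which is how the cluster membership is used in this section.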
We use k-means clustering to partition the 2,225 observations with complete grading criteria information based on the five aforementioned variables. We optimize k, our number of clusters, by examining how the BIC and AIC of the mixture model change based on the number of clusters selected. Figure 3 displays the information criterion and Figure 4 displays the log-likelihood values for each number of clusters. A solution involving two clusters minimizes the BIC of the model, whereas a four cluster solution minimizes the AIC. The log likelihood is optimized with the four cluster solution. We consider both two and four cluster models as optimal and we find that they lend themselves to similar interpretation. The cluster means for the four cluster solution are presented in Table 5.

The first cluster represents courses that are heavily weighted towards exams with a smaller weight towards homework. The second cluster represents a more even weighting of exams, homework, projects and quizzes. The third cluster represents an equal weighting towards exams and projects. The fourth cluster represents courses that are heavily weighted towards exams and homework. The cluster membership is treated as a predictor variable and modeled using Equation 4. Table 6 displays the estimated clarity ratings within each group for the four cluster solution.

The exams and projects cluster has the highest estimate of clarity. We find that weighting projects equally with exams is associated with a clearer course experience. The equal mix cluster is also associated with higher clarity estimates. The exam heavy cluster and the exam and homework heavy cluster are associated with lower student clarity ratings. We find that a rubric that weights exams and projects evenly has higher perceived clarity ratings than a rubric which is weighted heavily towards exams and homework. This result extends to both two and four cluster solutions.

3.3.3    Does the posting of a syllabus online translate into higher ratings?
We hypothesize that the posting of the syllabus online is a proxy for organization, perhaps motivation or drive of the professor. We make use of all of the data collected to compare student reviews of professors who have a publicly available syllabus and of those who do not. Many professors may choose to only post a syllabus through course management systems that require a password. Potential students of these courses are unable to access the syllabus to determine whether the course would be a good fit. We treat the posting of an online syllabus as a factor and test for differences in clarity ratings between the two groups using our model.

We find statistically significant differences in clarity, helpfulness and interest ratings between the two groups and report the clarity estimates in Table 7. We note that the difference in easiness ratings is not statistically significant. We find evidence that students are more interested in professors and courses in which the syllabus is made publicly available. We note that the parameter estimates for the two groups are within one standard error of one another, which suggests that the conclusions are modest.

Table 7: Online Syllabi
             Clarity   Std. Err   t-value   Pr<|t|   n
Available    3.33      0.07       44.48     0        2953
Not Found    3.26      0.07       46.03     0        7702

4.   RELATION TO PRIOR WORK
Research has recently focused on online faculty ratings with mixed conclusions. Felton et al. [4] found that online instructor ratings were associated with perceived easiness, and that a “halo effect” existed in which raters gave high scores to instructors perhaps because their courses were easier. We find that student ratings of clarity and easiness are correlated (ρ=0.45) although not as strongly associated as clarity and helpfulness. We do find that student ratings of clarity and helpfulness are highly correlated (ρ=0.84). We chose to focus on clarity ratings as we assumed these were less susceptible to a “halo effect” and other bias relative to the overall ratings of a course or professor. Otto et al. [13] found issues related to bias in online ratings, stating that online ratings are characterized by selection bias as anyone can enter faculty ratings at any time. Carini et al. [1], Hardy [5], and McGhee and Lowell [6] had contradictory results, finding that an online format did not lead to more biased ratings. Otto et al. [12] hypothesized that instructor clarity and helpfulness as captured by Rate My Professor are more positively associated with student learning than easiness.

Several approaches have been proposed to synthesize responses using crowdsourcing systems such as Amazon’s Mechanical Turk. Majority voting is perhaps the simplest way to combine crowd responses using equal weights irrespective of respondent experience. The results of our preliminary analysis in assessing the accuracy of non-Masters level respondents correspond to the steep drop in respondent accuracy noted by Karger et al. [9] when low-quality respondents are present. Whitehill et al. [15] proposed a probabilistic model for combining crowd responses called Generative model of Labels, Abilities and Difficulties (GLAD). The GLAD methodology makes use of the EM algorithm to calculate parameter estimates of unobserved variables including an approximation of the expertise of the rater. Khattak and Salleb-Aouissi compared the accuracy and percentage of bad responses using majority voting, probabilistic models, and their novel approach entitled Expert Label Injected Crowd Estimation (ELICE) [10]. ELICE makes use of a few “ground truth” responses and incorporates expertise of the labeler, difficulty of the instance and an aggregation of labels. Khattak and Salleb-Aouissi found that their approach was robust and outperformed GLAD and iterative methods even when bad labelers were present. Our simple approach was to use Masters level respondents from Mechanical Turk, although GLAD and ELICE are alternative methods to reduce the number of expert level respondents required while also obtaining high quality data.

5.   CONCLUSIONS, LIMITATIONS AND FUTURE WORK
We demonstrate how the Analyzing CurrIculum Decisions (ACID) methodology can be used to leverage collective intelligence and learn student preferences. In introductory
computer science courses, we find that students who are taught interpreted languages find their classes clearer. We also find that students who are given an even weighting of exams and projects find their classes clearer; and that interest in a course corresponds to the availability of an online syllabus. Our study does not necessarily suggest that teachers should change their programming language. Further research is needed before drawing causal inferences. We argue that ACID is a beneficial tool to discover patterns in student behavior. Syllabus data and course ratings data are becoming increasingly available on the Web. This data is used by millions of students and is worthy of further research.

This study can be expanded in several ways. Student evaluations often include free form text where students can describe their experience in the course. Sentiment analysis is a probabilistic approach for categorizing student comments as being either positive or negative. One extension is to regress text sentiment on course features. There is arguably a strong association between comment sentiment and student preference. Another way ACID can be applied is to disciplines other than computer science, or to discover patterns in syllabi across disciplines that can provide insight into learner experiences.

6.   REFERENCES
 [1] R. Carini, J. Hayek, G. Kuh, J. Kennedy, and J. Ouimet. College student responses to web and paper surveys: does mode matter? Research in Higher Education, 44(1):1–19, 2003.
 [2] A. Corbett. Cognitive computer tutors: Solving the two-sigma problem. In M. Bauer, P. Gmytrasiewicz, and J. Vassileva, editors, User Modeling 2001, volume 2109 of Lecture Notes in Computer Science, pages 137–147. Springer Berlin Heidelberg, 2001.
 [3] E. J. Emanuel. Online education: Moocs taken by educated few. Nature, 503(7476):342–342, 2013.
 [4] J. Felton and J. Mitchell. Web based student evaluations of professors: the relations between perceived quality, easiness and sexiness. Assessment and Evaluation in Higher Education, 29(1):91–108, 2004.
 [5] N. Hardy. Online ratings: fact and fiction. New Directions for Teaching and Learning, (96):31–38, 2003.
 [6] D. McGhee and N. Lowell. Psychometric properties of student ratings of instruction in online and on-campus courses. New Directions for Teaching and Learning, 2003(96):39–48, 2003.
 [7] M. Haungs, C. Clark, J. Clements, and D. Janzen. Improving first-year success and retention through internet-based cs0 courses. ACM SIGCSE, pages 549–594, 2012.
 [8] S. Jaggars and T. Bailey. Effectiveness of fully online courses for college students: Response to a department of education meta-analysis. Teachers College: Community College Research Center, 2010.
 [9] D. Karger, S. Oh, and D. Shah. Budget-optimal task allocation for reliable crowdsourcing systems. CoRR, arXiv:1110.3564, 2011.
[10] F. Khattak and A. Salleb-Aouissi. Robust crowd labeling using little experience. Discovery Science, 8140:94–109, 2013.
[11] A. Moretti, J. Gonzalez-Brenes, and K. McKnight. Towards data-driven curriculum design: Mining the web to make better teaching decisions. EDM, 2014.
[12] J. Otto, D. A. Sanford Jr, and D. N. Ross. Does ratemyprofessor.com really rate my professor? Assessment & Evaluation in Higher Education, 33(4):355–368, 2008.
[13] J. Otto, D. A. Sanford Jr, and W. Wagner. Analysis of online student ratings of university faculty. Journal of College Teaching & Learning, 2(7):25–30, 2005.
[14] Shanghai. Academic ranking of world universities. Retrieved from http://www.shanghairanking.com/, Accessed at 2013 12 01.
[15] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. Neural Information Processing Systems, pages 2035–2043, 2009.
[16] J. Zelle. Python as a first language. Retrieved from http://mcsp.wartburg.edu/zelle/python/python-first.html/, Accessed at 2014 02 23.

APPENDIX
A. SAMPLE OF UNIVERSITIES SELECTED
University                          Country   n Professors   n Courses   n Reviews
Colorado State                      USA       1              9           32
Carnegie Mellon University          USA       3              21          102
North Carolina State                USA       2              10          63
Pennsylvania State                  USA       12             74          938
Rensselaer Polytechnic Institute    USA       3              22          131
Rutgers                             USA       8              30          468
Simon Fraser                        Canada    27             98          1873
SUNY Stony Brook                    USA       8              55          505
UC Davis                            USA       10             44          589
UNC Chapel Hill                     USA       1              4           49
University of Alberta               Canada    2              6           69
University of Arizona               USA       3              13          158
University of Delaware              USA       15             56          806
University of Florida Gainesville   USA       5              36          321
University of Illinois at Urbana    USA       5              14          339
University of Massachusetts         USA       6              39          405
University of Montreal              USA       1              6           59
University of Toronto               Canada    14             66          775
University of Utah                  USA       2              17          66
University of Virginia              USA       3              19          131
University of Waterloo              Canada    46             125         2700
Vanderbilt University               USA       2              10          76