=Paper= {{Paper |id=Vol-2450/paper3 |storemode=property |title=Designing for Serendipity in a University Course Recommendation System |pdfUrl=https://ceur-ws.org/Vol-2450/paper3.pdf |volume=Vol-2450 |authors=Zach Pardos,Weijie Jiang |dblpUrl=https://dblp.org/rec/conf/recsys/PardosJ19 }} ==Designing for Serendipity in a University Course Recommendation System== https://ceur-ws.org/Vol-2450/paper3.pdf
                       Designing for Serendipity in a University Course
                                  Recommendation System
                                  Zachary Pardos                                                                    Weijie Jiang
                      University of California, Berkeley                                     Tsinghua University & University of California, Berkeley
                              zp@berkeley.edu                                                          jiangwj.14@sem.tsinghua.edu.cn

ABSTRACT
Collaborative filtering based algorithms, including Recurrent Neural Networks (RNN), tend towards predicting a perpetuation of past observed behavior. In a recommendation context, this can lead to an overly narrow set of suggestions lacking in serendipity and inadvertently placing the user in what is known as a "filter bubble." In this paper, we grapple with the issue of the filter bubble in the context of a course recommendation system in production at a public university. Our approach is to present course results that are novel or unexpected to the student but still relevant to their interests. We build one set of models based on the course catalog description (BOW) and another set informed by enrollment histories (course2vec). We compare the performance of these models on off-line validation sets and against the system's existing RNN-based recommendation engine in a user study of undergraduates (N = 70) who rated their course recommendations along six characteristics related to serendipity. Results of the user study show a dramatic lack of novelty in RNN recommendations and depict the characteristic trade-offs that make serendipity difficult to achieve. While the machine learned course2vec models performed best on concept generalization tasks (i.e., course analogies), it was the simple bag-of-words based recommendations that students rated as more serendipitous. We discuss the role of the recommendation interface, and the information presented therein, in the student's decision to accept a recommendation from either algorithm.

CCS CONCEPTS
• Applied computing → Education; • Information systems → Recommender systems.

KEYWORDS
Higher education, course guidance, filter bubble, neural networks

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IntRS '19: Joint Workshop on Interfaces and Human Decision Making for Recommender Systems, 19 Sept 2019, Copenhagen, DK.

1 INTRODUCTION
Among the institutional values of a liberal arts university is to expose students to a variety of perspectives expressed in courses across its various physical and intellectual schools of thought. Collaborative filtering based sequence prediction methods, in this environment, can provide personalized course recommendations based on temporal models of normative behavior [14] but are not well suited for surfacing courses a student may find interesting but which have been relatively unexplored by those with similar course selections to them in the past. Therefore, a more diversity oriented model can serve as an appropriate complement to recommendations made from collaborative based methods. This problem of training on the past without necessarily repeating it is an open problem in many collaborative filtering based recommendation contexts, particularly social networks, where, in the degenerate cases, users can get caught in "filter bubbles," or model-based user stereotypes, leading to a narrowing of item recommendation variety [10, 12, 23].

To counteract the filter bubble, we introduce a course2vec variant into a production recommender system at a public university designed to surface serendipitous course suggestions. Course2vec applies a skip-gram to course enrollment histories, instead of natural language, in order to learn course representations. We use the definition of serendipity as user perceived unexpectedness of a result combined with successfulness [20], which we define as a course recommendation a student expresses interest in taking. At many universities, conceptually similar courses exist across departments but use widely differing disciplinary vernacular in their catalog descriptions, making them difficult for learners to search for and making their commonality hard to realize. We propose that by tuning a vector representation of courses learned from nine years of enrollment sequences, we can capture enough implicit semantics of the courses to construe similarity more abstractly and more accurately. To encourage the embedding to learn features that may generalize across departments, our skip-gram variant simultaneously learns department (and instructor) embeddings. While more advanced attention-based text generation architectures exist [21], we demonstrate that properties of the linear vector space produced by "shallow" networks are of utility to this recommendation task. Our recommendations are made with only a single explicit course preference given by the user, as opposed to the entire course selection history needed by session-based Recurrent Neural Network approaches [8]. Single example, also known as "one-shot," generalization is common in the vision community, which has pioneered approaches to extrapolating a category from a single labeled example [7, 22]. Other related work applying skip-grams to non-linguistic data includes node embeddings learned from sequences of random walks of graphs [19] and product embeddings learned from ecommerce clickstreams [2]. Our work, methodologically, adds rigor to this approach by tuning the model against validation sets created from institutional knowledge and curated by the university.

We conduct a user study (N = 70) of undergraduates at the University to evaluate their personalized course recommendations made by our models designed for serendipity and by the RNN-based engine, which previously drove recommendations in the system. The findings underscore the tension between unexpectedness and successfulness and show the deficiency of RNNs for producing novel recommendations. While our course2vec based model scored 68% above bag-of-words in accuracy on one of our course analogy validation sets, simple bag-of-words scored slightly higher in the
main objective of user perceived serendipity. A potential reason for this discrepancy is the nature of the information presented to students in the recommender system interface. Catalog descriptions of recommended courses were shown to students, which served as the only source of information they could consult in deciding if they wanted to take the course. A generated explanation, or prioritization of the course2vec recommendation in the interface, may be required to overcome the advantage the bag-of-words model gains from being based on the same information shown to students in the recommendations.
Recommender systems in higher education contexts have recently focused on prediction of which courses a student will take [14, 17] or the grade they will receive if enrolled [9, 18]. At Stanford, a system called "CARTA" allows students to see grade distributions, course evaluations, and the most common courses taken before a course of interest [1]. At UC Berkeley, the recommender system being modified in this study serves students next-semester course considerations based on their personal course enrollment history [14]. Earlier systems included a focus on requirement satisfaction [13] and career-based relevancy recommendation [6]. No system has yet focused on serendipitous or novel course discovery.

2 MODELS AND METHODOLOGY
This section introduces three competing models used to generate our representations. The first model uses course2vec [14] to learn course representations from enrollment sequences. Our second model is a variant on course2vec, which learns representations of explicitly defined features of a course (e.g., instructor or department) in addition to the course representation. The intuition behind this approach is that the course representation could have, conflated in it, the influence of the multiple instructors that have taught the course over time. We contend that this "deconflation" may increase the fidelity of the course representation and serve as a more accurate representation of the topical essence of the course. The last representation model is a standard bag-of-words vector, constructed for each course strictly from its catalog description. Finally, we explore concatenating a course's course2vec and bag-of-words representation vectors.

2.1 Course2vec
The course2vec model involves learning distributed representations of courses from students' enrollment records across semesters, using the notion of an enrollment sequence as a "sentence" and the courses within the sequence as "words", borrowing terminology from the linguistic domain. For each student s, a chronological course enrollment sequence is produced by first sorting by semester and then randomly serializing the within-semester course order. Each course enrollment sequence is then used in training, analogous to a document in a classical skip-gram application.

The training objective of the skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document. Each word in the corpus is used as input to a log-linear classifier with a continuous projection layer, to predict words within a certain range before and after the current word. The skip-gram model can therefore also be viewed as a classifier whose input is a target course and whose output is a context course. In this section, we consider adding features of a course to the input to enhance the classifier and its representations, as shown in Figure 1. Each course is taught by one or several instructors over the years and is associated with an academic department. The multi-factor course2vec model learns both course and course feature representations by maximizing the objective function over all the students' enrollment sequences and the features of courses. Full technical details can be found in [15].

Figure 1: multi-factor course2vec model

In language models, two word vectors will be cosine similar if they share similar sentence contexts. Likewise, in the university domain, courses that share similar co-enrollments, and similar previous and next semester enrollments, will likely be close to one another in the vector space.

2.2 Bag-of-Words
A simple but enduring approach to item representation has been to create a vector, the length of the number of unique words across all items, with a non-zero value if the word in the vocabulary appears in the item. Only unigram words are used to create this unordered vector list of words representing the document [3].

The basic bag-of-words methodology proposed by IR researchers for text corpora - a methodology successfully deployed in modern Internet search engines - reduces each document in the corpus to a vector of real numbers, each of which represents a term weight. The term weight might be:

• a term frequency value indicating how many times the term occurred in the document.
• a binary value, with 1 indicating that the term occurred in the document and 0 indicating that it did not.
• a tf-idf value [4], the product of term frequency and inverse document frequency, which increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus, helping to adjust for the fact that some words appear more frequently in general.

We evaluate all three variants in our quantitative validation testing.
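As an illustrative sketch (not the authors' implementation), the enrollment-sequence construction described in Section 2.1 - sort each student's courses by semester, then randomly serialize the within-semester order - might look as follows. The student and course IDs are invented for illustration; the resulting "sentences" could then be fed to any off-the-shelf skip-gram trainer (e.g., gensim's Word2Vec with sg=1).

```python
import random
from collections import defaultdict

def build_enrollment_sentences(records, seed=0):
    """Turn per-student enrollment records into course 'sentences'.

    records: iterable of (student_id, semester_index, course_id) tuples.
    Courses are ordered chronologically by semester; order within a
    semester is randomized, since no true ordering exists there.
    """
    rng = random.Random(seed)
    by_student = defaultdict(lambda: defaultdict(list))
    for student, semester, course in records:
        by_student[student][semester].append(course)
    sentences = []
    for student, semesters in by_student.items():
        sentence = []
        for semester in sorted(semesters):
            courses = semesters[semester][:]
            rng.shuffle(courses)  # random within-semester serialization
            sentence.extend(courses)
        sentences.append(sentence)
    return sentences

# Hypothetical records: student s1 takes two courses in semester 1,
# one in semester 2; student s2 takes one course.
records = [
    ("s1", 1, "MATH 1A"), ("s1", 1, "CHEM 1A"),
    ("s1", 2, "MATH 1B"),
    ("s2", 1, "ENGLISH R1A"),
]
sentences = build_enrollment_sentences(records)
# Each sentence is one student's chronological course sequence;
# semester-2 courses always follow semester-1 courses, but the order
# within a semester depends on the shuffle seed.
```

One design note: randomizing within-semester order (rather than fixing an arbitrary order) prevents the skip-gram from learning spurious structure from, e.g., alphabetical course listings.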
2.3 Surfacing Serendipitous Recommendations from Course Representations
We surface recommendations intended to be interesting but unexpected by finding the objective course c_j which is most similar to a student's favorite course c_i, diversifying the results by allowing only one result per department d_j:

    c*_j = argmax_{c : d(c) = d_j} cos(c, c_i)    (1)

where d(c) denotes the department of course c. All the counterpart courses c*_j from the other departments are then ranked according to cos(c*_j, c_i), where j = 1, 2, ..., k. We can apply both neural representations and bag-of-words representations of courses in this method to generate the most similar courses in each department.

3 EXPERIMENTAL ENVIRONMENTS

3.1 Off-line Dataset
We used a dataset containing anonymized student course enrollments at UC Berkeley from Fall 2008 through Fall 2017. The dataset consists of per-semester course enrollment records for 164,196 students (both undergraduates and graduates) with a total of 4.8 million enrollments. A course enrollment record means that the student was still enrolled in the course at the end of the semester. Students at this university, during this period, were allowed to drop courses up until close to the end of the semester without penalty. The median course load during students' active semesters was four. There were 9,478 unique lecture courses from 214 departments¹ hosted in 17 different Divisions of 6 different Colleges. Course meta-information contains course number, department name, total enrollment, and max capacity. In this paper, we only consider lecture courses with at least 20 enrollments total over the 9-year period, leaving 7,487 courses. Although courses can be categorized as undergraduate courses and graduate courses, undergraduates are permitted to enroll in many graduate courses regardless of their status.

Enrollment data were sourced from the campus enterprise data warehouse, with course descriptions sourced from the official campus course catalog API. We pre-processed the course description data in the following steps: (1) removing generic, often-seen sentences across descriptions, (2) removing stop words, (3) removing punctuation, and (4) word lemmatization and stemming, finally tokenizing the bag-of-words in each course description. We then compiled the term frequency vector, binary value vector, and tf-idf vector for each course.

¹ At UC Berkeley, the smallest academic unit is called a "subject." For the purpose of communicability, we instead refer to subjects as departments.

3.1.1 Semantic Validation Sets. In order to quantitatively evaluate how accurate the vector models are, a source of ground truth on the relationships between courses needed to be brought to bear to see the degree to which the vector representations encoded this information. We used two such sources of ground truth to serve as validation sets, one providing information on similarity, the other on a variety of semantic relationships.

• Equivalency validation set: A set of 1,351 course credit-equivalency pairs maintained by the Office of the Registrar were used for similarity based ground truth. A course is paired with another course in this set if a student can only receive credit for taking one of the courses. For example, an honors and non-honors version of a course will appear as a pair because faculty have deemed that there is too much overlapping material between the two for a student to receive credit for both.
• Analogy validation set: The standard method for validating learned word vectors has been to use analogies to test the degree to which the embedding structure contains semantic and syntactic relationships constructed from prior knowledge. In the domain of university courses, we use course relationship pairs constructed in prior work using first-hand knowledge of the courses [16]. The 77 relationship pairs were in five categories: online, honors, mathematical rigor, 2-department topics, and 3-department topics. An example of an "online" course pair would be Engineering 7 and its online counterpart, Engineering W7, or Education 161 and W161. An analogy involving these two pairs could be calculated as: Engineering W7 − Engineering 7 + Education 161 ≈ Education W161.

3.2 Online Environment (System Overview)
The production recommender system at UC Berkeley uses a student data pipeline with the enterprise data warehouse to keep up-to-date enrollment histories of students. Upon CAS login, these histories are associated with the student and passed through an RNN model, which cross-references the output recommendations with the courses offered in the target semester. Class availability information is retrieved during the previous semester from a campus API once the registrar has released the schedule. The system is written with an AngularJS front-end and a Python back-end service which loads the machine learned models written in PyTorch. These models are version controlled on GitHub and refreshed three times per semester after student enrollment status refreshes from the pipeline. The system receives traffic from around 20% of the undergraduate student body, partly from the UC Berkeley Registrar's website.

4 VECTOR MODEL REFINEMENT EXPERIMENTS
In this section, we first introduce our experiment parameters and the ways we validated the representations quantitatively. Then, we describe the various ways in which we refined the models and the results of these refinements.

4.1 Model Evaluations
We trained the models described in Section 2.1 on the student enrollment records data. Specifically, we added the instructor(s) who teach the course and the course department as two input features of courses in the multi-factor course2vec model.

To evaluate course vectors on the course equivalency validation set, we fixed the first course in each pair and ranked all the other courses according to their cosine similarity to the first course in descending order. We then noted the rank of the expected second course in the pair and described the performance of each model on all validation pairs in terms of mean rank, median rank, and recall@10.
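The equivalency evaluation just described - fix the first course of a credit-equivalency pair, rank every other course by cosine similarity, and record where the expected partner lands - can be sketched as below. This is an illustrative reconstruction, not the authors' code, and the course IDs and toy vectors in the usage example are invented.

```python
import numpy as np

def equivalency_metrics(pairs, vectors, k=10):
    """Evaluate course representations on credit-equivalency pairs.

    pairs: list of (course_a, course_b) IDs; course_a is fixed and all
    other courses are ranked by cosine similarity to it.
    vectors: dict mapping course ID -> 1-D numpy array (any
    representation: course2vec, bag-of-words, or a concatenation).
    Returns (mean rank, median rank, recall@k) of course_b.
    """
    ids = list(vectors)
    # Pre-normalize all vectors so dot products are cosine similarities.
    mat = np.stack([vectors[c] / np.linalg.norm(vectors[c]) for c in ids])
    ranks = []
    for a, b in pairs:
        query = vectors[a] / np.linalg.norm(vectors[a])
        sims = mat @ query
        # Rank all candidates except the query course itself.
        order = [ids[i] for i in np.argsort(-sims) if ids[i] != a]
        ranks.append(order.index(b) + 1)  # 1-indexed rank of the partner
    ranks = np.array(ranks)
    return float(ranks.mean()), float(np.median(ranks)), float(np.mean(ranks <= k))

# Toy usage with invented courses: an honors twin should rank first.
vectors = {
    "CHEM 1A": np.array([1.0, 0.0]),
    "CHEM 1AH": np.array([0.9, 0.1]),  # hypothetical honors twin
    "ART 1": np.array([0.0, 1.0]),
}
mean_r, median_r, r_at_10 = equivalency_metrics([("CHEM 1A", "CHEM 1AH")], vectors)
```

The same ranking machinery extends naturally to the analogy test by replacing the query vector with course2 − course1 + course3 and excluding the three given courses from the candidate list.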
For evaluation on the course analogy validation set, we followed the analogy paradigm of: course2 − course1 + course3 ≈ course4. Courses were ranked by their cosine similarity to course2 − course1 + course3. An analogy completion is considered accurate (a hit) if the first ranked course is the expected course4 (excluding the other three from the list). We calculated the average accuracy (recall@1) and the recall@10 over all the analogies in the analogy validation set.

4.2 Course2vec vs. Multi-factor Course2vec
We compared the pure course2vec model with the course representations from the multi-factor course2vec model using instructor, department, and both as factors. Full results of evaluation on the equivalency validation and analogy validation are shown in [15]. The multi-factor model outperformed the pure course2vec model in terms of recall@10 in both validation sets, with the combined instructor and department factor model performing the best.

4.3 Bag-of-words vs. Multi-factor Course2vec
Among the three bag-of-words models, tf-idf performed the best in all equivalency set metrics. The median rank (best=4) and recall@10 (best=0.5647) for the bag-of-words models were also substantially better than those of the best course2vec models, which had a best median rank of 15 with a best recall@10 of 0.4485 for the multi-factor instructor and department model. All course2vec models, however, showed better mean rank performance (best=224) compared with bag-of-words (best=566). This suggests that there are many outliers where literal semantic similarity (bag-of-words) is very poor at identifying equivalent pairs, whereas course2vec has many fewer near worst-case examples. This result is consistent with prior work comparing pure course2vec models to binary bag-of-words [14].

When considering performance on the analogy validation, the roles are reversed, with all course2vec models performing better than the bag-of-words models in both accuracy and recall@10. The difference in recall of bag-of-words compared to course2vec when it comes to analogies is substantial (0.581 vs. 0.8557), a considerably larger difference than between bag-of-words and course2vec on equivalencies (0.5647 vs. 0.4485). Again, the multi-factor instructor and department model and tf-idf were the best models in their respective classes. These analyses establish that bag-of-words models are moderately superior in capturing course similarity, but are highly inferior to enrollment-based course2vec models in the more complex task of analogy completion.

4.4 Combining Bag-of-words and Course2vec Representations
In light of the strong analogy performance of course2vec and the strong equivalency performance of bag-of-words in the previous section, we concatenated the multi-factor course2vec representations with bag-of-words representations. To address the different magnitudes of the two concatenated representations, we created a normalized version of each vector set for comparison to the non-normalized sets.

We found that the normalized concatenation of tf-idf with multi-factor course2vec performed substantially better on the equivalency test than the previous best model in terms of recall@10 (0.6435 vs. 0.5647). While the median rank of the concatenated model only improved by one rank, from 4 to 3, the mean rank improved dramatically (from 566 to 132) and is the best of all models tested in terms of mean rank. Non-normalized vectors did not show improvements over bag-of-words alone in median rank and recall@10. Improvements in the analogy test were more mild, with a recall@10 of 0.8788 for the best concatenated model, combining binary bag-of-words with multi-factor course2vec, compared with 0.8557 for the best course2vec-only model. Normalization in the case of analogies hurt all model performance, the opposite of what was observed in the equivalency test. This suggests that normalization improves local similarity but may act to degrade the more global structure of the vector space.

5 USER STUDY
A user study was conducted to evaluate the quality of recommendations drawn from our different course representations. Users rated each course from each recommendation algorithm along five dimensions of quality. Students were asked to rate course recommendations in terms of their (1) unexpectedness, (2) successfulness / interest in taking the course, (3) novelty, (4) diversity of the results, and (5) identifiable commonality among the results. In Shani and Gunawardana [20], the authors defined serendipity as the combination of "unexpectedness" and "success." In the case of a song recommender, for example, success would be defined as the user listening to the recommendation. In our case, we use a student's expression of interest in taking the course as a proxy for success. The mean of the unexpectedness and successfulness ratings comprises our measure of serendipity. We evaluated three of our developed models, all of which displayed 10 results, only showing one course per department in order to increase diversity (and unexpectedness). The models were (1) the best BOW model (tf-idf), (2) the best Analogy validation model (binary BOW + multi-factor course2vec, normalized), and (3) the best Equivalency validation model (tf-idf + multi-factor course2vec, non-normalized). To measure the impact our department diversification filter would have on serendipity, we added a version of the best Equivalency model that did not impose this filter, allowing multiple courses to be displayed from the same department if they were the most cosine similar to the user's specified favorite course. Our fifth comparison recommendation algorithm was the system's existing collaborative-filtering based Recurrent Neural Network (RNN), which recommends courses based on a prediction of what the student is likely to take next given their personal course history and what other students with a similar history have taken in the past [14]. All five algorithms were integrated into a real-world recommender system for the purpose of this study and evaluated by 70 undergraduates at the University.

5.1 Study Design
Undergraduates were recruited from popular University associated Facebook groups and asked to sign up for a one-hour evaluation session. Since they would need to specify a favorite course they had taken, we restricted participants to those who had been at the University at least one full semester and were currently enrolled. The study was run at the beginning of the Fall semester, while courses
IntRS Workshop, September 2019, Copenhagen, DK                                                                                                   Pardos and Jiang


could still be added and dropped and some students were still shop-
ping for courses. We used a within-subjects design whereby each
volunteer rated ten course recommendations made by each of the
five algorithms. Because of the considerable number of ratings
expected ([3*10+2]*5 = 160) and the importance for students to
carefully consider each recommended course, in-person sessions
were decided on over asynchronous remote sessions in order to
better encourage on-task behavior throughout the session. Student
evaluators were compensated with a $40 gift card to attend one of
four sessions offered across three days with a maximum occupancy
of 25 in each session. A total of 702 students participated.
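The department diversification filter described above can be sketched as follows. This is a minimal illustration, not the deployed implementation: the course records, departments, and embedding vectors are hypothetical stand-ins, and ranking is assumed to be by cosine similarity of course2vec-style vectors to the favorite course.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def recommend(favorite, courses, embeddings, k=10, diversify=True):
    """Rank courses by cosine similarity to the favorite course.

    With diversify=True (the "div" variants), at most one course per
    department is kept; with diversify=False ("non-div"), the top-k
    most similar courses are returned regardless of department.
    """
    fav_vec = embeddings[favorite]
    ranked = sorted(
        (c for c in courses if c["id"] != favorite),
        key=lambda c: cosine(fav_vec, embeddings[c["id"]]),
        reverse=True,
    )
    if not diversify:
        return [c["id"] for c in ranked[:k]]
    seen, out = set(), []
    for c in ranked:
        if c["dept"] in seen:
            continue  # this department is already represented
        seen.add(c["dept"])
        out.append(c["id"])
        if len(out) == k:
            break
    return out
```

The only difference between the div and non-div variants is the one-course-per-department constraint applied while walking down the similarity ranking.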
   We began the session by introducing the evaluation motivation
as a means for students to help inform the choice of algorithm
that we will use for a future campus-wide deployment of a course
exploration tool. Students started the evaluation by visiting a survey
URL that asked them to specify a favorite course they had taken
at the University. This favorite course was used by the first four
algorithms to produce 10 course recommendations each. Each
recommended course's department, course number, title, and full
catalog description were displayed to the student in the interface.
There was a survey page (Figure 2) for each algorithm in which
students were asked to read the recommended course descriptions
carefully and then rate each of the ten courses individually on a
five-point Likert scale of agreement with the following statements:
(1) This course was unexpected (2) I am interested in taking this
course (3) I did not know about this course before. These ratings
respectively measured unexpectedness, successfulness, and novelty.
After rating the individual courses, students were asked to rate
their agreement with the following statements pertaining to the
10 results as a whole: (1) Overall, the course results were diverse
(2) The course results shared something in common with my favorite
course. These ratings measured dimensions of diversity and
commonality. Lastly, students were asked to provide an optional
follow-up open text response to the question, "If you identified
something in common with your favorite course, please explain it
here." On the last page of the survey, students were asked to specify
their major and year and to give optional open response feedback on
their experience. Graduate courses were not included in the
recommendations, and the recommendations were not limited to courses
available in the current semester.

Figure 2: User survey page

5.2    Results
Results of average student ratings of the five algorithms across
the six measurement categories are shown in Table 1. The diversity-based
algorithms, denoted by "(div)," all scored higher than
the non-diversity (non-div) algorithms in unexpectedness, novelty,
diversity, and the primary measure of serendipity. The two
non-diversity-based algorithms, however, both scored higher than the
other three algorithms in successfulness and commonality. All
pairwise differences between diversity and non-diversity algorithms
were statistically significant at the p < 0.001 level after applying
a Bonferroni correction for multiple (60) tests. Within the diversity
algorithms, there were no statistically significant differences except
for BOW scoring higher than Equivalency (div) on unexpectedness
and scoring higher than both Equivalency (div) and Analogy (div)
on novelty. Among the two non-diversity algorithms, there were
no statistically significant differences except for the RNN scoring
higher on diversity and Equivalency (non-div) recommendations
scoring higher on novelty. With respect to measures of serendipity,
the div and non-div algorithms had similar scores among their
respective strengths (3.473-3.619); however, the non-div algorithms
scored substantially lower in their weak category of unexpectedness
(2.091 & 2.184) than did the div algorithms in their weak category of
successfulness (2.851-2.999), resulting in statistically significantly
higher serendipity scores for the div algorithms.
   The most dramatic difference can be seen in the measure of
novelty, where BOW (div) scored 3.896 and the system's existing
RNN (non-div) scored 1.824, the lowest rating in the results matrix.
The proportion of each rating level given to the two algorithms on
this question is shown in Figures 3 and 5. Hypothetically, an
algorithm that recommended randomly selected courses would score
high in both novelty and unexpectedness; it is therefore critical to
also weigh an algorithm's ability to recommend courses that are of
interest to students. Figure 4 shows successfulness ratings for each
of the algorithms aggregated by rank of the course result. The
non-div algorithms, shown with dotted lines, always perform as well
as or better than the div algorithms at every rank. The more steeply
declining slope of the div algorithms depicts the increasing difficulty
of finding courses of interest across different departments. The
tension between recommending courses of interest and courses
that are unexpected is shown in Figure 6, where the best serendipitous
model, BOW (div), recommends a top course of higher
successfulness than unexpectedness, with the two measures intersecting
at rank 2 and diverging afterwards. The best equivalency model,
combining course description tf-idf and course2vec (non-div),
maintains high successfulness but also maintains low unexpectedness
across the 10 course recommendation ranks.

² Due to an authentication bug during the fourth session, twenty participating students
were not able to access the collaborative recommendations of the fifth algorithm (RNN).
RNN results in the subsequent section are therefore based on the 50 students from the
first three sessions. When paired t-tests are conducted between RNN and the ratings of
other algorithms, the tests are between ratings among these 50 students.
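The significance testing described above (paired t-tests with a Bonferroni correction over 60 comparisons) amounts to dividing the family-wise alpha by the number of tests and computing a paired t statistic per comparison. A minimal sketch follows; the rating vectors are illustrative, not the study data:

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    # Paired t statistic: mean of the per-subject differences
    # divided by the standard error of those differences.
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Bonferroni correction: divide the family-wise alpha by the test count.
alpha, n_tests = 0.05, 60
threshold = alpha / n_tests  # 0.05 / 60 ~= 0.00083

# Hypothetical per-student ratings for a div vs. a non-div algorithm.
div_ratings = [4, 5, 3, 4, 4, 5, 3, 4]
nondiv_ratings = [2, 3, 2, 3, 2, 2, 3, 2]
t = paired_t(div_ratings, nondiv_ratings)
```

A comparison is then declared significant only when its p-value falls below the corrected threshold rather than the nominal 0.05.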
IntRS Workshop, September 2019, Copenhagen, DK                                                                                          Pardos and Jiang


    Table 1: Average student ratings of recommendations from the five algorithms across the six measurement categories.

                      algorithm                  unexpectedness   successfulness    serendipity   novelty   diversity   commonality
                      BOW (div)                  3.550            2.904             3.227         3.896     4.229       3.229
                      Analogy (div)              3.473            2.851             3.162         3.310     4.286       2.986
                      Equivalency (div)          3.297            2.999             3.148         3.323     4.214       3.257
                      Equivalency (non-div)      2.091            3.619             2.855         2.559     2.457       4.500
                      RNN (non-div)              2.184            3.566             2.875         1.824     3.160       4.140
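The reported serendipity scores in Table 1 are consistent with serendipity being computed as the mean of unexpectedness and successfulness (e.g., (3.550 + 2.904)/2 = 3.227 for BOW (div)); the paper does not state the formula here, so this is an observation checked against the table rather than a definition. A quick verification over all five rows:

```python
# Table 1 values: (unexpectedness, successfulness, reported serendipity).
rows = {
    "BOW (div)":             (3.550, 2.904, 3.227),
    "Analogy (div)":         (3.473, 2.851, 3.162),
    "Equivalency (div)":     (3.297, 2.999, 3.148),
    "Equivalency (non-div)": (2.091, 3.619, 2.855),
    "RNN (non-div)":         (2.184, 3.566, 2.875),
}
for name, (unexp, succ, ser) in rows.items():
    # Each reported serendipity matches the average to rounding precision.
    assert abs((unexp + succ) / 2 - ser) < 0.005, name
```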




Figure 3: Student novelty rating proportions of course recommendations produced by BOW (div)

Figure 4: Successfulness comparison

Figure 5: Student novelty rating proportions of course recommendations produced by RNN (non-div)

Figure 6: BOW (div) vs. Equivalency (non-div) comparison

   Are more senior students less likely to rate courses as novel or
unexpected, given that they have been at the University longer and
been exposed to more courses? Among our sophomore (27), junior
(22), and senior (21) level students, there were no statistically
significant trends among the six measures, except for a marginally
significant trend (p = 0.007, shy of the p < 0.003 threshold given the
Bonferroni correction) of more senior students rating recommendations
as less unexpected (avg = 2.921) than juniors (avg = 3.024), whose
ratings were not statistically separable from those of sophomores
(avg = 3.073).

5.3    Qualitative Characterization of Algorithms
In this section, we attempt to synthesize qualitative characterizations
of the different algorithms by looking at the open responses
students gave to the question asking them to describe any
commonalities they saw between recommendations made by each
algorithm and their favorite course.

5.3.1 BOW (div). Several students remarked positively about
recommendations matching the themes of "art, philosophy, and
society" or "design" exhibited in their favorite course. The word
"language" was mentioned by 14 of the 61 respondents answering
the open response question. Most of these comments were negative,
pointing out the limitations of similarity matching based solely
on literal course description matching. The most common critique
given in this category was of the foreign spoken language courses
that showed up at the lower ranks when students specified a favorite
course involving programming languages. Other students remarked
on additional dissimilarity when specifying a favorite course related
to cyber security, receiving financial security courses in the results.

5.3.2 Analogy (div). The word "interesting" appeared in seven of
the 54 comments left by students describing commonalities among
recommendations from the analogy-validation optimized algorithm.
This word was not among the top 10 most frequent words in any
of the other four algorithms. Several students identified broad
themes among the courses that matched to their favorite course,
such as "identity" and "societal development." On the other end of
the spectrum, one student remarked that the results "felt weird" and
were only "vaguely relevant." Another student stated that, "the most
interesting suggestion was the Introduction to Embedded Systems
[course] which is just different enough from my favorite course
that it's interesting but not too different that I am not interested,"
which poignantly articulates the crux of the difficulty in striking
a balance between interest and unexpectedness to achieve a
serendipitous recommendation.

5.3.3 Equivalency (div). Many students (seven of the 55) remarked
positively on the commonality of the results with themes of data
exhibited by their favorite course (in most cases STATS C8, an
introductory data science course). They mentioned how the courses
all involved "interacting with data in different social, economic,
and psychological contexts" and "data analysis with different
applications." One student remarked on this algorithm's tendency to
match at or around the main topic of the favorite course, further
remarking that "they were relevant if looking for a class tangentially
related."

5.3.4 Equivalency (non-div). This algorithm was the same as the
above, except that it did not limit results to one course per
department. Because of this lack of a department filter, 15 of the 68
students submitting open text responses to the question of
commonality pointed out that the courses returned were all from the
same department. Since this model scored highest on a validation
task of matching to a credit-equivalent course pair (almost always
in the same department), it is not surprising that students observed
that results from this algorithm tended to all come from the
department of the favorite course, which also put it close to their
nexus of interest.

5.3.5 RNN (non-div). The RNN scored lowest in novelty, significantly
lower than the other non-div algorithm, and was not significantly
different from the other non-div algorithm in successfulness.
In this case, what is the possible utility of the collaborative-based
RNN over the non-div Equivalency model? Many of the 47 (of 50)
student answers to the open response commonality question
remarked that the recommendations related to their major (mentioned
by 21 students) and contained courses that fulfilled a requirement
(mentioned by seven) as the distinguishing signature of this
algorithm. Since the RNN is based on normative next-course enrollment
behavior, it is reasonable that it suggested many courses that satisfy
an unmet requirement. This algorithm's ability to predict student
enrollments accurately became a detriment to some, as seven
remarked that it was recommending courses that they were currently
enrolled in. Due to the institutional data refresh schedule, student
current enrollments are not known until after the add/drop deadline.
This may be a shortcoming that can be rectified in the future.

6    FEATURE RE-DESIGN
As a result of the feedback received from the user study, we worked
with campus to pull down real-time information on student
requirement satisfaction from the Academic Plan Review module of the
PeopleSoft Student Information System. We re-framed the RNN
feature as a "Requirements" satisfying feature that, upon log-in,
shows students their personalized list of unsatisfied requirements
(Figure 8). After selecting a requirement category to satisfy, the
system displays courses which satisfy the selected requirement and
are offered in the target semester. The list of courses is sorted by
the RNN to represent the probability that students like them will
take the class. This provides a signal to the student of what the
normative course-taking behavior is in the context of requirement
satisfaction. For serendipitous suggestions, we created a separate
"Explore" tab (Figure 7) using the BOW (div) model to surface the
top five courses similar across departments, due to its strong
serendipity and novelty ratings. The Equivalency (non-div) model was
used to display an additional five most-similar courses within the
same department. This model was chosen due to its strong
successfulness ratings.

7    DISCUSSION
Surfacing courses that are of interest but not known before means
expanding a student's knowledge and understanding of the
University's offerings. As students are exposed to courses that veer
further from their home department and nexus of interest and
understanding, recommendations become less familiar, with descriptions
that are harder to connect with. This underscores the difficulty of
producing an unexpected but interesting course suggestion, as it
often must represent a recommendation of uncommon wisdom in
order to extend outside of a student's zone of familiarity surrounding
their centers of interest. Big data can be a vehicle for, at times,
reaching that wisdom. Are recommendations useful when they
suggest something expected or already known? Two distinct sets of
responses to this question emerged from student answers to the last
open-ended feedback question. One representative remark stated,

          "The best algorithms were the ones that had more
          diverse options, while still staying true to the core
          function of the class I was searching. The algorithms
          that returned classes that were my major
          requirements/in the same department weren't as
          helpful because I already knew of their existence
          as electives I could be taking"

while a different representative view was expressed with,

          "I think the fifth algorithm [RNN] was the best fit
          for me because my major is pretty standardized"

These two comments make a case for both capabilities being of
importance. They are also a reminder of the desire among young
adults for the socio-technical systems of the university to offer a
balance of information, exploration and, at times, guidance.
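The re-designed "Requirements" flow described above reduces to filtering the catalog by the selected unsatisfied requirement and target-semester availability, then sorting by the RNN's predicted take-probability. A minimal sketch follows; the course records, requirement names, and scores are hypothetical (the deployed system queries PeopleSoft and a trained RNN):

```python
def requirement_suggestions(courses, rnn_prob, requirement, semester):
    """Courses satisfying `requirement` and offered in `semester`,
    sorted by the RNN's predicted probability that similar students
    will take them (highest first)."""
    eligible = [
        c for c in courses
        if requirement in c["satisfies"] and semester in c["offered"]
    ]
    return sorted(eligible, key=lambda c: rnn_prob[c["id"]], reverse=True)

# Hypothetical catalog: requirements satisfied and semesters offered.
catalog = [
    {"id": "HIST 100", "satisfies": {"Arts & Humanities"}, "offered": {"Fall"}},
    {"id": "PHIL 25A", "satisfies": {"Arts & Humanities"}, "offered": {"Fall", "Spring"}},
    {"id": "STAT 20",  "satisfies": {"Quantitative"},      "offered": {"Fall"}},
]
scores = {"HIST 100": 0.12, "PHIL 25A": 0.48, "STAT 20": 0.33}
ranked = requirement_suggestions(catalog, scores, "Arts & Humanities", "Fall")
```

Sorting within a requirement category is what surfaces the normative course-taking signal, while the "Explore" tab handles the serendipitous side separately.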


Figure 7: The "Explore" Interface

Figure 8: The "Requirements" Interface

8    LIMITATIONS
The more distal a course description is, even if conceptually similar,
the less a student may be able to recognize the commonality with
a favorite course. A limitation of our study in demonstrating the
utility of the neural embedding is that students had to rely on the
course description semantics in order to familiarize themselves
with the suggested course and determine if they were interested
in taking it. If a concept was detected by the neural embedding
but not by the BOW model, this likely meant that the concept was
difficult to pick up from the course description displayed to students.
Past work has shown that users evaluate media recommendations
less favorably before they take the recommendation than after when
important aspects of the recommended content are not described in
the recommendation [11]. Future work could augment recommended
course descriptions with additional information, including latent
semantics inferred from enrollments [5] or additional semantics
retrieved from available course syllabi.

ACKNOWLEDGMENTS
This work was partly supported by the United States National
Science Foundation (1547055/1446641) and the National Natural
Science Foundation of China (71772101/71490724).

REFERENCES
 [1] Sorathan Chaturapruek, Thomas Dee, Ramesh Johari, René Kizilcec, and Mitchell
     Stevens. 2018. How a data-driven course planning tool affects college students'
     GPA: evidence from two field experiments. (2018).
 [2] Hung-Hsuan Chen. 2018. Behavior2Vec: Generating Distributed Representations
     of Users' Behaviors on Products for Recommender Systems. ACM Transactions
     on Knowledge Discovery from Data (TKDD) 12, 4 (2018), 43.
 [3] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008.
     Introduction to Information Retrieval. Cambridge University Press.
 [4] Martin Dillon. 1983. Introduction to modern information retrieval: G. Salton and
     M. McGill. McGraw-Hill, New York (1983). 448 pp., ISBN 0-07-054484-0.
 [5] Matt Dong, Run Yu, and Zach A. Pardos. In press. Design and Deployment of a
     Better Course Search Tool: Inferring latent keywords from enrollment networks.
     In Proceedings of the 14th European Conference on Technology Enhanced Learning.
     Springer.
 [6] Rosta Farzan and Peter Brusilovsky. 2011. Encouraging user participation in a
     course recommender system: An impact on user behavior. Computers in Human
     Behavior 27, 1 (2011), 276–284.
 [7] Li Fei-Fei, Rob Fergus, and Pietro Perona. 2006. One-shot learning of object
     categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 4
     (2006), 594–611.
 [8] Balázs Hidasi, Massimo Quadrana, Alexandros Karatzoglou, and Domonkos
     Tikk. 2016. Parallel recurrent neural network architectures for feature-rich
     session-based recommendations. In Proceedings of the 10th ACM Conference on
     Recommender Systems. ACM, 241–248.
 [9] Weijie Jiang, Zachary A. Pardos, and Qiang Wei. 2019. Goal-based course
     recommendation. In Proceedings of the 9th International Conference on Learning
     Analytics & Knowledge. ACM, 36–45.
[10] Judy Kay. 2000. Stereotypes, student models and scrutability. In International
     Conference on Intelligent Tutoring Systems. Springer, 19–30.
[11] Benedikt Loepp, Tim Donkers, Timm Kleemann, and Jürgen Ziegler. 2018. Impact
     of item consumption on assessment of recommendations in user studies. In
     Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 49–53.
[12] Tien T. Nguyen, Pik-Mai Hui, F. Maxwell Harper, Loren Terveen, and Joseph A.
     Konstan. 2014. Exploring the filter bubble: the effect of using recommender
     systems on content diversity. In Proceedings of the 23rd International Conference
     on World Wide Web. ACM, 677–686.
[13] Aditya Parameswaran, Petros Venetis, and Hector Garcia-Molina. 2011.
     Recommendation systems with complex constraints: A course recommendation
     perspective. ACM Transactions on Information Systems (TOIS) 29, 4 (2011), 20.
[14] Zachary A. Pardos, Zihao Fan, and Weijie Jiang. 2019. Connectionist
     recommendation in the wild: on the utility and scrutability of neural networks for
     personalized course guidance. User Modeling and User-Adapted Interaction 29, 2
     (2019), 487–525.
[15] Zachary A. Pardos and Weijie Jiang. 2019. Combating the Filter Bubble: Designing
     for Serendipity in a University Course Recommendation System. arXiv preprint
     arXiv:1907.01591 (2019).
[16] Zachary A. Pardos and Andrew Joo Hun Nam. 2018. A Map of Knowledge. CoRR
     preprint abs/1811.07974 (2018). https://arxiv.org/abs/1811.07974
[17] Agoritsa Polyzou, Athanasios N. Nikolakopoulos, and George Karypis. 2019.
     Scholars Walk: A Markov Chain Framework for Course Recommendation. In
     Proceedings of the 12th International Conference on Educational Data Mining.
     396–401.
[18] Zhiyun Ren, Xia Ning, Andrew S. Lan, and Huzefa Rangwala. 2019. Grade
     Prediction Based on Cumulative Knowledge and Co-taken Courses. In Proceedings
     of the 12th International Conference on Educational Data Mining. 158–167.
[19] Leonardo F. R. Ribeiro, Pedro H. P. Saverese, and Daniel R. Figueiredo. 2017.
     struc2vec: Learning node representations from structural identity. In Proceedings
     of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and
     Data Mining. ACM, 385–394.
[20] Guy Shani and Asela Gunawardana. 2011. Evaluating recommendation systems.
     In Recommender Systems Handbook. Springer, 257–297.
[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
     Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all
     you need. In Advances in Neural Information Processing Systems. 5998–6008.
[22] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. 2016.
     Matching networks for one shot learning. In Advances in Neural Information
     Processing Systems. 3630–3638.
[23] Yuan Cao Zhang, Diarmuid Ó Séaghdha, Daniele Quercia, and Tamas Jambor.
     2012. Auralist: introducing serendipity into music recommendation. In Proceedings
     of the Fifth ACM International Conference on Web Search and Data Mining. ACM,
     13–22.