=Paper= {{Paper |id=Vol-2450/paper3 |storemode=property |title=Designing for Serendipity in a University Course Recommendation System |pdfUrl=https://ceur-ws.org/Vol-2450/paper3.pdf |volume=Vol-2450 |authors=Zach Pardos,Weijie Jiang |dblpUrl=https://dblp.org/rec/conf/recsys/PardosJ19 }} ==Designing for Serendipity in a University Course Recommendation System== https://ceur-ws.org/Vol-2450/paper3.pdf
                       Designing for Serendipity in a University Course
                                  Recommendation System
                                  Zachary Pardos                                                                    Weijie Jiang
                      University of California, Berkeley                                     Tsinghua University & University of California, Berkeley
                              zp@berkeley.edu                                                          jiangwj.14@sem.tsinghua.edu.cn

ABSTRACT
Collaborative filtering based algorithms, including Recurrent Neural Networks (RNN), tend towards predicting a perpetuation of past observed behavior. In a recommendation context, this can lead to an overly narrow set of suggestions lacking in serendipity and inadvertently placing the user in what is known as a "filter bubble." In this paper, we grapple with the issue of the filter bubble in the context of a course recommendation system in production at a public university. Our approach is to present course results that are novel or unexpected to the student but still relevant to their interests. We build one set of models based on the course catalog description (BOW) and another set informed by enrollment histories (course2vec). We compare the performance of these models on off-line validation sets and against the system's existing RNN-based recommendation engine in a user study of undergraduates (N = 70) who rated their course recommendations along six characteristics related to serendipity. Results of the user study show a dramatic lack of novelty in RNN recommendations and depict the characteristic trade-offs that make serendipity difficult to achieve. While the machine learned course2vec models performed best on concept generalization tasks (i.e., course analogies), it was the simple bag-of-words based recommendations that students rated as more serendipitous. We discuss the role of the recommendation interface, and the information presented therein, in the student's decision to accept a recommendation from either algorithm.

CCS CONCEPTS
• Applied computing → Education; • Information systems → Recommender systems.

KEYWORDS
Higher education, course guidance, filter bubble, neural networks

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IntRS '19: Joint Workshop on Interfaces and Human Decision Making for Recommender Systems, 19 Sept 2019, Copenhagen, DK.

1 INTRODUCTION
Among the institutional values of a liberal arts university is to expose students to a variety of perspectives expressed in courses across its various physical and intellectual schools of thought. Collaborative filtering based sequence prediction methods, in this environment, can provide personalized course recommendations based on temporal models of normative behavior [14] but are not well suited for surfacing courses a student may find interesting but which have been relatively unexplored by those with similar course selections to them in the past. Therefore, a more diversity oriented model can serve as an appropriate complement to recommendations made from collaborative based methods. This problem of training on the past without necessarily repeating it is an open problem in many collaborative filtering based recommendation contexts, particularly social networks, where, in the degenerate cases, users can get caught in "filter bubbles," or model-based user stereotypes, leading to a narrowing of item recommendation variety [10, 12, 23].

To counteract the filter bubble, we introduce a course2vec variant into a production recommender system at a public university designed to surface serendipitous course suggestions. Course2vec applies a skip-gram to course enrollment histories, instead of natural language, in order to learn course representations. We use the definition of serendipity as user perceived unexpectedness of a result combined with successfulness [20], which we define as a course recommendation a student expresses interest in taking. At many universities, conceptually similar courses exist across departments but use widely differing disciplinary vernacular in their catalog descriptions, making them difficult for learners to search for and making their commonality hard to realize. We propose that by tuning a vector representation of courses learned from nine years of enrollment sequences, we can capture enough implicit semantics of the courses to construe similarity more abstractly and more accurately. To encourage the embedding to learn features that may generalize across departments, our skip-gram variant simultaneously learns department (and instructor) embeddings. While more advanced attention-based text generation architectures exist [21], we demonstrate that properties of the linear vector space produced by "shallow" networks are of utility to this recommendation task. Our recommendations are made with only a single explicit course preference given by the user, as opposed to the entire course selection history needed by session-based Recurrent Neural Network approaches [8]. Single example, also known as "one-shot," generalization is common in the vision community, which has pioneered approaches to extrapolating a category from a single labeled example [7, 22]. Other related work applying skip-grams to non-linguistic data includes node embeddings learned from sequences of random walks of graphs [19] and product embeddings learned from ecommerce clickstreams [2]. Our work, methodologically, adds rigor to this approach by tuning the model against validation sets created from institutional knowledge and curated by the university.

We conduct a user study (N = 70) of undergraduates at the University to evaluate their personalized course recommendations made by our models designed for serendipity and by the RNN-based engine, which previously drove recommendations in the system. The findings underscore the tension between unexpectedness and successfulness and show the deficiency of RNNs for producing novel recommendations. While our course2vec based model scored 68% above bag-of-words in accuracy on one of our course analogy validation sets, simple bag-of-words scored slightly higher in the
main objective of user perceived serendipity. A potential reason for this discrepancy is the nature of the information presented to students in the recommender system interface. Catalog descriptions of recommended courses were shown to students, which served as the only source of information they could consult in deciding if they wanted to take the course. A generated explanation, or prioritization of the course2vec recommendation in the interface, may be required to overcome the advantage the bag-of-words model gains from being based on the same information shown to students in the recommendations.
Recommender systems in higher education contexts have recently focused on prediction of which courses a student will take [14, 17] or the grade they will receive if enrolled [9, 18]. At Stanford, a system called "CARTA" allows students to see grade distributions, course evaluations, and the most common courses taken before a course of interest [1]. At UC Berkeley, the recommender system being modified in this study serves students next-semester course considerations based on their personal course enrollment history [14]. Earlier systems included a focus on requirement satisfaction [13] and career-based relevancy recommendation [6]. No system has yet focused on serendipitous or novel course discovery.

2 MODELS AND METHODOLOGY
This section introduces three competing models used to generate our representations. The first model uses course2vec [14] to learn course representations from enrollment sequences. Our second model is a variant on course2vec, which learns representations of explicitly defined features of a course (e.g., instructor or department) in addition to the course representation. The intuition behind this approach is that the course representation could have, conflated in it, the influence of the multiple instructors that have taught the course over time. We contend that this "deconflation" may increase the fidelity of the course representation and serve as a more accurate representation of the topical essence of the course. The last representation model is a standard bag-of-words vector, constructed for each course strictly from its catalog description. Finally, we explore concatenating a course's course2vec and bag-of-words representation vectors.

2.1 Course2vec
The course2vec model involves learning distributed representations of courses from students' enrollment records across semesters, using the notion of an enrollment sequence as a "sentence" and the courses within the sequence as "words", borrowing terminology from the linguistic domain. For each student s, a chronological course enrollment sequence is produced by first sorting by semester and then randomly serializing the within-semester course order. Each course enrollment sequence is then used in training, analogous to a document in a classical skip-gram application.

The training objective of the skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document. Each word in the corpus is used as input to a log-linear classifier with a continuous projection layer, to predict words within a certain range before and after the current word. The skip-gram model can therefore also be viewed as a classifier whose input is a target course and whose output is a context course. In this section, we consider adding features of a course to the input to enhance the classifier and its representations, as shown in Figure 1. Each course is taught by one or several instructors over the years and is associated with an academic department. The multi-factor course2vec model learns both course and course feature representations by maximizing the objective function over all the students' enrollment sequences and the features of courses. Full technical details can be found in [15].

Figure 1: multi-factor course2vec model

In language models, two word vectors will be cosine similar if they share similar sentence contexts. Likewise, in the university domain, courses that share similar co-enrollments, and similar previous and next semester enrollments, will likely be close to one another in the vector space.

2.2 Bag-of-Words
A simple but enduring approach to item representation has been to create a vector, the length of the number of unique words across all items, with a non-zero value if the word in the vocabulary appears in the item. Only unigram words are used to create this unordered vector list of words representing the document [3].

The basic bag-of-words methodology proposed by IR researchers for text corpora - a methodology successfully deployed in modern Internet search engines - reduces each document in the corpus to a vector of real numbers, each of which represents a term weight. The term weight might be:

• a term frequency value indicating how many times the term occurred in the document.
• a binary value, with 1 indicating that the term occurred in the document and 0 indicating that it did not.
• a tf-idf value [4], the product of term frequency and inverse document frequency, which increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus, helping to adjust for the fact that some words appear more frequently in general.

We evaluate all three variants in our quantitative validation testing.
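As an illustrative sketch (not the authors' implementation), the enrollment-sequence construction described in Section 2.1 - sort each student's courses by semester, then randomly serialize the within-semester order - might look as follows. The student and course IDs are invented for illustration; the resulting "sentences" could then be fed to any off-the-shelf skip-gram trainer (e.g., gensim's Word2Vec with sg=1).

```python
import random
from collections import defaultdict

def build_enrollment_sentences(records, seed=0):
    """Turn per-student enrollment records into course 'sentences'.

    records: iterable of (student_id, semester_index, course_id) tuples.
    Courses are ordered chronologically by semester; order within a
    semester is randomized, since no true ordering exists there.
    """
    rng = random.Random(seed)
    by_student = defaultdict(lambda: defaultdict(list))
    for student, semester, course in records:
        by_student[student][semester].append(course)
    sentences = []
    for student, semesters in by_student.items():
        sentence = []
        for semester in sorted(semesters):
            courses = semesters[semester][:]
            rng.shuffle(courses)  # random within-semester serialization
            sentence.extend(courses)
        sentences.append(sentence)
    return sentences

# Hypothetical records: student s1 takes two courses in semester 1,
# one in semester 2; student s2 takes one course.
records = [
    ("s1", 1, "MATH 1A"), ("s1", 1, "CHEM 1A"),
    ("s1", 2, "MATH 1B"),
    ("s2", 1, "ENGLISH R1A"),
]
sentences = build_enrollment_sentences(records)
# Each sentence is one student's chronological course sequence;
# semester-2 courses always follow semester-1 courses, but the order
# within a semester depends on the shuffle seed.
```

One design note: randomizing within-semester order (rather than fixing an arbitrary order) prevents the skip-gram from learning spurious structure from, e.g., alphabetical course listings.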
2.3 Surfacing Serendipitous Recommendations from Course Representations
We surface recommendations intended to be interesting but unexpected by finding the objective course c_j which is most similar to a student's favorite course c_i, diversifying the results by allowing only one result per department d_j:

    c*_j = argmax_{c : d(c) = d_j} cos(c, c_i)    (1)

where d(c) denotes the department of course c. All the counterpart courses c*_j from the other departments are then ranked according to cos(c*_j, c_i), where j = 1, 2, ..., k. We can apply both neural representations and bag-of-words representations of courses in this method to generate the most similar courses in each department.

3 EXPERIMENTAL ENVIRONMENTS

3.1 Off-line Dataset
We used a dataset containing anonymized student course enrollments at UC Berkeley from Fall 2008 through Fall 2017. The dataset consists of per-semester course enrollment records for 164,196 students (both undergraduates and graduates) with a total of 4.8 million enrollments. A course enrollment record means that the student was still enrolled in the course at the end of the semester. Students at this university, during this period, were allowed to drop courses up until close to the end of the semester without penalty. The median course load during students' active semesters was four. There were 9,478 unique lecture courses from 214 departments¹ hosted in 17 different Divisions of 6 different Colleges. Course meta-information contains course number, department name, total enrollment, and max capacity. In this paper, we only consider lecture courses with at least 20 enrollments total over the 9-year period, leaving 7,487 courses. Although courses can be categorized as undergraduate courses and graduate courses, undergraduates are permitted to enroll in many graduate courses regardless of their status.

Enrollment data were sourced from the campus enterprise data warehouse, with course descriptions sourced from the official campus course catalog API. We pre-processed the course description data in the following steps: (1) removing generic, often-seen sentences across descriptions, (2) removing stop words, (3) removing punctuation, and (4) word lemmatization and stemming, finally tokenizing the bag-of-words in each course description. We then compiled the term frequency vector, binary value vector, and tf-idf vector for each course.

¹ At UC Berkeley, the smallest academic unit is called a "subject." For the purpose of communicability, we instead refer to subjects as departments.

3.1.1 Semantic Validation Sets. In order to quantitatively evaluate how accurate the vector models are, a source of ground truth on the relationships between courses needed to be brought to bear to see the degree to which the vector representations encoded this information. We used two such sources of ground truth to serve as validation sets, one providing information on similarity, the other on a variety of semantic relationships.

• Equivalency validation set: A set of 1,351 course credit-equivalency pairs maintained by the Office of the Registrar were used for similarity based ground truth. A course is paired with another course in this set if a student can only receive credit for taking one of the courses. For example, an honors and non-honors version of a course will appear as a pair because faculty have deemed that there is too much overlapping material between the two for a student to receive credit for both.
• Analogy validation set: The standard method for validating learned word vectors has been to use analogies to test the degree to which the embedding structure contains semantic and syntactic relationships constructed from prior knowledge. In the domain of university courses, we use course relationship pairs constructed in prior work using first-hand knowledge of the courses [16]. The 77 relationship pairs were in five categories: online, honors, mathematical rigor, 2-department topics, and 3-department topics. An example of an "online" course pair would be Engineering 7 and its online counterpart, Engineering W7, or Education 161 and W161. An analogy involving these two pairs could be calculated as: Engineering W7 − Engineering 7 + Education 161 ≈ Education W161.

3.2 Online Environment (System Overview)
The production recommender system at UC Berkeley uses a student data pipeline with the enterprise data warehouse to keep up-to-date enrollment histories of students. Upon CAS login, these histories are associated with the student and passed through an RNN model, which cross-references the output recommendations with the courses offered in the target semester. Class availability information is retrieved during the previous semester from a campus API once the registrar has released the schedule. The system is written with an AngularJS front-end and a Python back-end service which loads the machine learned models written in PyTorch. These models are version controlled on GitHub and refreshed three times per semester after student enrollment status refreshes from the pipeline. The system receives traffic from around 20% of the undergraduate student body, partly from the UC Berkeley Registrar's website.

4 VECTOR MODEL REFINEMENT EXPERIMENTS
In this section, we first introduce our experiment parameters and the ways we validated the representations quantitatively. Then, we describe the various ways in which we refined the models and the results of these refinements.

4.1 Model Evaluations
We trained the models described in Section 2.1 on the student enrollment records data. Specifically, we added the instructor(s) who teach the course and the course department as two input features of courses in the multi-factor course2vec model.

To evaluate course vectors on the course equivalency validation set, we fixed the first course in each pair and ranked all the other courses according to their cosine similarity to the first course in descending order. We then noted the rank of the expected second course in the pair and described the performance of each model on all validation pairs in terms of mean rank, median rank, and recall@10.
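The equivalency evaluation just described - fix the first course of a credit-equivalency pair, rank every other course by cosine similarity, and record where the expected partner lands - can be sketched as below. This is an illustrative reconstruction, not the authors' code, and the course IDs and toy vectors in the usage example are invented.

```python
import numpy as np

def equivalency_metrics(pairs, vectors, k=10):
    """Evaluate course representations on credit-equivalency pairs.

    pairs: list of (course_a, course_b) IDs; course_a is fixed and all
    other courses are ranked by cosine similarity to it.
    vectors: dict mapping course ID -> 1-D numpy array (any
    representation: course2vec, bag-of-words, or a concatenation).
    Returns (mean rank, median rank, recall@k) of course_b.
    """
    ids = list(vectors)
    # Pre-normalize all vectors so dot products are cosine similarities.
    mat = np.stack([vectors[c] / np.linalg.norm(vectors[c]) for c in ids])
    ranks = []
    for a, b in pairs:
        query = vectors[a] / np.linalg.norm(vectors[a])
        sims = mat @ query
        # Rank all candidates except the query course itself.
        order = [ids[i] for i in np.argsort(-sims) if ids[i] != a]
        ranks.append(order.index(b) + 1)  # 1-indexed rank of the partner
    ranks = np.array(ranks)
    return float(ranks.mean()), float(np.median(ranks)), float(np.mean(ranks <= k))

# Toy usage with invented courses: an honors twin should rank first.
vectors = {
    "CHEM 1A": np.array([1.0, 0.0]),
    "CHEM 1AH": np.array([0.9, 0.1]),  # hypothetical honors twin
    "ART 1": np.array([0.0, 1.0]),
}
mean_r, median_r, r_at_10 = equivalency_metrics([("CHEM 1A", "CHEM 1AH")], vectors)
```

The same ranking machinery extends naturally to the analogy test by replacing the query vector with course2 − course1 + course3 and excluding the three given courses from the candidate list.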
For evaluation on the course analogy validation set, we followed the analogy paradigm of: course2 − course1 + course3 ≈ course4. Courses were ranked by their cosine similarity to course2 − course1 + course3. An analogy completion is considered accurate (a hit) if the first ranked course is the expected course4 (excluding the other three from the list). We calculated the average accuracy (recall@1) and the recall@10 over all the analogies in the analogy validation set.

4.2 Course2vec vs. Multi-factor Course2vec
We compared the pure course2vec model with the course representations from the multi-factor course2vec model using instructor, department, and both as factors. Full results of evaluation on the equivalency validation and analogy validation are shown in [15]. The multi-factor model outperformed the pure course2vec model in terms of recall@10 in both validation sets, with the combined instructor and department factor model performing the best.

4.3 Bag-of-words vs. Multi-factor Course2vec
Among the three bag-of-words models, tf-idf performed the best in all equivalency set metrics. The median rank (best=4) and recall@10 (best=0.5647) for the bag-of-words models were also substantially better than those of the best course2vec models, which had a best median rank of 15 with a best recall@10 of 0.4485 for the multi-factor instructor and department model. All course2vec models, however, showed better mean rank performance (best=224) compared with bag-of-words (best=566). This suggests that there are many outliers where literal semantic similarity (bag-of-words) is very poor at identifying equivalent pairs, whereas course2vec has many fewer near worst-case examples. This result is consistent with prior work comparing pure course2vec models to binary bag-of-words [14].

When considering performance on the analogy validation, the roles are reversed, with all course2vec models performing better than the bag-of-words models in both accuracy and recall@10. The difference in recall of bag-of-words compared to course2vec when it comes to analogies is substantial (0.581 vs. 0.8557), a considerably larger difference than between bag-of-words and course2vec on equivalencies (0.5647 vs. 0.4485). Again, the multi-factor instructor and department model and tf-idf were the best models in their respective classes. These analyses establish that bag-of-words models are moderately superior in capturing course similarity, but are highly inferior to enrollment-based course2vec models in the more complex task of analogy completion.

4.4 Combining Bag-of-words and Course2vec Representations
In light of the strong analogy performance of course2vec and the strong equivalency performance of bag-of-words in the previous section, we concatenated the multi-factor course2vec representations with bag-of-words representations. To address the different magnitudes of the two concatenated representations, we created a normalized version of each vector set for comparison to the non-normalized sets.

We found that the normalized concatenation of tf-idf with multi-factor course2vec performed substantially better on the equivalency test than the previous best model in terms of recall@10 (0.6435 vs. 0.5647). While the median rank of the concatenated model only improved by one rank, from 4 to 3, the mean rank improved dramatically (from 566 to 132) and is the best of all models tested in terms of mean rank. Non-normalized vectors did not show improvements over bag-of-words alone in median rank and recall@10. Improvements in the analogy test were more mild, with a recall@10 of 0.8788 for the best concatenated model, combining binary bag-of-words with multi-factor course2vec, compared with 0.8557 for the best course2vec-only model. Normalization in the case of analogies hurt all model performance, the opposite of what was observed in the equivalency test. This suggests that normalization improves local similarity but may act to degrade the more global structure of the vector space.

5 USER STUDY
A user study was conducted to evaluate the quality of recommendations drawn from our different course representations. Users rated each course from each recommendation algorithm along five dimensions of quality. Students were asked to rate course recommendations in terms of their (1) unexpectedness, (2) successfulness / interest in taking the course, (3) novelty, (4) diversity of the results, and (5) identifiable commonality among the results. In Shani and Gunawardana [20], the authors defined serendipity as the combination of "unexpectedness" and "success." In the case of a song recommender, for example, success would be defined as the user listening to the recommendation. In our case, we use a student's expression of interest in taking the course as a proxy for success. The mean of the unexpectedness and successfulness ratings comprises our measure of serendipity. We evaluated three of our developed models, all of which displayed 10 results, only showing one course per department in order to increase diversity (and unexpectedness). The models were (1) the best BOW model (tf-idf), (2) the best Analogy validation model (binary BOW + multi-factor course2vec, normalized), and (3) the best Equivalency validation model (tf-idf + multi-factor course2vec, non-normalized). To measure the impact our department diversification filter would have on serendipity, we added a version of the best Equivalency model that did not impose this filter, allowing multiple courses to be displayed from the same department if they were the most cosine similar to the user's specified favorite course. Our fifth comparison recommendation algorithm was the system's existing collaborative-filtering based Recurrent Neural Network (RNN), which recommends courses based on a prediction of what the student is likely to take next given their personal course history and what other students with a similar history have taken in the past [14]. All five algorithms were integrated into a real-world recommender system for the purpose of this study and evaluated by 70 undergraduates at the University.

5.1 Study Design
Undergraduates were recruited from popular University associated Facebook groups and asked to sign up for a one-hour evaluation session. Since they would need to specify a favorite course they had taken, we restricted participants to those who had been at the University at least one full semester and were currently enrolled. The study was run at the beginning of the Fall semester, while courses
IntRS Workshop, September 2019, Copenhagen, DK                                                                                                   Pardos and Jiang


could still be added and dropped and some students were still shop-
ping for courses. We used a within-subjects design whereby each
volunteer rated ten course recommendations made by each of the
five algorithms. Because of the considerable number of ratings
expected ([3*10+2]*5 = 160) and the importance for students to
carefully consider each recommended course, in-person sessions
were decided on over asynchronous remote sessions in order to
better encourage on-task behavior throughout the session. Student
evaluators were compensated with a $40 gift card to attend one of
four sessions offered across three days with a maximum occupancy
of 25 in each session. A total of 702 students participated.
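The department diversification filter described above can be sketched as follows. This is a minimal illustration, not the deployed implementation: the course records, departments, and embedding vectors are hypothetical stand-ins, and ranking is assumed to be by cosine similarity of course2vec-style vectors to the favorite course.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def recommend(favorite, courses, embeddings, k=10, diversify=True):
    """Rank courses by cosine similarity to the favorite course.

    With diversify=True (the "div" variants), at most one course per
    department is kept; with diversify=False ("non-div"), the top-k
    most similar courses are returned regardless of department.
    """
    fav_vec = embeddings[favorite]
    ranked = sorted(
        (c for c in courses if c["id"] != favorite),
        key=lambda c: cosine(fav_vec, embeddings[c["id"]]),
        reverse=True,
    )
    if not diversify:
        return [c["id"] for c in ranked[:k]]
    seen, out = set(), []
    for c in ranked:
        if c["dept"] in seen:
            continue  # this department is already represented
        seen.add(c["dept"])
        out.append(c["id"])
        if len(out) == k:
            break
    return out
```

The only difference between the div and non-div variants is the one-course-per-department constraint applied while walking down the similarity ranking.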
   We began the session by introducing the evaluation motivation
as a means for students to help inform the choice of algorithm
that we will use for a future campus-wide deployment of a course
exploration tool. Students started the evaluation by visiting a survey
URL that asked them to specify a favorite course they had taken
at the University. This favorite course was used by the first four
algorithms to produce 10 course recommendations each. Each
recommended course's department, course number, title, and full
catalog description were displayed to the student in the interface.
There was a survey page (Figure 2) for each algorithm in which
students were asked to read the recommended course descriptions
carefully and then rate each of the ten courses individually on a
five-point Likert scale of agreement with the following statements:
(1) This course was unexpected (2) I am interested in taking this
course (3) I did not know about this course before. These ratings
respectively measured unexpectedness, successfulness, and novelty.
After rating the individual courses, students were asked to rate
their agreement with the following statements pertaining to the
10 results as a whole: (1) Overall, the course results were diverse
(2) The course results shared something in common with my favorite
course. These ratings measured dimensions of diversity and
commonality. Lastly, students were asked to provide an optional
follow-up open text response to the question, "If you identified
something in common with your favorite course, please explain it
here." On the last page of the survey, students were asked to specify
their major and year and to give optional open response feedback on
their experience. Graduate courses were not included in the
recommendations, and the recommendations were not limited to courses
available in the current semester.

Figure 2: User survey page

5.2    Results
Results of average student ratings of the five algorithms across
the six measurement categories are shown in Table 1. The diversity-based
algorithms, denoted by "(div)," all scored higher than
the non-diversity (non-div) algorithms in unexpectedness, novelty,
diversity, and the primary measure of serendipity. The two
non-diversity-based algorithms, however, both scored higher than the
other three algorithms in successfulness and commonality. All
pairwise differences between diversity and non-diversity algorithms
were statistically significant at the p < 0.001 level after applying
a Bonferroni correction for multiple (60) tests. Within the diversity
algorithms, there were no statistically significant differences except
for BOW scoring higher than Equivalency (div) on unexpectedness
and scoring higher than both Equivalency (div) and Analogy (div)
on novelty. Among the two non-diversity algorithms, there were
no statistically significant differences except for the RNN scoring
higher on diversity and Equivalency (non-div) recommendations
scoring higher on novelty. With respect to measures of serendipity,
the div and non-div algorithms had similar scores among their
respective strengths (3.473-3.619); however, the non-div algorithms
scored substantially lower in their weak category of unexpectedness
(2.091 & 2.184) than did the div algorithms in their weak category of
successfulness (2.851-2.999), resulting in statistically significantly
higher serendipity scores for the div algorithms.
   The most dramatic difference can be seen in the measure of
novelty, where BOW (div) scored 3.896 and the system's existing
RNN (non-div) scored 1.824, the lowest rating in the results matrix.
The proportion of each rating level given to the two algorithms on
this question is shown in Figures 3 and 5. Hypothetically, an
algorithm that recommended randomly selected courses would score
high in both novelty and unexpectedness; it is therefore critical to
also weigh an algorithm's ability to recommend courses that are of
interest to students. Figure 4 shows successfulness ratings for each
of the algorithms aggregated by rank of the course result. The
non-div algorithms, shown with dotted lines, always perform as well
as or better than the div algorithms at every rank. The more steeply
declining slope of the div algorithms depicts the increasing difficulty
of finding courses of interest across different departments. The
tension between recommending courses of interest and courses
that are unexpected is shown in Figure 6, where the best serendipitous
model, BOW (div), recommends a top course of higher
successfulness than unexpectedness, with the two measures intersecting
at rank 2 and diverging afterwards. The best equivalency model,
combining course description tf-idf and course2vec (non-div),
maintains high successfulness but also maintains low unexpectedness
across the 10 course recommendation ranks.

² Due to an authentication bug during the fourth session, twenty participating students
were not able to access the collaborative recommendations of the fifth algorithm (RNN).
RNN results in the subsequent section are therefore based on the 50 students from the
first three sessions. When paired t-tests are conducted between RNN and the ratings of
other algorithms, the tests are between ratings among these 50 students.
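The significance testing described above (paired t-tests with a Bonferroni correction over 60 comparisons) amounts to dividing the family-wise alpha by the number of tests and computing a paired t statistic per comparison. A minimal sketch follows; the rating vectors are illustrative, not the study data:

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    # Paired t statistic: mean of the per-subject differences
    # divided by the standard error of those differences.
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Bonferroni correction: divide the family-wise alpha by the test count.
alpha, n_tests = 0.05, 60
threshold = alpha / n_tests  # 0.05 / 60 ~= 0.00083

# Hypothetical per-student ratings for a div vs. a non-div algorithm.
div_ratings = [4, 5, 3, 4, 4, 5, 3, 4]
nondiv_ratings = [2, 3, 2, 3, 2, 2, 3, 2]
t = paired_t(div_ratings, nondiv_ratings)
```

A comparison is then declared significant only when its p-value falls below the corrected threshold rather than the nominal 0.05.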
IntRS Workshop, September 2019, Copenhagen, DK                                                                                          Pardos and Jiang


    Table 1: Average student ratings of recommendations from the five algorithms across the six measurement categories.

                      algorithm                  unexpectedness   successfulness    serendipity   novelty   diversity   commonality
                      BOW (div)                  3.550            2.904             3.227         3.896     4.229       3.229
                      Analogy (div)              3.473            2.851             3.162         3.310     4.286       2.986
                      Equivalency (div)          3.297            2.999             3.148         3.323     4.214       3.257
                      Equivalency (non-div)      2.091            3.619             2.855         2.559     2.457       4.500
                      RNN (non-div)              2.184            3.566             2.875         1.824     3.160       4.140
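The reported serendipity scores in Table 1 are consistent with serendipity being computed as the mean of unexpectedness and successfulness (e.g., (3.550 + 2.904)/2 = 3.227 for BOW (div)); the paper does not state the formula here, so this is an observation checked against the table rather than a definition. A quick verification over all five rows:

```python
# Table 1 values: (unexpectedness, successfulness, reported serendipity).
rows = {
    "BOW (div)":             (3.550, 2.904, 3.227),
    "Analogy (div)":         (3.473, 2.851, 3.162),
    "Equivalency (div)":     (3.297, 2.999, 3.148),
    "Equivalency (non-div)": (2.091, 3.619, 2.855),
    "RNN (non-div)":         (2.184, 3.566, 2.875),
}
for name, (unexp, succ, ser) in rows.items():
    # Each reported serendipity matches the average to rounding precision.
    assert abs((unexp + succ) / 2 - ser) < 0.005, name
```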




Figure 3: Student novelty rating proportions of course recommendations produced by BOW (div)

Figure 4: Successfulness comparison

Figure 5: Student novelty rating proportions of course recommendations produced by RNN (non-div)

Figure 6: BOW (div) vs. Equivalency (non-div) comparison

   Are more senior students less likely to rate courses as novel or
unexpected, given that they have been at the University longer and
been exposed to more courses? Among our sophomore (27), junior
(22), and senior (21) level students, there were no statistically
significant trends among the six measures, except for a marginally
significant trend (p = 0.007, shy of the p < 0.003 threshold given the
Bonferroni correction) of more senior students rating recommendations
as less unexpected (avg = 2.921) than juniors (avg = 3.024), whose
ratings were not statistically separable from those of sophomores
(avg = 3.073).

5.3    Qualitative Characterization of Algorithms
In this section, we attempt to synthesize qualitative characterizations
of the different algorithms by looking at the open responses
students gave to the question asking them to describe any
commonalities they saw between recommendations made by each
algorithm and their favorite course.

5.3.1 BOW (div). Several students remarked positively about
recommendations matching the themes of "art, philosophy, and
society" or "design" exhibited in their favorite course. The word
"language" was mentioned by 14 of the 61 respondents answering
the open response question. Most of these comments were negative,
pointing out the limitations of similarity matching based solely
on literal course description matching. The most common critique
given in this category was of the foreign spoken language courses
that showed up at the lower ranks when students specified a favorite
course involving programming languages. Other students remarked
on additional dissimilarity when specifying a favorite course related
to cyber security, receiving financial security courses in the results.

5.3.2 Analogy (div). The word "interesting" appeared in seven of
the 54 comments left by students describing commonalities among
recommendations from the analogy-validation optimized algorithm.
This word was not among the top 10 most frequent words in any
of the other four algorithms. Several students identified broad
themes among the courses that matched to their favorite course,
such as "identity" and "societal development." On the other end of
the spectrum, one student remarked that the results "felt weird" and
were only "vaguely relevant." Another student stated that, "the most
interesting suggestion was the Introduction to Embedded Systems
[course] which is just different enough from my favorite course
that it's interesting but not too different that I am not interested,"
which poignantly articulates the crux of the difficulty in striking
a balance between interest and unexpectedness to achieve a
serendipitous recommendation.

5.3.3 Equivalency (div). Many students (seven of the 55) remarked
positively on the commonality of the results with themes of data
exhibited by their favorite course (in most cases STATS C8, an
introductory data science course). They mentioned how the courses
all involved "interacting with data in different social, economic,
and psychological contexts" and "data analysis with different
applications." One student remarked on this algorithm's tendency to
match at or around the main topic of the favorite course, further
remarking that "they were relevant if looking for a class tangentially
related."

5.3.4 Equivalency (non-div). This algorithm was the same as the
above, except that it did not limit results to one course per
department. Because of this lack of a department filter, 15 of the 68
students submitting open text responses to the question of
commonality pointed out that the courses returned were all from the
same department. Since this model scored highest on a validation
task of matching to a credit-equivalent course pair (almost always
in the same department), it is not surprising that students observed
that results from this algorithm tended to all come from the
department of the favorite course, which also put it close to their
nexus of interest.

5.3.5 RNN (non-div). The RNN scored lowest in novelty, significantly
lower than the other non-div algorithm, and was not significantly
different from the other non-div algorithm in successfulness.
In this case, what is the possible utility of the collaborative-based
RNN over the non-div Equivalency model? Many of the 47 (of 50)
student answers to the open response commonality question
remarked that the recommendations related to their major (mentioned
by 21 students) and contained courses that fulfilled a requirement
(mentioned by seven) as the distinguishing signature of this
algorithm. Since the RNN is based on normative next-course enrollment
behavior, it is reasonable that it suggested many courses that satisfy
an unmet requirement. This algorithm's ability to predict student
enrollments accurately became a detriment to some, as seven
remarked that it was recommending courses that they were currently
enrolled in. Due to the institutional data refresh schedule, student
current enrollments are not known until after the add/drop deadline.
This may be a shortcoming that can be rectified in the future.

6    FEATURE RE-DESIGN
As a result of the feedback received from the user study, we worked
with campus to pull down real-time information on student
requirement satisfaction from the Academic Plan Review module of the
PeopleSoft Student Information System. We re-framed the RNN
feature as a "Requirements" satisfying feature that, upon log-in,
shows students their personalized list of unsatisfied requirements
(Figure 8). After selecting a requirement category to satisfy, the
system displays courses which satisfy the selected requirement and
are offered in the target semester. The list of courses is sorted by
the RNN to represent the probability that students like them will
take the class. This provides a signal to the student of what the
normative course-taking behavior is in the context of requirement
satisfaction. For serendipitous suggestions, we created a separate
"Explore" tab (Figure 7) using the BOW (div) model to surface the
top five courses similar across departments, due to its strong
serendipity and novelty ratings. The Equivalency (non-div) model was
used to display an additional five most-similar courses within the
same department. This model was chosen due to its strong
successfulness ratings.

7    DISCUSSION
Surfacing courses that are of interest but not known before means
expanding a student's knowledge and understanding of the
University's offerings. As students are exposed to courses that veer
further from their home department and nexus of interest and
understanding, recommendations become less familiar, with descriptions
that are harder to connect with. This underscores the difficulty of
producing an unexpected but interesting course suggestion, as it
often must represent a recommendation of uncommon wisdom in
order to extend outside of a student's zone of familiarity surrounding
their centers of interest. Big data can be a vehicle for, at times,
reaching that wisdom. Are recommendations useful when they
suggest something expected or already known? Two distinct sets of
responses to this question emerged from student answers to the last
open-ended feedback question. One representative remark stated,

          "The best algorithms were the ones that had more
          diverse options, while still staying true to the core
          function of the class I was searching. The algorithms
          that returned classes that were my major
          requirements/in the same department weren't as
          helpful because I already knew of their existence
          as electives I could be taking"

while a different representative view was expressed with,

          "I think the fifth algorithm [RNN] was the best fit
          for me because my major is pretty standardized"

These two comments make a case for both capabilities being of
importance. They are also a reminder of the desire among young
adults for the socio-technical systems of the university to offer a
balance of information, exploration and, at times, guidance.
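The re-designed "Requirements" flow described above reduces to filtering the catalog by the selected unsatisfied requirement and target-semester availability, then sorting by the RNN's predicted take-probability. A minimal sketch follows; the course records, requirement names, and scores are hypothetical (the deployed system queries PeopleSoft and a trained RNN):

```python
def requirement_suggestions(courses, rnn_prob, requirement, semester):
    """Courses satisfying `requirement` and offered in `semester`,
    sorted by the RNN's predicted probability that similar students
    will take them (highest first)."""
    eligible = [
        c for c in courses
        if requirement in c["satisfies"] and semester in c["offered"]
    ]
    return sorted(eligible, key=lambda c: rnn_prob[c["id"]], reverse=True)

# Hypothetical catalog: requirements satisfied and semesters offered.
catalog = [
    {"id": "HIST 100", "satisfies": {"Arts & Humanities"}, "offered": {"Fall"}},
    {"id": "PHIL 25A", "satisfies": {"Arts & Humanities"}, "offered": {"Fall", "Spring"}},
    {"id": "STAT 20",  "satisfies": {"Quantitative"},      "offered": {"Fall"}},
]
scores = {"HIST 100": 0.12, "PHIL 25A": 0.48, "STAT 20": 0.33}
ranked = requirement_suggestions(catalog, scores, "Arts & Humanities", "Fall")
```

Sorting within a requirement category is what surfaces the normative course-taking signal, while the "Explore" tab handles the serendipitous side separately.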


Figure 7: The "Explore" Interface

Figure 8: The "Requirements" Interface

8    LIMITATIONS
The more distal a course description is, even if conceptually similar,
the less a student may be able to recognize the commonality with
a favorite course. A limitation of our study in demonstrating the
utility of the neural embedding is that students had to rely on the
course description semantics in order to familiarize themselves
with the suggested course and determine if they were interested
in taking it. If a concept was detected by the neural embedding
but not by the BOW model, this likely meant that the concept was
difficult to pick up from the course description displayed to students.
Past work has shown that users evaluate media recommendations
less favorably before they take the recommendation than after when
important aspects of the recommended content are not described in
the recommendation [11]. Future work could augment recommended
course descriptions with additional information, including latent
semantics inferred from enrollments [5] or additional semantics
retrieved from available course syllabi.

ACKNOWLEDGMENTS
This work was partly supported by the United States National
Science Foundation (1547055/1446641) and the National Natural
Science Foundation of China (71772101/71490724).

REFERENCES
 [1] Sorathan Chaturapruek, Thomas Dee, Ramesh Johari, René Kizilcec, and Mitchell
     Stevens. 2018. How a data-driven course planning tool affects college students'
     GPA: evidence from two field experiments. (2018).
 [2] Hung-Hsuan Chen. 2018. Behavior2Vec: Generating Distributed Representations
     of Users' Behaviors on Products for Recommender Systems. ACM Transactions
     on Knowledge Discovery from Data (TKDD) 12, 4 (2018), 43.
 [3] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008.
     Introduction to Information Retrieval. Cambridge University Press.
 [4] Martin Dillon. 1983. Introduction to modern information retrieval: G. Salton and
     M. McGill. McGraw-Hill, New York (1983). 448 pp., ISBN 0-07-054484-0.
 [5] Matt Dong, Run Yu, and Zach A. Pardos. In press. Design and Deployment of a
     Better Course Search Tool: Inferring latent keywords from enrollment networks.
     In Proceedings of the 14th European Conference on Technology Enhanced Learning.
     Springer.
 [6] Rosta Farzan and Peter Brusilovsky. 2011. Encouraging user participation in a
     course recommender system: An impact on user behavior. Computers in Human
     Behavior 27, 1 (2011), 276–284.
 [7] Li Fei-Fei, Rob Fergus, and Pietro Perona. 2006. One-shot learning of object
     categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 4
     (2006), 594–611.
 [8] Balázs Hidasi, Massimo Quadrana, Alexandros Karatzoglou, and Domonkos
     Tikk. 2016. Parallel recurrent neural network architectures for feature-rich
     session-based recommendations. In Proceedings of the 10th ACM Conference on
     Recommender Systems. ACM, 241–248.
 [9] Weijie Jiang, Zachary A. Pardos, and Qiang Wei. 2019. Goal-based course
     recommendation. In Proceedings of the 9th International Conference on Learning
     Analytics & Knowledge. ACM, 36–45.
[10] Judy Kay. 2000. Stereotypes, student models and scrutability. In International
     Conference on Intelligent Tutoring Systems. Springer, 19–30.
[11] Benedikt Loepp, Tim Donkers, Timm Kleemann, and Jürgen Ziegler. 2018. Impact
     of item consumption on assessment of recommendations in user studies. In
     Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 49–53.
[12] Tien T. Nguyen, Pik-Mai Hui, F. Maxwell Harper, Loren Terveen, and Joseph A.
     Konstan. 2014. Exploring the filter bubble: the effect of using recommender
     systems on content diversity. In Proceedings of the 23rd International Conference
     on World Wide Web. ACM, 677–686.
[13] Aditya Parameswaran, Petros Venetis, and Hector Garcia-Molina. 2011.
     Recommendation systems with complex constraints: A course recommendation
     perspective. ACM Transactions on Information Systems (TOIS) 29, 4 (2011), 20.
[14] Zachary A. Pardos, Zihao Fan, and Weijie Jiang. 2019. Connectionist
     recommendation in the wild: on the utility and scrutability of neural networks for
     personalized course guidance. User Modeling and User-Adapted Interaction 29, 2
     (2019), 487–525.
[15] Zachary A. Pardos and Weijie Jiang. 2019. Combating the Filter Bubble: Designing
     for Serendipity in a University Course Recommendation System. arXiv preprint
     arXiv:1907.01591 (2019).
[16] Zachary A. Pardos and Andrew Joo Hun Nam. 2018. A Map of Knowledge. CoRR
     preprint abs/1811.07974 (2018). https://arxiv.org/abs/1811.07974
[17] Agoritsa Polyzou, Athanasios N. Nikolakopoulos, and George Karypis. 2019.
     Scholars Walk: A Markov Chain Framework for Course Recommendation. In
     Proceedings of the 12th International Conference on Educational Data Mining.
     396–401.
[18] Zhiyun Ren, Xia Ning, Andrew S. Lan, and Huzefa Rangwala. 2019. Grade
     Prediction Based on Cumulative Knowledge and Co-taken Courses. In Proceedings
     of the 12th International Conference on Educational Data Mining. 158–167.
[19] Leonardo F. R. Ribeiro, Pedro H. P. Saverese, and Daniel R. Figueiredo. 2017.
     struc2vec: Learning node representations from structural identity. In Proceedings
     of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and
     Data Mining. ACM, 385–394.
[20] Guy Shani and Asela Gunawardana. 2011. Evaluating recommendation systems.
     In Recommender Systems Handbook. Springer, 257–297.
[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
     Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all
     you need. In Advances in Neural Information Processing Systems. 5998–6008.
[22] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. 2016.
     Matching networks for one shot learning. In Advances in Neural Information
     Processing Systems. 3630–3638.
[23] Yuan Cao Zhang, Diarmuid Ó Séaghdha, Daniele Quercia, and Tamas Jambor.
     2012. Auralist: introducing serendipity into music recommendation. In Proceedings
     of the Fifth ACM International Conference on Web Search and Data Mining. ACM,
     13–22.