<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Designing for Serendipity in a University Course Recommendation System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zachary Pardos</string-name>
          <email>zp@berkeley.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Weijie Jiang</string-name>
          <email>jiangwj.14@sem.tsinghua.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Tsinghua University &amp; University of California</institution>
          ,
          <addr-line>Berkeley</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of California</institution>
          ,
          <addr-line>Berkeley</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>Collaborative filtering based algorithms, including Recurrent Neural Networks (RNN), tend towards predicting a perpetuation of past observed behavior. In a recommendation context, this can lead to an overly narrow set of suggestions lacking in serendipity and inadvertently placing the user in what is known as a "filter bubble." In this paper, we grapple with the issue of the filter bubble in the context of a course recommendation system in production at a public university. Our approach is to present course results that are novel or unexpected to the student but still relevant to their interests. We build one set of models based on the course catalog description (BOW) and another set informed by enrollment histories (course2vec). We compare the performance of these models on off-line validation sets and against the system's existing RNN-based recommendation engine in a user study of undergraduates (N = 70) who rated their course recommendations along six characteristics related to serendipity. Results of the user study show a dramatic lack of novelty in RNN recommendations and depict the characteristic trade-offs that make serendipity difficult to achieve. While the machine-learned course2vec models performed best on concept generalization tasks (i.e., course analogies), it was the simple bag-of-words based recommendations that students rated as more serendipitous. We discuss the role of the recommendation interface and the information presented therein in the student's decision to accept a recommendation from either algorithm.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Applied computing → Education; • Information systems → Recommender systems.</p>
      <p>Keywords: Higher education, course guidance, filter bubble, neural networks</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>
        Among the institutional values of a liberal arts university is to
expose students to a variety of perspectives expressed in courses
across its various physical and intellectual schools of thought.
Collaborative filtering based sequence prediction methods, in this
environment, can provide personalized course recommendations based
on temporal models of normative behavior [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] but are not well
suited for surfacing courses a student may find interesting but
which have been relatively unexplored by those with similar course
selections to them in the past. Therefore, a more diversity-oriented
model can serve as an appropriate complement to recommendations
made by collaborative filtering based methods. This problem of training
on the past without necessarily repeating it is an open problem
in many collaborative filtering based recommendation contexts,
particularly social networks, where, in the degenerate cases, users
can get caught in "filter bubbles," or model-based user stereotypes,
leading to a narrowing of item recommendation variety [
        <xref ref-type="bibr" rid="ref10 ref12 ref23">10, 12, 23</xref>
        ].
      </p>
      <p>
        To counteract the filter bubble, we introduce a course2vec
variant into a production recommender system at a public university
designed to surface serendipitous course suggestions. Course2vec
applies a skip-gram to course enrollment histories, instead of
natural language, in order to learn representations. We use the
definition of serendipity as user perceived unexpectedness of result
combined with successfulness [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], which we define as a course
recommendation a student expresses interest in taking. At many
universities, conceptually similar courses exist across departments
but use widely differing disciplinary vernacular in their catalog
descriptions, making them difficult for learners to search for and
to realize their commonality. We propose that by tuning a vector
representation of courses learned from nine years of enrollment
sequences, we can capture enough implicit semantics of the courses
to construe similarity more abstractly and accurately. To encourage
the embedding to learn features that may generalize across
departments, our skip-gram variant simultaneously learns department
(and instructor) embeddings. While more advanced attention-based
text generation architectures exist [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], we demonstrate that
properties of the linear vector space produced by "shallow" networks are
of utility to this recommendation task. Our recommendations are
made with only a single explicit course preference given by the user,
as opposed to the entire course selection history needed by
session-based Recurrent Neural Network approaches [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Single example,
also known as "one-shot," generalization is common in the vision
community, which has pioneered approaches to extrapolating a
category from a single labeled example [
        <xref ref-type="bibr" rid="ref22 ref7">7, 22</xref>
        ]. Other related work
applying skip-grams to non-linguistic data includes node
embeddings learned from sequences of random walks of graphs [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and
product embeddings learned from e-commerce clickstreams [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Our
work, methodologically, adds rigor to this approach by tuning the
model against validation sets created from institutional knowledge
and curated by the university.
      </p>
      <p>We conduct a user study (N = 70) of undergraduates at the
University to evaluate their personalized course recommendations made
by our models designed for serendipity and by the RNN-based
engine, which previously drove recommendations in the system.
The findings underscore the tension between unexpectedness and
successfulness and show the deficiency of RNNs for producing
novel recommendations. While our course2vec based model scored
68% above bag-of-words in accuracy on one of our course analogy
validation sets, simple bag-of-words scored slightly higher in the
main objective of user perceived serendipity. A potential reason
for this discrepancy is the nature of information presented to
students in the recommender system interface. Catalog descriptions
of recommended courses were shown to students, which served
as the only source of information they could consult in deciding if
they wanted to take the course. A generated explanation, or a
prioritization of the course2vec recommendations in the interface, may
be required to overcome the advantage of the bag-of-words model,
which draws on the same information shown to students in the
recommendation interface.</p>
      <p>
        Recommender systems in higher education contexts have
recently focused on prediction of which courses a student will take
[
        <xref ref-type="bibr" rid="ref14 ref17">14, 17</xref>
        ] or the grade they will receive if enrolled [
        <xref ref-type="bibr" rid="ref18 ref9">9, 18</xref>
        ]. At Stanford,
a system called "CARTA" allows students to see grade distributions,
course evaluations, and the most common courses taken before a
course of interest [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. At UC Berkeley, the recommender system
being modified in this study serves students next-semester course
considerations based on their personal course enrollment history
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Earlier systems included a focus on requirement satisfaction
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and career-based relevancy recommendation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. No system
has yet focused on serendipitous or novel course discovery.
      </p>
    </sec>
    <sec id="sec-3">
      <title>MODELS AND METHODOLOGY</title>
      <p>
        This section introduces three competing models used to generate
our representations. The first model uses course2vec [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to learn
course representations from enrollment sequences. Our second
model is a variant on course2vec, which learns representations of
explicitly defined features of a course (e.g., instructor or department)
in addition to the course representation. The intuition behind this
approach is that the course representation could have, conflated
in it, the influence of the multiple-instructors that have taught
the course over time. We contend that this "deconflation" may
increase the fidelity of the course representation and serve as a
more accurate representation of the topical essence of the course.
The last representation model is a standard bag-of-words vector,
constructed for each course strictly from its catalog description.
Finally, we explore concatenating a course’s course2vec and
bag-of-words representation vectors.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Course2vec</title>
      <p>The course2vec model involves learning distributed representations
of courses from students’ enrollment records throughout semesters
by using a notion of an enrollment sequence as a "sentence" and
courses within the sequence as "words", borrowing terminology
from the linguistic domain. For each student s, a chronological
course enrollment sequence is produced by first sorting by
semester then randomly serializing within-semester course order. Then,
each course enrollment sequence is used in training, similar to a
document in a classical skip-gram application.</p>
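The sequence construction described above can be sketched in Python; the record layout and course IDs below are hypothetical, and the within-semester shuffle mirrors the random serialization step (this is a sketch, not the production pipeline):

```python
import random

def build_enrollment_sequences(records, seed=0):
    """Build one chronological course "sentence" per student.

    `records` maps a student ID to (semester_index, course_id) tuples
    (hypothetical layout). Courses are sorted by semester; order within
    a semester is randomized, since within-term order carries no meaning.
    """
    rng = random.Random(seed)
    sequences = []
    for student, enrollments in records.items():
        by_semester = {}
        for semester, course in enrollments:
            by_semester.setdefault(semester, []).append(course)
        sequence = []
        for semester in sorted(by_semester):
            courses = by_semester[semester]
            rng.shuffle(courses)  # random within-semester serialization
            sequence.extend(courses)
        sequences.append(sequence)
    return sequences

records = {
    "s1": [(2, "CS61B"), (1, "CS61A"), (1, "MATH1A")],
    "s2": [(1, "MATH1A"), (2, "MATH1B")],
}
sequences = build_enrollment_sequences(records)
```

Each resulting sequence can then be fed to any skip-gram implementation as a training "sentence".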
      <p>
        The training objective of the skip-gram model is to find word
representations that are useful for predicting the surrounding words
in a sentence or a document. Each word in the corpus is used as an
input to a log-linear classifier with continuous projection layer, to
predict words within a certain range before and after the current
word. Therefore, the skip-gram model can also be viewed as a
classifier with input as a target course and output as a context
course. In this section, we consider adding features of a course to
the input to enhance the classifier and its representations, as shown
in Figure 1. Each course is taught by one or several instructors
over the years and is associated with an academic department.
The multi-factor course2vec model learns both course and course
feature representations by maximizing the objective function over
all the students’ enrollment sequences and the features of courses.
Full technical details can be found in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>In language models, two word vectors will be cosine similar if
they share similar sentence contexts. Likewise, in the university
domain, courses that share similar co-enrollments, and similar
previous and next semester enrollments, will likely be close to one
another in the vector space.
</p>
    </sec>
    <sec id="sec-5">
      <title>Bag-of-Words</title>
      <p>
        A simple but indelible approach to item representation has been to
create a vector, the length of the number of unique words across all
items, with a non-zero value if the word in the vocabulary appears
in it. Only unigram words are used to create this unordered vector
list of words used to represent the document [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
The basic methodology based on bag-of-words proposed by IR
researchers for text corpora - a methodology successfully deployed
in modern Internet search engines - reduces each document in the
corpus to a vector of real numbers, each of which represents a term
weight. The term weight might be:
• a term frequency value indicating how many times the term
occurred in the document;
• a binary value, with 1 indicating that the term occurred in
the document and 0 indicating that it did not;
• a tf-idf score [
        <xref ref-type="bibr" rid="ref4">4</xref>
], the product of term frequency and inverse
document frequency, which increases proportionally to the
number of times a word appears in the document and is
offset by the frequency of the word in the corpus, helping
to adjust for the fact that some words appear more frequently
in general.
      </p>
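The three term-weighting schemes can be sketched on a toy corpus of tokenized descriptions (an illustrative sketch; the deployed system's exact weighting may differ):

```python
import math
from collections import Counter

def term_weights(docs, scheme="tfidf"):
    """Compute per-document term-weight vectors for the three schemes.

    `docs` is a list of tokenized documents. Returns one {term: weight}
    dict per document. The tf-idf here uses raw counts times log(N/df),
    one common variant among several.
    """
    n = len(docs)
    df = Counter()  # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        if scheme == "tf":
            vec = dict(tf)
        elif scheme == "binary":
            vec = {t: 1 for t in tf}
        else:  # tf-idf: term frequency times inverse document frequency
            vec = {t: c * math.log(n / df[t]) for t, c in tf.items()}
        vectors.append(vec)
    return vectors

docs = [["matrix", "algebra", "proof"],
        ["matrix", "computation"],
        ["proof", "logic"]]
tfidf = term_weights(docs)
```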
      <p>We evaluate all three variants in our quantitative validation testing.
</p>
    </sec>
    <sec id="sec-6">
      <title>Surfacing Serendipitous Recommendations from Course Representations</title>
      <p>We surface recommendations intended to be interesting but
unexpected by finding the course c∗j in each department dj that is most
similar to a student’s favorite course ci, diversifying the results by
allowing only one result per department:
c∗j = arg max_{c : d(c) = dj} cos(c, ci)    (1)
where d(c) denotes the department of course c. The per-department
courses c∗j from all the other departments are then ranked
according to cos(c∗j, ci), where j = 1, 2, ..., k. We can apply both
neural representations and bag-of-words representations of courses in
this method to generate the most similar course in each
department.</p>
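The per-department arg max and ranking step can be sketched as follows (toy two-dimensional vectors and hypothetical course IDs):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def diversified_recommendations(favorite, courses, k=10):
    """Return up to k courses, at most one per department, ranked by
    cosine similarity to the student's favorite-course vector.

    `courses` maps course_id -> (department, vector); `favorite` is the
    vector of the student's stated favorite course.
    """
    best = {}  # department -> (similarity, course_id)
    for cid, (dept, vec) in courses.items():
        sim = cosine(favorite, vec)
        if dept not in best or sim > best[dept][0]:
            best[dept] = (sim, cid)  # per-department arg max
    ranked = sorted(best.values(), reverse=True)  # rank the winners
    return [cid for _, cid in ranked[:k]]

courses = {
    "STAT134": ("STAT", [1.0, 0.1]),
    "STAT135": ("STAT", [0.9, 0.0]),
    "CS188":   ("CS",   [0.8, 0.6]),
}
recs = diversified_recommendations([1.0, 0.0], courses)
```

The same function works with either bag-of-words or course2vec vectors, since only cosine similarity is consumed.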
    </sec>
    <sec id="sec-7">
      <title>EXPERIMENTAL ENVIRONMENTS</title>
    </sec>
    <sec id="sec-8">
      <title>Off-line Dataset</title>
      <p>We used a dataset containing anonymized student course
enrollments at UC Berkeley from Fall 2008 through Fall 2017. The dataset
consists of per-semester course enrollment records for 164,196
students (both undergraduates and graduates) with a total of 4.8 million
enrollments. A course enrollment record means that the student
was still enrolled in the course at the end of the semester. Students at
this university, during this period, were allowed to drop courses up
until close to the end of the semester without penalty. The median
course load during students’ active semesters was four. There were
9,478 unique lecture courses from 214 departments1 hosted in 17
different Divisions of 6 different Colleges. Course meta-information
contains course number, department name, total enrollment and
max capacity. In this paper, we only consider lecture courses with
at least 20 enrollments total over the 9-year period, leaving 7,487
courses. Although courses can be categorized as undergraduate
courses and graduate courses, undergraduates are permitted to
enroll in many graduate courses regardless of their status.</p>
      <p>Enrollment data were sourced from the campus enterprise data
warehouse with course descriptions sourced from the official
campus course catalog API. We pre-processed the course description
data in the following steps: (1) removing generic, often-seen
sentences across descriptions, (2) removing stop words, (3) removing
punctuation, (4) word lemmatization and stemming, and finally
tokenizing the bag-of-words in each course description. We then
compiled the term frequency vector, binary value vector, and tf-idf
vector for each course.
1At UC Berkeley, the smallest academic unit is called a "subject." For the purpose of
communicability, we instead refer to subjects as departments.
3.1.1 Semantic Validation Sets. In order to quantitatively evaluate
how accurate the vector models are, a source of ground truth on
the relationships between courses needed to be brought to bear to
see the degree to which the vector representations encoded this
information. We used two such sources of ground truth to serve as
validation sets, one providing information on similarity, the other
on a variety of semantic relationships.</p>
      <p>We trained the models described in Section 2.1 on the student
enrollment records data. Specifically, we added the instructor(s)
who teach the course and the course department as two input
features of courses in the multi-factor course2vec model.</p>
      <p>
        • Equivalency validation set: A set of 1,351 course credit-equivalency
pairs maintained by the Office of the Registrar were used for
similarity based ground truth. A course is paired with
another course in this set if a student can only receive credit for
taking one of the courses. For example, an honors and
non-honors version of a course will appear as a pair because
faculty have deemed that there is too much overlapping
material between the two for a student to receive credit for
both.
• Analogy validation set: The standard method for validating learned
word vectors has been to use analogies to test the degree to which the
embedding structure contains semantic and syntactic relationships
constructed from prior knowledge. In the domain of university
courses, we use course relationship pairs constructed from prior
work using first-hand knowledge of the courses [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The 77
relationship pairs were in five categories: online, honors, mathematical
rigor, 2-department topics, and 3-department topics. An example of
an "online" course pair would be Engineering 7 and its online
counterpart, Engineering W7, or Education 161 and W161. An analogy
involving two of these pairs could be calculated as: Engineering W7 −
Engineering 7 + Education 161 ≈ Education W161.
      </p>
      <p>To evaluate course vectors on the course equivalency validation
set, we fixed the first course in each pair and ranked all the other
courses according to their cosine similarity to the first course in
descending order. We then noted the rank of the expected second
course in the pair and described the performance of each model
on all validation pairs in terms of mean rank, median rank, and
recall@10.
      </p>
    </sec>
    <sec id="sec-9">
      <title>Online Environment (System Overview)</title>
      <p>The production recommender system at UC Berkeley uses a
student data pipeline with the enterprise data warehouse to keep
up-to-date enrollment histories of students. Upon CAS login, these
histories are associated with the student and passed through an
RNN model, which cross-references the output recommendations
with the courses offered in the target semester. Class availability
information is retrieved during the previous semester from a
campus API once the registrar has released the schedule. The system is
written with an AngularJS front-end and a Python back-end service
which loads the machine-learned models written in PyTorch. These
models are version controlled on GitHub and refreshed three times
per semester after student enrollment status refreshes from the
pipeline. The system receives traffic from around 20% of the
undergraduate student body, partly from the UC Berkeley Registrar’s
website.</p>
    </sec>
    <sec id="sec-10">
      <title>VECTOR MODEL REFINEMENT EXPERIMENTS</title>
      <p>In this section, we first introduce our experiment parameters and
the ways we validated the representations quantitatively. Then, we
describe the various ways in which we refined the models and the
results of these refinements.</p>
    </sec>
    <sec id="sec-12">
      <title>Model Evaluations</title>
      <p>For evaluation of the course analogy validation set, we followed
the analogy paradigm of: course2 − course1 + course3 ≈ course4.
Courses were ranked by their cosine similarity to course2−course1+
course3. An analogy completion is considered accurate (a hit) if
the first ranked course is the expected course4 (excluding the other
three from the list). We calculated the average accuracy (recall@1)
and the recall@10 over all the analogies in the analogy validation
set.
</p>
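The analogy scoring procedure above can be sketched as follows (toy two-dimensional vectors with hypothetical IDs; the real course vectors are learned embeddings):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def analogy_hits(analogies, vectors, topn=10):
    """Score analogy completions course2 - course1 + course3 ~ course4.

    `analogies` holds (course1, course2, course3, course4) ID tuples and
    `vectors` maps IDs to embeddings. Returns (recall@1, recall@topn);
    the three query courses are excluded from the ranking.
    """
    hit1 = hitn = 0
    for c1, c2, c3, c4 in analogies:
        # query vector: course2 - course1 + course3
        query = [b - a + c for a, b, c in
                 zip(vectors[c1], vectors[c2], vectors[c3])]
        ranked = sorted(
            (cid for cid in vectors if cid not in (c1, c2, c3)),
            key=lambda cid: cosine(query, vectors[cid]),
            reverse=True,
        )
        hit1 += ranked[0] == c4          # accurate only if top-ranked
        hitn += c4 in ranked[:topn]      # recall@topn
    return hit1 / len(analogies), hitn / len(analogies)

vectors = {
    "ENG7": [1.0, 0.0], "ENGW7": [1.0, 1.0],
    "EDU161": [2.0, 0.0], "EDUW161": [2.0, 1.0],
    "OTHER": [5.0, -3.0],
}
r1, r10 = analogy_hits([("ENG7", "ENGW7", "EDU161", "EDUW161")], vectors)
```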
    </sec>
    <sec id="sec-13">
      <title>Course2vec vs. Multi-factor Course2vec</title>
      <p>
        We compared the pure course2vec model with the course
representations from the multi-factor course2vec model using instructor,
department, and both as factors. Full results of evaluation on the
equivalency validation and analogy validation are shown in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
The multi-factor model outperformed the pure course2vec model
in terms of recall@10 in both validation sets, with the combined
instructor and department factor model performing the best.
      </p>
    </sec>
    <sec id="sec-14">
      <title>Bag-of-words vs. Multi-factor Course2vec</title>
      <p>
        Among the three bag-of-words models, tf-idf performed the best in
all equivalency set metrics. The median rank (best=4) and recall@10
(best=0.5647) for the bag-of-words models were also substantially
better than the best course2vec models, which had a best median
rank of 15 with best recall@10 of 0.4485 for the multi-factor
instructor and department model. All course2vec models, however,
showed better mean rank performance (best=224) compared with
bag-of-words (best=566). This suggests that there are many outliers
where literal semantic similarity (bag-of-words) is very poor at
identifying equivalent pairs, whereas course2vec has far fewer
near worst-case examples. This result is consistent with prior work
comparing pure course2vec models to binary bag-of-words [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>When considering performance on the analogy validation, the
roles are reversed, with all course2vec models performing better
than the bag-of-words models in both accuracy and recall@10. The
difference in recall of bag-of-words compared to course2vec when
it comes to analogies is substantial (0.581 vs 0.8557), a considerably
larger difference than between bag-of-words and course2vec on
equivalencies (0.5647 vs 0.4485). Again, the multi-factor instructor
and department model and tf-idf were the best models in their
respective class. These analyses establish that bag-of-words models
are moderately superior in capturing course similarity, but are
highly inferior to enrollment-based course2vec models in the more
complex task of analogy completion.
</p>
    </sec>
    <sec id="sec-15">
      <title>Combining Bag-of-words and Course2vec Representations</title>
      <p>In light of the strong analogy performance of course2vec and the strong
equivalency performance of bag-of-words in the previous section,
we concatenated the multi-factor course2vec representations with
bag-of-words representations. To address the different magnitudes
in the vectors between the two concatenated representations, we
create a normalized version of each vector set for comparison to
non-normalized sets.</p>
      <p>We found that the normalized concatenation of tf-idf with
multi-factor course2vec performed substantially better on the equivalency
test than the previous best model in terms of recall@10 (0.6435 vs.
0.5647). While the median rank of the concatenated model only
improved one rank, from 4 to 3, the mean rank improved dramatically
(from 566 to 132), and is the best of all models tested in terms of
mean rank. Non-normalized vectors did not show improvements
over bag-of-words alone in median rank and recall@10.
Improvements in the analogy test were milder, with a recall@10 of
0.8788 for the best concatenated model, combining binary
bag-of-words with multi-factor course2vec, compared with 0.8557 for the
best course2vec-only model. Normalization in the case of analogies
hurt all model performance, the opposite of what was observed in
the equivalency test. This suggests that normalization improves
local similarity but may act to degrade the more global structure of
the vector space.
</p>
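The concatenation-with-normalization scheme compared above can be sketched as (toy vectors; the real inputs are the tf-idf and course2vec vectors of a course):

```python
import math

def concat_representations(bow, c2v, normalize=True):
    """Concatenate a course's bag-of-words and course2vec vectors.

    When `normalize` is set, each vector is scaled to unit L2 norm first,
    so that neither representation dominates similarity computations by
    sheer magnitude.
    """
    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v] if n else list(v)
    if normalize:
        bow, c2v = unit(bow), unit(c2v)
    return list(bow) + list(c2v)

combined = concat_representations([3.0, 4.0], [0.0, 2.0])
```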
    </sec>
    <sec id="sec-17">
      <title>USER STUDY</title>
      <p>
        A user study was conducted to evaluate the quality of
recommendations drawn from our diferent course representations. Users
rated each course from each recommendation algorithm along five
dimensions of quality. Students were asked to rate course
recommendations in terms of their (1) unexpectedness, (2) successfulness
/ interest in taking the course, (3) novelty, (4) diversity of the results,
and (5) identifiable commonality among the results. In Shani and
Gunawardana [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], authors defined serendipity as the combination
of "unexpectedness" and "success." In the case of a song
recommender, for example, success would be defined as the user listening
to the recommendation. In our case, we use a student’s expression
of interest in taking the course as a proxy for success. The mean
of their unexpectedness and successfulness rating will comprise
our measure of serendipity. We evaluated three of our developed
models, all of which displayed 10 results, only showing one course
per department in order to increase diversity (and
unexpectedness). The models were (1) the best BOW model (tf-idf), (2) the best
Analogy validation model (binary BOW + multi-factor course2vec
normalized), (3) and the best Equivalency validation model (tf-idf +
multi-factor course2vec non-normalized). To measure the impact
our department diversification filter would have on serendipity,
we added a version of the best Equivalency model that did not
impose this filter, allowing multiple courses to be displayed from
the same department if they were the most cosine similar to the
user’s specified favorite course. Our fifth comparison
recommendation algorithm was the system’s existing collaborative-filtering
based Recurrent Neural Network (RNN) that recommends courses
based on a prediction of what the student is likely to take next
given their personal course history and what other students with a
similar history have taken in the past [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. All five algorithms were
integrated into a real-world recommender system for the purpose
of this study and evaluated by 70 undergraduates at the University.
      </p>
    </sec>
    <sec id="sec-18">
      <title>Study Design</title>
      <p>Undergraduates were recruited from popular University associated
Facebook groups and asked to sign-up for a one hour evaluation
session. Since they would need to specify a favorite course they had
taken, we restricted participants to those who had been at the
University at least one full semester and were currently enrolled. The
study was run at the beginning of the Fall semester, while courses
could still be added and dropped and some students were still
shopping for courses. We used a within-subjects design whereby each
volunteer rated ten course recommendations made by each of the
five algorithms. Because of the considerable number of ratings
expected ([3*10+2]*5 = 160) and the importance for students to
carefully consider each recommended course, in-person sessions
were decided on over asynchronous remote sessions in order to
better encourage on-task behavior throughout the session. Student
evaluators were compensated with a $40 gift card to attend one of
four sessions offered across three days with a maximum occupancy
of 25 in each session. A total of 70 students participated.</p>
      <p>We began the session by introducing the evaluation motivation
as a means for students to help inform the choice of algorithm
that we will use for a future campus-wide deployment of a course
exploration tool. Students started the evaluation by visiting a
survey URL that asked them to specify a favorite course they had
taken at the University. This favorite course was used by the first
four algorithms to produce 10 course recommendations each. Each
recommended course’s department, course number, title, and full
catalog description were displayed to the student in the interface.
There was a survey page (Figure 2) for each algorithm in which
students were asked to read the recommended course descriptions
carefully and then rate each of the ten courses individually on a
five-point Likert scale agreement with the following statements:
(1) This course was unexpected (2) I am interested in taking this
course (3) I did not know about this course before. These ratings
respectively measured unexpectedness, successfulness, and novelty.
After rating the individual courses, students were asked to rate
their agreement with the following statements pertaining to the
10 results as a whole: (1) Overall, the course results were diverse
(2) The course results shared something in common with my
favorite course. These ratings measured dimensions of diversity and
commonality. Lastly, students were asked to provide an optional
follow-up open text response to the question, "If you identified
something in common with your favorite course, please explain it
here." On the last page of the survey, students were asked to specify
their major, year, and to give optional open response feedback on
their experience. Graduate courses were not included in the
recommendations and the recommendations were not limited to courses
available in the current semester.
</p>
    </sec>
    <sec id="sec-19">
      <title>Results</title>
      <p>Results of average student ratings of the five algorithms across
the six measurement categories are shown in Table 1. The
diversity based algorithms, denoted by "(div)," all scored higher than
the non-diversity (non-div) algorithms in unexpectedness, novelty,
diversity, and the primary measure of serendipity. The two
non-diversity based algorithms, however, both scored higher than the
other three algorithms in successfulness and commonality. All
pairwise differences between diversity and non-diversity algorithms
were statistically significant, using the p &lt; 0.001 level after applying
a Bonferroni correction for multiple (60) tests. Within the diversity
2Due to an authentication bug during the fourth session, twenty participating students
were not able to access the collaborative recommendations of the fifth algorithm. RNN
results in the subsequent section are therefore based on the 50 students from the first
three sessions. When paired t-tests are conducted between RNN and the ratings of
other algorithms, the tests are between ratings among these 50 students.
algorithms, there were no statistically significant differences except
for BOW scoring higher than Equivalency (div) on unexpectedness
and scoring higher than both Equivalency (div) and Analogy (div)
on novelty. Among the two non-diversity algorithms, there were
no statistically significant differences except for the RNN scoring
higher on diversity and Equivalency (non-div) recommendations
scoring higher on novelty. With respect to measures of serendipity,
the div and non-div algorithms had similar scores among their
respective strengths (3.473-3.619); however, the non-div algorithms
scored substantially lower in their weak category of unexpectedness
(2.091 &amp; 2.184) than did the div algorithms in their weak category of
successfulness (2.851-2.999), resulting in statistically significantly
higher serendipity scores for the div algorithms.</p>
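The significance testing described above (paired t-tests with a Bonferroni correction over 60 comparisons) can be sketched as follows. The ratings here are synthetic stand-ins for the study data, and the 0.05 base alpha is an assumption; only the test structure mirrors the paper.

```python
import math
import random
from statistics import mean, stdev

random.seed(0)
n_students = 50   # students who rated both algorithms (first three sessions)
n_tests = 60      # total pairwise comparisons in the study design

# Synthetic 1-5 Likert ratings standing in for the real study data.
div_ratings = [random.randint(2, 5) for _ in range(n_students)]
non_div_ratings = [max(1, r - random.randint(0, 2)) for r in div_ratings]

# Paired t statistic: mean per-student rating difference divided by its
# standard error (each student rates both algorithms, so tests are paired).
diffs = [d - n for d, n in zip(div_ratings, non_div_ratings)]
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n_students))

# Bonferroni correction: divide the base alpha by the number of tests.
alpha = 0.05
bonferroni_threshold = alpha / n_tests  # ~0.00083, in line with the 0.001 level
```

A t statistic exceeding the critical value at the corrected threshold would then count as significant, as in the pairwise comparisons reported in Table 1.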
      <p>The most dramatic diference can be seen in the measure of
novelty, where BOW (div) scored 3.896 and the system’s existing RNN
(non-div) scored 1.824, the lowest rating in the results matrix. The
proportion of each rating level given to the two algorithms on this
question is shown in Figures 3 and 5. Hypothetically, an algorithm
that recommended randomly selected courses would score high
in both novelty and unexpectedness; it is therefore critical to also
weigh an algorithm's ability to recommend courses that are of interest
to students. Figure 4 shows successfulness ratings for each of the
algorithms aggregated by rank of the course result. The non-div
algorithms, shown with dotted lines, always perform as well or
better than the div algorithms at every rank. The more steeply
declining slope of the div algorithms depicts the increasing difficulty
of finding courses of interest across different departments. The
tension between recommending courses that are of interest and courses
that are unexpected is shown in Figure 6, where the most serendipitous
model, BOW (div), recommends a top course of higher
successfulness than unexpectedness, with the two measures intersecting
at rank 2 and diverging afterwards. The best equivalency model,
combining course description tf-idf and course2vec (non-div),
maintains high successfulness but low unexpectedness
across the 10 course recommendation ranks.</p>
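The best equivalency model's combination of course description tf-idf with course2vec can be sketched as cosine similarity over concatenated, L2-normalized representations. The concatenation scheme and toy vectors below are illustrative assumptions, not the exact formulation used in the production system.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def equivalency_score(tfidf_a, tfidf_b, c2v_a, c2v_b):
    """Blend description (tf-idf) and enrollment (course2vec) similarity
    by concatenating the L2-normalized halves of each course's vector."""
    a = np.concatenate([tfidf_a / np.linalg.norm(tfidf_a),
                        c2v_a / np.linalg.norm(c2v_a)])
    b = np.concatenate([tfidf_b / np.linalg.norm(tfidf_b),
                        c2v_b / np.linalg.norm(c2v_b)])
    return cosine(a, b)

# Toy example: identical descriptions but different enrollment
# embeddings yield a score averaging the two component similarities.
tfidf = np.array([0.5, 0.5, 0.0])
c2v_x = np.array([1.0, 0.0])
c2v_y = np.array([0.6, 0.8])
score = equivalency_score(tfidf, tfidf, c2v_x, c2v_y)
```

Because each half is normalized before concatenation, the two signals contribute equally; a weighted concatenation would let one signal dominate.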
      <p>Are more senior students less likely to rate courses as novel or
unexpected, given they have been at the University longer and been
exposed to more courses? Among our sophomore (27), junior (22),
and senior (21) level students, there were no statistically significant
trends among the six measures, except for a marginally significant
trend (p = 0.007, shy of the p &lt; 0.003 threshold given the Bonferroni
correction) of more senior students rating recommendations as less
unexpected (avg = 2.921) than juniors (avg = 3.024), whose ratings
were not statistically separable from sophomores (avg = 3.073).</p>
      <p>In this section, we attempt to synthesize qualitative
characterizations of the different algorithms by looking at the open
responses students gave to the question asking them to describe any
commonalities they saw between the recommendations made by each
algorithm and their favorite course.</p>
      <p>5.3.1 BOW (div). Several students remarked positively about
recommendations matching to the themes of "art, philosophy, and
society" or "design" exhibited in their favorite course. The word
"language" was mentioned by 14 of the 61 respondents answering
the open response question. Most of these comments were negative,
pointing out the limitations of similarity matching based solely
on the literal course description text. The most common critique
given in this category was of the foreign spoken language courses
that showed up at the lower ranks when students specified a
favorite course involving programming languages. Other students
remarked on additional dissimilarity when specifying a favorite
course related to cyber security and receiving financial security courses
in the results.</p>
      <p>5.3.2 Analogy (div). The word "interesting" appeared in seven of
the 54 comments left by students to describe commonalities among
the analogy validation optimized algorithm. This word was not
among the top 10 most frequent words in any of the other four
algorithms. Several students identified broad themes among the
courses that matched to their favorite course, such as "identity" and
"societal development." On the other end of the spectrum, one
student remarked that the results "felt weird" and were only "vaguely
relevant." Another student stated that "the most interesting
suggestion was the Introduction to Embedded Systems [course] which is
just different enough from my favorite course that it’s interesting
but not too different that I am not interested," which poignantly
articulates the crux of the difficulty in striking a balance between
interest and unexpectedness to achieve a serendipitous recommendation.</p>
      <p>5.3.3 Equivalency (div). Many students (seven of the 55) remarked
positively on the commonality of the results with themes of data
exhibited by their favorite course (in most cases STATS C8, an
introductory data science course). They mentioned how the courses
all involved "interacting with data in different social, economic,
and psychological contexts" and "data analysis with different
applications." One student remarked on this algorithm’s tendency to
match at or around the main topic of the favorite course, further
remarking that "they were relevant if looking for a class tangentially
related."</p>
      <p>5.3.4 Equivalency (non-div). This algorithm was the same as the
above, except that it did not limit results to one course per
department. Because of this lack of department filter, 15 of the 68 students
submitting open text responses to the question of commonality
pointed out that the courses returned were all from the same
department. Since this model scored highest on a validation task of
matching to a credit equivalent course pair (almost always in the
same department), it is not surprising that students observed that
results from this algorithm tended to all come from the department
of the favorite course, which also put it close to their nexus of
interest.</p>
      <p>5.3.5 RNN (non-div). The RNN scored lowest in novelty,
significantly lower than the other non-div algorithm, and was not
significantly different from it in successfulness.
In this case, what is the possible utility of the collaborative-based
RNN over the non-div Equivalency model? Many of the 47 (of 50)
student answers to the open response commonality question
identified, as the distinguishing signature of this algorithm, that the
recommendations related to their major (mentioned by 21 students) and
contained courses that fulfilled a requirement (mentioned by
seven). Since the RNN is based on normative next-course enrollment
behavior, it is reasonable that it suggested many courses that satisfy
an unmet requirement. This algorithm’s ability to predict student
enrollments accurately became a detriment to some, as seven
remarked that it was recommending courses that they were currently
enrolled in. Due to the institutional data refresh schedule, student
current enrollments are not known until after the add/drop deadline.
This may be a shortcoming that can be rectified in the future.
</p>
    </sec>
    <sec id="sec-20">
      <title>6 FEATURE RE-DESIGN</title>
      <p>As a result of the feedback received from the user study, we worked
with campus to pull down real-time information on student
requirement satisfaction from the Academic Plan Review module of the
PeopleSoft Student Information System. We re-framed the RNN
feature as a "Requirements" satisfying feature that, upon log-in, shows
students their personalized list of unsatisfied requirements (Figure
8). After selecting a requirement category to satisfy, the system
displays courses which satisfy the selected requirement and are
offered in the target semester. The list of courses is sorted by the RNN
according to the probability that students like them will take the
class. This provides a signal to the student of what the normative
course taking behavior is in the context of requirement satisfaction.
For serendipitous suggestions, we created a separate "Explore" tab
(Figure 7) using the BOW (div) model to surface the top five similar
courses across departments, due to its strong serendipity and
novelty ratings. The Equivalency (non-div) model was used to display
an additional five most similar courses within the same department.
This model was chosen due to its strong successfulness ratings.
</p>
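The cross-department diversification that distinguishes the "(div)" models and drives the Explore tab can be sketched as a post-filter over a similarity-ranked list. The course list and the (course_id, department) field layout below are hypothetical, for illustration only.

```python
def diversify_by_department(ranked_courses, top_n=5):
    """Keep at most one course per department, preserving rank order.

    ranked_courses: similarity-ranked (course_id, department) pairs.
    A non-div variant would simply return the first top_n course ids.
    """
    seen_departments = set()
    results = []
    for course_id, department in ranked_courses:
        if department in seen_departments:
            continue  # skip further courses from an already-used department
        seen_departments.add(department)
        results.append(course_id)
        if len(results) == top_n:
            break
    return results

# Hypothetical ranked results for illustration only.
ranked = [("CS 189", "EECS"), ("CS 188", "EECS"),
          ("STAT 154", "Statistics"), ("DATA 102", "Data Science"),
          ("INFO 251", "Information"), ("IND ENG 142", "IEOR")]
top = diversify_by_department(ranked, top_n=4)
```

Applying the filter trades raw similarity (successfulness) for departmental spread, which is the tension the rating results above illustrate.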
    </sec>
    <sec id="sec-21">
      <title>7 DISCUSSION</title>
      <p>Surfacing courses that are of interest but not known before means
expanding a student’s knowledge and understanding of the
University’s offerings. As students are exposed to courses that veer further
from their home department and nexus of interest and
understanding, recommendations become less familiar, with descriptions that
are harder to connect with. This underscores the difficulty of
producing an unexpected but interesting course suggestion, as it often
must represent a recommendation of uncommon wisdom in order
to extend outside of a student’s zone of familiarity surrounding
their centers of interest. Big data can be a vehicle for, at times,
reaching that wisdom. Are recommendations useful when they
suggest something expected or already known? Two distinct sets of
responses to this question emerged from student answers to the last
open ended feedback question. One representative remark stated,
"The best algorithms were the ones that had more
diverse options, while still staying true to the core
function of the class I was searching. The
algorithms that returned classes that were my major
requirements/in the same department weren’t as
helpful because I already knew of their existence
as electives I could be taking"
A different representative view was expressed with:
"I think the fifth algorithm [RNN] was the best fit
for me because my major is pretty standardized"
These two comments make a case for both capabilities being of
importance. They are also a reminder of the desire among young
adults for the socio-technical systems of the university to offer a
balance of information, exploration and, at times, guidance.</p>
    </sec>
    <sec id="sec-22">
      <title>8 LIMITATIONS</title>
      <p>
        The more distal a course description is, even if conceptually similar,
the less a student may be able to recognize the commonality with
a favorite course. A limitation of our study in demonstrating the
utility of the neural embedding is that students had to rely on the
course description semantics in order to familiarize themselves
with the suggested course and determine if they were interested
in taking it. If a concept was detected by the neural embedding
but not the BOW, this likely meant that the concept was difficult
to pick up from the course description displayed to students. Past
work has shown that, when important aspects of the recommended
content are not described in the recommendation, users evaluate media
recommendations less favorably before taking the recommendation
than after [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Future work could augment recommended
course descriptions with additional information, including latent
semantics inferred from enrollments [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or additional semantics
retrieved from available course syllabi.
      </p>
    </sec>
    <sec id="sec-23">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was partly supported by the United States National
Science Foundation (1547055/1446641) and the National Natural
Science Foundation of China (71772101/71490724).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Sorathan</given-names>
            <surname>Chaturapruek</surname>
          </string-name>
          , Thomas Dee, Ramesh Johari, René Kizilcec, and Mitchell Stevens.
          <year>2018</year>
          .
          <article-title>How a data-driven course planning tool affects college students' GPA: evidence from two field experiments</article-title>
          . (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Hung-Hsuan</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Behavior2Vec: Generating Distributed Representations of Users’ Behaviors on Products for Recommender Systems</article-title>
          .
          <source>ACM Transactions on Knowledge Discovery from Data (TKDD) 12</source>
          ,
          <issue>4</issue>
          (
          <year>2018</year>
          ),
          <fpage>43</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          , Prabhakar Raghavan, and
          <string-name>
            <given-names>Hinrich</given-names>
            <surname>Schütze</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Introduction to information retrieval</article-title>
          .
          <source>An Introduction To Information Retrieval</source>
          <volume>151</volume>
          ,
          <issue>177</issue>
          (
          <year>2008</year>
          ),
          <fpage>5</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Dillon</surname>
          </string-name>
          .
          <year>1983</year>
          .
          <article-title>Introduction to modern information retrieval: G. Salton and M. McGill</article-title>
          .
          McGraw-Hill, New York (
          <year>1983</year>
          ). 448 pp.,
          <source>ISBN 0-07-054484-0.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Matt</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Run</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zach A</given-names>
            <surname>Pardos</surname>
          </string-name>
          . In press.
          <article-title>Design and Deployment of a Better Course Search Tool: Inferring latent keywords from enrollment networks</article-title>
          .
          <source>In Proceedings of the 14th European Conference on Technology Enhanced Learning</source>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Rosta</given-names>
            <surname>Farzan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Brusilovsky</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Encouraging user participation in a course recommender system: An impact on user behavior</article-title>
          .
          <source>Computers in Human Behavior</source>
          <volume>27</volume>
          ,
          <issue>1</issue>
          (
          <year>2011</year>
          ),
          <fpage>276</fpage>
          -
          <lpage>284</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Li</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Rob</given-names>
            <surname>Fergus</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Pietro</given-names>
            <surname>Perona</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>One-shot learning of object categories</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence 28</source>
          ,
          <issue>4</issue>
          (
          <year>2006</year>
          ),
          <fpage>594</fpage>
          -
          <lpage>611</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Balázs</given-names>
            <surname>Hidasi</surname>
          </string-name>
          , Massimo Quadrana, Alexandros Karatzoglou, and
          <string-name>
            <given-names>Domonkos</given-names>
            <surname>Tikk</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Parallel recurrent neural network architectures for feature-rich session-based recommendations</article-title>
          .
          <source>In Proceedings of the 10th ACM Conference on Recommender Systems. ACM</source>
          ,
          <volume>241</volume>
          -
          <fpage>248</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Weijie</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zachary A</given-names>
            <surname>Pardos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Qiang</given-names>
            <surname>Wei</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Goal-based course recommendation</article-title>
          .
          <source>In Proceedings of the 9th International Conference on Learning Analytics &amp; Knowledge. ACM</source>
          ,
          <volume>36</volume>
          -
          <fpage>45</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Judy</given-names>
            <surname>Kay</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Stereotypes, student models and scrutability</article-title>
          .
          <source>In International Conference on Intelligent Tutoring Systems</source>
          . Springer,
          <fpage>19</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Benedikt</given-names>
            <surname>Loepp</surname>
          </string-name>
          , Tim Donkers, Timm Kleemann, and
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Ziegler</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Impact of item consumption on assessment of recommendations in user studies</article-title>
          .
          <source>In Proceedings of the 12th ACM Conference on Recommender Systems. ACM</source>
          ,
          <volume>49</volume>
          -
          <fpage>53</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Tien T</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pik-Mai</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F Maxwell</given-names>
            <surname>Harper</surname>
          </string-name>
          , Loren Terveen, and Joseph A Konstan
          .
          <year>2014</year>
          .
          <article-title>Exploring the filter bubble: the effect of using recommender systems on content diversity</article-title>
          .
          <source>In Proceedings of the 23rd international conference on World wide web. ACM</source>
          ,
          <volume>677</volume>
          -
          <fpage>686</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Aditya</given-names>
            <surname>Parameswaran</surname>
          </string-name>
          , Petros Venetis, and
          <string-name>
            <given-names>Hector</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Recommendation systems with complex constraints: A course recommendation perspective</article-title>
          .
          <source>ACM Transactions on Information Systems (TOIS) 29</source>
          ,
          <issue>4</issue>
          (
          <year>2011</year>
          ),
          <fpage>20</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Zachary A</given-names>
            <surname>Pardos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zihao</given-names>
            <surname>Fan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Weijie</given-names>
            <surname>Jiang</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Connectionist recommendation in the wild: on the utility and scrutability of neural networks for personalized course guidance</article-title>
          .
          <source>User Modeling and User-Adapted Interaction 29</source>
          ,
          <issue>2</issue>
          (
          <year>2019</year>
          ),
          <fpage>487</fpage>
          -
          <lpage>525</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Zachary A</given-names>
            <surname>Pardos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Weijie</given-names>
            <surname>Jiang</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Combating the Filter Bubble: Designing for Serendipity in a University Course Recommendation System</article-title>
          . arXiv preprint arXiv:1907.01591 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Zachary A</given-names>
            <surname>Pardos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew Joo Hun</given-names>
            <surname>Nam</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A Map of Knowledge</article-title>
          . CoRR preprint abs/1811.07974 (
          <year>2018</year>
          ). https://arxiv.org/abs/1811.07974
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Agoritsa</given-names>
            <surname>Polyzou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Athanasios N</given-names>
            <surname>Nikolakopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>George</given-names>
            <surname>Karypis</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Scholars Walk: A Markov Chain Framework for Course Recommendation</article-title>
          .
          <source>In Proceedings of the 12th International Conference on Educational Data Mining</source>
          .
          <fpage>396</fpage>
          -
          <lpage>401</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Zhiyun</given-names>
            <surname>Ren</surname>
          </string-name>
          , Xia Ning, Andrew S Lan, and
          <string-name>
            <given-names>Huzefa</given-names>
            <surname>Rangwala</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Grade Prediction Based on Cumulative Knowledge and Co-taken Courses</article-title>
          .
          <source>In Proceedings of the 12th International Conference on Educational Data Mining</source>
          .
          <fpage>158</fpage>
          -
          <lpage>167</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Leonardo FR</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          , Pedro HP Saverese, and Daniel R Figueiredo.
          <year>2017</year>
          .
          <article-title>struc2vec: Learning node representations from structural identity</article-title>
          .
          <source>In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM</source>
          ,
          <volume>385</volume>
          -
          <fpage>394</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Guy</given-names>
            <surname>Shani</surname>
          </string-name>
          and
          <string-name>
            <given-names>Asela</given-names>
            <surname>Gunawardana</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Evaluating recommendation systems</article-title>
          .
          <source>In Recommender systems handbook</source>
          . Springer,
          <fpage>257</fpage>
          -
          <lpage>297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          .
          <volume>5998</volume>
          -
          <fpage>6008</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , Charles Blundell, Tim Lillicrap,
          <string-name>
            <given-names>Daan</given-names>
            <surname>Wierstra</surname>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>Matching networks for one shot learning</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          .
          <volume>3630</volume>
          -
          <fpage>3638</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Yuan Cao</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Diarmuid Ó Séaghdha, Daniele Quercia, and
          <string-name>
            <given-names>Tamas</given-names>
            <surname>Jambor</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Auralist: introducing serendipity into music recommendation</article-title>
          .
          <source>In Proceedings of the fifth ACM international conference on Web search and data mining . ACM</source>
          ,
          <volume>13</volume>
          -
          <fpage>22</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>