<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Designing for Serendipity in a University Course Recommendation System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zachary Pardos</string-name>
          <email>zp@berkeley.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Weijie Jiang</string-name>
          <email>jiangwj.14@sem.tsinghua.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Tsinghua University &amp; University of California</institution>
          ,
          <addr-line>Berkeley</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of California</institution>
          ,
          <addr-line>Berkeley</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>Collaborative filtering based algorithms, including Recurrent Neural Networks (RNN), tend towards predicting a perpetuation of past observed behavior. In a recommendation context, this can lead to an overly narrow set of suggestions lacking in serendipity and inadvertently placing the user in what is known as a "filter bubble." In this paper, we grapple with the issue of the filter bubble in the context of a course recommendation system in production at a public university. Our approach is to present course results that are novel or unexpected to the student but still relevant to their interests. We build one set of models based on the course catalog description (BOW) and another set informed by enrollment histories (course2vec). We compare the performance of these models on off-line validation sets and against the system's existing RNN-based recommendation engine in a user study of undergraduates (N = 70) who rated their course recommendations along six characteristics related to serendipity. Results of the user study show a dramatic lack of novelty in RNN recommendations and depict the characteristic trade-offs that make serendipity difficult to achieve. While the machine-learned course2vec models performed best on concept generalization tasks (i.e., course analogies), it was the simple bag-of-words based recommendations that students rated as more serendipitous. We discuss the role of the recommendation interface and the information presented therein in the student's decision to accept a recommendation from either algorithm.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Applied computing → Education; • Information systems → Recommender systems.</p>
      <p>Keywords: Higher education, course guidance, filter bubble, neural networks</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>
        Among the institutional values of a liberal arts university is to
expose students to a variety of perspectives expressed in courses
across its various physical and intellectual schools of thought.
Collaborative filtering based sequence prediction methods, in this
environment, can provide personalized course recommendations based
on temporal models of normative behavior [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] but are not well
suited for surfacing courses a student may find interesting but
which have been relatively unexplored by those with similar course
selections to them in the past. Therefore, a more diversity-oriented
model can serve as an appropriate complement to recommendations
made by collaborative filtering based methods. This problem of training
on the past without necessarily repeating it is an open problem
in many collaborative filtering based recommendation contexts,
particularly social networks, where, in the degenerate cases, users
can get caught in "filter bubbles," or model-based user stereotypes,
leading to a narrowing of item recommendation variety [
        <xref ref-type="bibr" rid="ref10 ref12 ref23">10, 12, 23</xref>
        ].
      </p>
      <p>
        To counteract the filter bubble, we introduce a course2vec
variant into a production recommender system at a public university
designed to surface serendipitous course suggestions. Course2vec
applies a skip-gram to course enrollment histories, instead of
natural language, in order to learn representations. We use the
definition of serendipity as user perceived unexpectedness of result
combined with successfulness [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], which we define as a course
recommendation a student expresses interest in taking. At many
universities, conceptually similar courses exist across departments
but use widely differing disciplinary vernacular in their catalog
descriptions, making them difficult for learners to search for and
to realize their commonality. We propose that by tuning a vector
representation of courses learned from nine years of enrollment
sequences, we can capture enough implicit semantics of the courses
to construe similarity more abstractly and accurately. To encourage
the embedding to learn features that may generalize across
departments, our skip-gram variant simultaneously learns department
(and instructor) embeddings. While more advanced attention-based
text generation architectures exist [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], we demonstrate that
properties of the linear vector space produced by "shallow" networks are
of utility to this recommendation task. Our recommendations are
made with only a single explicit course preference given by the user,
as opposed to the entire course selection history needed by
session-based Recurrent Neural Network approaches [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Single example,
also known as "one-shot," generalization is common in the vision
community, which has pioneered approaches to extrapolating a
category from a single labeled example [
        <xref ref-type="bibr" rid="ref22 ref7">7, 22</xref>
        ]. Other related work
applying skip-grams to non-linguistic data includes node
embeddings learned from sequences of random walks of graphs [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and
product embeddings learned from e-commerce clickstreams [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Our
work, methodologically, adds rigor to this approach by tuning the
model against validation sets created from institutional knowledge
and curated by the university.
      </p>
      <p>We conduct a user study (N = 70) of undergraduates at the
University to evaluate their personalized course recommendations made
by our models designed for serendipity and by the RNN-based
engine, which previously drove recommendations in the system.
The findings underscore the tension between unexpectedness and
successfulness and show the deficiency of RNNs for producing
novel recommendations. While our course2vec based model scored
68% above bag-of-words in accuracy on one of our course analogy
validation sets, simple bag-of-words scored slightly higher in the
main objective of user perceived serendipity. A potential reason
for this discrepancy is the nature of information presented to
students in the recommender system interface. Catalog descriptions
of recommended courses were shown to students, which served
as the only source of information they could consult in deciding if
they wanted to take the course. A generated explanation, or a
prioritization of the course2vec recommendations in the interface, may
be required to overcome the advantage of the bag-of-words model,
which draws on the same information shown to students in the
recommendation interface.</p>
      <p>
        Recommender systems in higher education contexts have
recently focused on prediction of which courses a student will take
[
        <xref ref-type="bibr" rid="ref14 ref17">14, 17</xref>
        ] or the grade they will receive if enrolled [
        <xref ref-type="bibr" rid="ref18 ref9">9, 18</xref>
        ]. At Stanford,
a system called "CARTA" allows students to see grade distributions,
course evaluations, and the most common courses taken before a
course of interest [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. At UC Berkeley, the recommender system
being modified in this study serves students next-semester course
considerations based on their personal course enrollment history
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Earlier systems included a focus on requirement satisfaction
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and career-based relevancy recommendation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. No system
has yet focused on serendipitous or novel course discovery.
      </p>
    </sec>
    <sec id="sec-3">
      <title>MODELS AND METHODOLOGY</title>
      <p>
        This section introduces three competing models used to generate
our representations. The first model uses course2vec [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to learn
course representations from enrollment sequences. Our second
model is a variant on course2vec, which learns representations of
explicitly defined features of a course (e.g., instructor or department)
in addition to the course representation. The intuition behind this
approach is that the course representation could have, conflated
in it, the influence of the multiple-instructors that have taught
the course over time. We contend that this "deconflation" may
increase the fidelity of the course representation and serve as a
more accurate representation of the topical essence of the course.
The last representation model is a standard bag-of-words vector,
constructed for each course strictly from its catalog description.
Finally, we explore concatenating a course’s course2vec and
bag-of-words representation vectors.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Course2vec</title>
      <p>The course2vec model involves learning distributed representations
of courses from students’ enrollment records throughout semesters
by using a notion of an enrollment sequence as a "sentence" and
courses within the sequence as "words", borrowing terminology
from the linguistic domain. For each student s, a chronological
course enrollment sequence is produced by first sorting by
semester then randomly serializing within-semester course order. Then,
each course enrollment sequence is used in training, similar to a
document in a classical skip-gram application.</p>
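The sequence construction described above can be sketched in Python; the record layout and course IDs below are hypothetical, and the within-semester shuffle mirrors the random serialization step (this is a sketch, not the production pipeline):

```python
import random

def build_enrollment_sequences(records, seed=0):
    """Build one chronological course "sentence" per student.

    `records` maps a student ID to (semester_index, course_id) tuples
    (hypothetical layout). Courses are sorted by semester; order within
    a semester is randomized, since within-term order carries no meaning.
    """
    rng = random.Random(seed)
    sequences = []
    for student, enrollments in records.items():
        by_semester = {}
        for semester, course in enrollments:
            by_semester.setdefault(semester, []).append(course)
        sequence = []
        for semester in sorted(by_semester):
            courses = by_semester[semester]
            rng.shuffle(courses)  # random within-semester serialization
            sequence.extend(courses)
        sequences.append(sequence)
    return sequences

records = {
    "s1": [(2, "CS61B"), (1, "CS61A"), (1, "MATH1A")],
    "s2": [(1, "MATH1A"), (2, "MATH1B")],
}
sequences = build_enrollment_sequences(records)
```

Each resulting sequence can then be fed to any skip-gram implementation as a training "sentence".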
      <p>
        The training objective of the skip-gram model is to find word
representations that are useful for predicting the surrounding words
in a sentence or a document. Each word in the corpus is used as an
input to a log-linear classifier with continuous projection layer, to
predict words within a certain range before and after the current
word. Therefore, the skip-gram model can also be viewed as a
classifier with input as a target course and output as a context
course. In this section, we consider adding features of a course to
the input to enhance the classifier and its representations, as shown
in Figure 1. Each course is taught by one or several instructors
over the years and is associated with an academic department.
The multi-factor course2vec model learns both course and course
feature representations by maximizing the objective function over
all the students’ enrollment sequences and the features of courses.
Full technical details can be found in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>In language models, two word vectors will be cosine similar if
they share similar sentence contexts. Likewise, in the university
domain, courses that share similar co-enrollments, and similar
previous and next semester enrollments, will likely be close to one
another in the vector space.
</p>
    </sec>
    <sec id="sec-5">
      <title>Bag-of-Words</title>
      <p>
        A simple but indelible approach to item representation has been to
create a vector, the length of the number of unique words across all
items, with a non-zero value if the word in the vocabulary appears
in it. Only unigram words are used to create this unordered vector
list of words used to represent the document [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
The basic methodology based on bag-of-words proposed by IR
researchers for text corpora - a methodology successfully deployed
in modern Internet search engines - reduces each document in the
corpus to a vector of real numbers, each of which represents a term
weight. The term weight might be:
• a term frequency value indicating how many times the term
occurred in the document;
• a binary value, with 1 indicating that the term occurred in
the document and 0 indicating that it did not;
• a tf-idf score [
        <xref ref-type="bibr" rid="ref4">4</xref>
], the product of term frequency and inverse
document frequency, which increases proportionally to the
number of times a word appears in the document and is
offset by the frequency of the word in the corpus, helping
to adjust for the fact that some words appear more frequently
in general.
      </p>
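The three term-weighting schemes can be sketched on a toy corpus of tokenized descriptions (an illustrative sketch; the deployed system's exact weighting may differ):

```python
import math
from collections import Counter

def term_weights(docs, scheme="tfidf"):
    """Compute per-document term-weight vectors for the three schemes.

    `docs` is a list of tokenized documents. Returns one {term: weight}
    dict per document. The tf-idf here uses raw counts times log(N/df),
    one common variant among several.
    """
    n = len(docs)
    df = Counter()  # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        if scheme == "tf":
            vec = dict(tf)
        elif scheme == "binary":
            vec = {t: 1 for t in tf}
        else:  # tf-idf: term frequency times inverse document frequency
            vec = {t: c * math.log(n / df[t]) for t, c in tf.items()}
        vectors.append(vec)
    return vectors

docs = [["matrix", "algebra", "proof"],
        ["matrix", "computation"],
        ["proof", "logic"]]
tfidf = term_weights(docs)
```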
      <p>We evaluate all three variants in our quantitative validation testing.
</p>
    </sec>
    <sec id="sec-6">
      <title>Surfacing Serendipitous Recommendations from Course Representations</title>
      <p>We surface recommendations intended to be interesting but
unexpected by finding the course c∗j in each department dj that is most
similar to a student’s favorite course ci, diversifying the results by
allowing only one result per department:
c∗j = arg max_{c : d(c) = dj} cos(c, ci)    (1)
where d(c) denotes the department of course c. The per-department
courses c∗j from all the other departments are then ranked
according to cos(c∗j, ci), where j = 1, 2, ..., k. We can apply both
neural representations and bag-of-words representations of courses in
this method to generate the most similar course in each
department.</p>
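The per-department arg max and ranking step can be sketched as follows (toy two-dimensional vectors and hypothetical course IDs):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def diversified_recommendations(favorite, courses, k=10):
    """Return up to k courses, at most one per department, ranked by
    cosine similarity to the student's favorite-course vector.

    `courses` maps course_id -> (department, vector); `favorite` is the
    vector of the student's stated favorite course.
    """
    best = {}  # department -> (similarity, course_id)
    for cid, (dept, vec) in courses.items():
        sim = cosine(favorite, vec)
        if dept not in best or sim > best[dept][0]:
            best[dept] = (sim, cid)  # per-department arg max
    ranked = sorted(best.values(), reverse=True)  # rank the winners
    return [cid for _, cid in ranked[:k]]

courses = {
    "STAT134": ("STAT", [1.0, 0.1]),
    "STAT135": ("STAT", [0.9, 0.0]),
    "CS188":   ("CS",   [0.8, 0.6]),
}
recs = diversified_recommendations([1.0, 0.0], courses)
```

The same function works with either bag-of-words or course2vec vectors, since only cosine similarity is consumed.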
    </sec>
    <sec id="sec-7">
      <title>EXPERIMENTAL ENVIRONMENTS</title>
    </sec>
    <sec id="sec-8">
      <title>Off-line Dataset</title>
      <p>We used a dataset containing anonymized student course
enrollments at UC Berkeley from Fall 2008 through Fall 2017. The dataset
consists of per-semester course enrollment records for 164,196
students (both undergraduates and graduates) with a total of 4.8 million
enrollments. A course enrollment record means that the student
was still enrolled in the course at the end of the semester. Students at
this university, during this period, were allowed to drop courses up
until close to the end of the semester without penalty. The median
course load during students’ active semesters was four. There were
9,478 unique lecture courses from 214 departments1 hosted in 17
different Divisions of 6 different Colleges. Course meta-information
contains course number, department name, total enrollment and
max capacity. In this paper, we only consider lecture courses with
at least 20 enrollments total over the 9-year period, leaving 7,487
courses. Although courses can be categorized as undergraduate
courses and graduate courses, undergraduates are permitted to
enroll in many graduate courses regardless of their status.</p>
      <p>Enrollment data were sourced from the campus enterprise data
warehouse with course descriptions sourced from the official
campus course catalog API. We pre-processed the course description
data in the following steps: (1) removing generic, often-seen
sentences across descriptions, (2) removing stop words, (3) removing
punctuation, (4) word lemmatization and stemming, and finally
tokenizing the bag-of-words in each course description. We then
compiled the term frequency vector, binary value vector, and tf-idf
vector for each course.
1At UC Berkeley, the smallest academic unit is called a "subject." For the purpose of
communicability, we instead refer to subjects as departments.
3.1.1 Semantic Validation Sets. In order to quantitatively evaluate
how accurate the vector models are, a source of ground truth on
the relationships between courses needed to be brought to bear to
see the degree to which the vector representations encoded this
information. We used two such sources of ground truth to serve as
validation sets, one providing information on similarity, the other
on a variety of semantic relationships.</p>
      <p>We trained the models described in Section 2.1 on the student
enrollment records data. Specifically, we added the instructor(s)
who teach the course and the course department as two input
features of courses in the multi-factor course2vec model.</p>
      <p>
        • Equivalency validation set: A set of 1,351 course credit-equivalency
pairs maintained by the Office of the Registrar were used for
similarity based ground truth. A course is paired with
another course in this set if a student can only receive credit for
taking one of the courses. For example, an honors and
non-honors version of a course will appear as a pair because
faculty have deemed that there is too much overlapping
material between the two for a student to receive credit for
both.
• Analogy validation set: The standard method for validating learned
word vectors has been to use analogies to test the degree to which the
embedding structure contains semantic and syntactic relationships
constructed from prior knowledge. In the domain of university
courses, we use course relationship pairs constructed from prior
work using first-hand knowledge of the courses [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The 77
relationship pairs were in five categories: online, honors, mathematical
rigor, 2-department topics, and 3-department topics. An example of
an "online" course pair would be Engineering 7 and its online
counterpart, Engineering W7, or Education 161 and W161. An analogy
involving two of these pairs could be calculated as: Engineering W7 −
Engineering 7 + Education 161 ≈ Education W161.
      </p>
      <p>To evaluate course vectors on the course equivalency validation
set, we fixed the first course in each pair and ranked all the other
courses according to their cosine similarity to the first course in
descending order. We then noted the rank of the expected second
course in the pair and described the performance of each model
on all validation pairs in terms of mean rank, median rank, and
recall@10.
      </p>
    </sec>
    <sec id="sec-9">
      <title>Online Environment (System Overview)</title>
      <p>The production recommender system at UC Berkeley uses a
student data pipeline with the enterprise data warehouse to keep
up-to-date enrollment histories of students. Upon CAS login, these
histories are associated with the student and passed through an
RNN model, which cross-references the output recommendations
with the courses offered in the target semester. Class availability
information is retrieved during the previous semester from a
campus API once the registrar has released the schedule. The system is
written with an AngularJS front-end and a Python back-end service
which loads the machine-learned models written in PyTorch. These
models are version controlled on GitHub and refreshed three times
per semester after student enrollment status refreshes from the
pipeline. The system receives traffic from around 20% of the
undergraduate student body, partly from the UC Berkeley Registrar’s
website.</p>
    </sec>
    <sec id="sec-10">
      <title>VECTOR MODEL REFINEMENT EXPERIMENTS</title>
      <p>In this section, we first introduce our experiment parameters and
the ways we validated the representations quantitatively. Then, we
describe the various ways in which we refined the models and the
results of these refinements.</p>
    </sec>
    <sec id="sec-12">
      <title>Model Evaluations</title>
      <p>For evaluation of the course analogy validation set, we followed
the analogy paradigm of: course2 − course1 + course3 ≈ course4.
Courses were ranked by their cosine similarity to course2−course1+
course3. An analogy completion is considered accurate (a hit) if
the first ranked course is the expected course4 (excluding the other
three from the list). We calculated the average accuracy (recall@1)
and the recall@10 over all the analogies in the analogy validation
set.
</p>
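The analogy scoring procedure above can be sketched as follows (toy two-dimensional vectors with hypothetical IDs; the real course vectors are learned embeddings):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def analogy_hits(analogies, vectors, topn=10):
    """Score analogy completions course2 - course1 + course3 ~ course4.

    `analogies` holds (course1, course2, course3, course4) ID tuples and
    `vectors` maps IDs to embeddings. Returns (recall@1, recall@topn);
    the three query courses are excluded from the ranking.
    """
    hit1 = hitn = 0
    for c1, c2, c3, c4 in analogies:
        # query vector: course2 - course1 + course3
        query = [b - a + c for a, b, c in
                 zip(vectors[c1], vectors[c2], vectors[c3])]
        ranked = sorted(
            (cid for cid in vectors if cid not in (c1, c2, c3)),
            key=lambda cid: cosine(query, vectors[cid]),
            reverse=True,
        )
        hit1 += ranked[0] == c4          # accurate only if top-ranked
        hitn += c4 in ranked[:topn]      # recall@topn
    return hit1 / len(analogies), hitn / len(analogies)

vectors = {
    "ENG7": [1.0, 0.0], "ENGW7": [1.0, 1.0],
    "EDU161": [2.0, 0.0], "EDUW161": [2.0, 1.0],
    "OTHER": [5.0, -3.0],
}
r1, r10 = analogy_hits([("ENG7", "ENGW7", "EDU161", "EDUW161")], vectors)
```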
    </sec>
    <sec id="sec-13">
      <title>Course2vec vs. Multi-factor Course2vec</title>
      <p>
        We compared the pure course2vec model with the course
representations from the multi-factor course2vec model using instructor,
department, and both as factors. Full results of evaluation on the
equivalency validation and analogy validation are shown in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
The multi-factor model outperformed the pure course2vec model
in terms of recall@10 in both validation sets, with the combined
instructor and department factor model performing the best.
      </p>
    </sec>
    <sec id="sec-14">
      <title>Bag-of-words vs. Multi-factor Course2vec</title>
      <p>
        Among the three bag-of-words models, tf-idf performed the best in
all equivalency set metrics. The median rank (best=4) and recall@10
(best=0.5647) for the bag-of-words models were also substantially
better than the best course2vec models, which had a best median
rank of 15 with best recall@10 of 0.4485 for the multi-factor
instructor and department model. All course2vec models, however,
showed better mean rank performance (best=224) compared with
bag-of-words (best=566). This suggests that there are many outliers
where literal semantic similarity (bag-of-words) is very poor at
identifying equivalent pairs, whereas course2vec has far fewer
near worst-case examples. This result is consistent with prior work
comparing pure course2vec models to binary bag-of-words [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>When considering performance on the analogy validation, the
roles are reversed, with all course2vec models performing better
than the bag-of-words models in both accuracy and recall@10. The
difference in recall of bag-of-words compared to course2vec when
it comes to analogies is substantial (0.581 vs 0.8557), a considerably
larger difference than between bag-of-words and course2vec on
equivalencies (0.5647 vs 0.4485). Again, the multi-factor instructor
and department model and tf-idf were the best models in their
respective class. These analyses establish that bag-of-words models
are moderately superior in capturing course similarity, but are
highly inferior to enrollment-based course2vec models in the more
complex task of analogy completion.
</p>
    </sec>
    <sec id="sec-15">
      <title>Combining Bag-of-words and Course2vec Representations</title>
      <p>In light of the strong analogy performance of course2vec and the strong
equivalency performance of bag-of-words in the previous section,
we concatenated the multi-factor course2vec representations with
bag-of-words representations. To address the different magnitudes
in the vectors between the two concatenated representations, we
create a normalized version of each vector set for comparison to
non-normalized sets.</p>
      <p>We found that the normalized concatenation of tf-idf with
multi-factor course2vec performed substantially better on the equivalency
test than the previous best model in terms of recall@10 (0.6435 vs.
0.5647). While the median rank of the concatenated model only
improved one rank, from 4 to 3, the mean rank improved dramatically
(from 566 to 132), and is the best of all models tested in terms of
mean rank. Non-normalized vectors did not show improvements
over bag-of-words alone in median rank and recall@10.
Improvements in the analogy test were milder, with a recall@10 of
0.8788 for the best concatenated model, combining binary
bag-of-words with multi-factor course2vec, compared with 0.8557 for the
best course2vec-only model. Normalization in the case of analogies
hurt all model performance, the opposite of what was observed in
the equivalency test. This suggests that normalization improves
local similarity but may act to degrade the more global structure of
the vector space.
</p>
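The concatenation-with-normalization scheme compared above can be sketched as (toy vectors; the real inputs are the tf-idf and course2vec vectors of a course):

```python
import math

def concat_representations(bow, c2v, normalize=True):
    """Concatenate a course's bag-of-words and course2vec vectors.

    When `normalize` is set, each vector is scaled to unit L2 norm first,
    so that neither representation dominates similarity computations by
    sheer magnitude.
    """
    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v] if n else list(v)
    if normalize:
        bow, c2v = unit(bow), unit(c2v)
    return list(bow) + list(c2v)

combined = concat_representations([3.0, 4.0], [0.0, 2.0])
```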
    </sec>
    <sec id="sec-17">
      <title>USER STUDY</title>
      <p>
        A user study was conducted to evaluate the quality of
recommendations drawn from our diferent course representations. Users
rated each course from each recommendation algorithm along five
dimensions of quality. Students were asked to rate course
recommendations in terms of their (1) unexpectedness, (2) successfulness
/ interest in taking the course, (3) novelty, (4) diversity of the results,
and (5) identifiable commonality among the results. In Shani and
Gunawardana [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], authors defined serendipity as the combination
of "unexpectedness" and "success." In the case of a song
recommender, for example, success would be defined as the user listening
to the recommendation. In our case, we use a student’s expression
of interest in taking the course as a proxy for success. The mean
of their unexpectedness and successfulness rating will comprise
our measure of serendipity. We evaluated three of our developed
models, all of which displayed 10 results, only showing one course
per department in order to increase diversity (and
unexpectedness). The models were (1) the best BOW model (tf-idf), (2) the best
Analogy validation model (binary BOW + multi-factor course2vec
normalized), (3) and the best Equivalency validation model (tf-idf +
multi-factor course2vec non-normalized). To measure the impact
our department diversification filter would have on serendipity,
we added a version of the best Equivalency model that did not
impose this filter, allowing multiple courses to be displayed from
the same department if they were the most cosine similar to the
user’s specified favorite course. Our fifth comparison
recommendation algorithm was the system’s existing collaborative-filtering
based Recurrent Neural Network (RNN) that recommends courses
based on a prediction of what the student is likely to take next
given their personal course history and what other students with a
similar history have taken in the past [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. All five algorithms were
integrated into a real-world recommender system for the purpose
of this study and evaluated by 70 undergraduates at the University.
      </p>
    </sec>
    <sec id="sec-18">
      <title>Study Design</title>
      <p>Undergraduates were recruited from popular University associated
Facebook groups and asked to sign-up for a one hour evaluation
session. Since they would need to specify a favorite course they had
taken, we restricted participants to those who had been at the
University at least one full semester and were currently enrolled. The
study was run at the beginning of the Fall semester, while courses
could still be added and dropped and some students were still
shopping for courses. We used a within-subjects design whereby each
volunteer rated ten course recommendations made by each of the
five algorithms. Because of the considerable number of ratings
expected ([3*10+2]*5 = 160) and the importance for students to
carefully consider each recommended course, in-person sessions
were decided on over asynchronous remote sessions in order to
better encourage on-task behavior throughout the session. Student
evaluators were compensated with a $40 gift card to attend one of
four sessions offered across three days with a maximum occupancy
of 25 in each session. A total of 70 students participated.</p>
      <p>We began the session by introducing the evaluation motivation
as a means for students to help inform the choice of algorithm
that we will use for a future campus-wide deployment of a course
exploration tool. Students started the evaluation by visiting a
survey URL that asked them to specify a favorite course they had
taken at the University. This favorite course was used by the first
four algorithms to produce 10 course recommendations each. Each
recommended course’s department, course number, title, and full
catalog description were displayed to the student in the interface.
There was a survey page (Figure 2) for each algorithm in which
students were asked to read the recommended course descriptions
carefully and then rate each of the ten courses individually on a
five-point Likert scale agreement with the following statements:
(1) This course was unexpected (2) I am interested in taking this
course (3) I did not know about this course before. These ratings
respectively measured unexpectedness, successfulness, and novelty.
After rating the individual courses, students were asked to rate
their agreement with the following statements pertaining to the
10 results as a whole: (1) Overall, the course results were diverse
(2) The course results shared something in common with my
favorite course. These ratings measured dimensions of diversity and
commonality. Lastly, students were asked to provide an optional
follow-up open text response to the question, "If you identified
something in common with your favorite course, please explain it
here." On the last page of the survey, students were asked to specify
their major, year, and to give optional open response feedback on
their experience. Graduate courses were not included in the
recommendations and the recommendations were not limited to courses
available in the current semester.
</p>
    </sec>
    <sec id="sec-19">
      <title>Results</title>
      <p>Results of average student ratings of the five algorithms across
the six measurement categories are shown in Table 1. The
diversity based algorithms, denoted by "(div)," all scored higher than
the non-diversity (non-div) algorithms in unexpectedness, novelty,
diversity, and the primary measure of serendipity. The two
non-diversity based algorithms, however, both scored higher than the
other three algorithms in successfulness and commonality. All
pairwise differences between diversity and non-diversity algorithms
were statistically significant, using the p &lt; 0.001 level after applying
a Bonferroni correction for multiple (60) tests. Within the diversity
2Due to an authentication bug during the fourth session, twenty participating students
were not able to access the collaborative recommendations of the fifth algorithm. RNN
results in the subsequent section are therefore based on the 50 students from the first
three sessions. When paired t-tests are conducted between RNN and the ratings of
other algorithms, the tests are between ratings among these 50 students.
algorithms, there were no statistically significant differences except
for BOW scoring higher than Equivalency (div) on unexpectedness
and scoring higher than both Equivalency (div) and Analogy (div)
on novelty. Among the two non-diversity algorithms, there were
no statistically significant differences except for the RNN scoring
higher on diversity and Equivalency (non-div) recommendations
scoring higher on novelty. With respect to measures of serendipity,
the div and non-div algorithms had similar scores among their
respective strengths (3.473-3.619); however, the non-div algorithms
scored substantially lower in their weak category of unexpectedness
(2.091 &amp; 2.184) than did the div algorithms in their weak category of
successfulness (2.851-2.999), resulting in statistically significantly
higher serendipity scores for the div algorithms.</p>
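The significance testing described above (paired t-tests with a Bonferroni correction over 60 comparisons) can be sketched as follows. The ratings here are synthetic stand-ins for the study data, and the 0.05 base alpha is an assumption; only the test structure mirrors the paper.

```python
import math
import random
from statistics import mean, stdev

random.seed(0)
n_students = 50   # students who rated both algorithms (first three sessions)
n_tests = 60      # total pairwise comparisons in the study design

# Synthetic 1-5 Likert ratings standing in for the real study data.
div_ratings = [random.randint(2, 5) for _ in range(n_students)]
non_div_ratings = [max(1, r - random.randint(0, 2)) for r in div_ratings]

# Paired t statistic: mean per-student rating difference divided by its
# standard error (each student rates both algorithms, so tests are paired).
diffs = [d - n for d, n in zip(div_ratings, non_div_ratings)]
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n_students))

# Bonferroni correction: divide the base alpha by the number of tests.
alpha = 0.05
bonferroni_threshold = alpha / n_tests  # ~0.00083, in line with the 0.001 level
```

A t statistic exceeding the critical value at the corrected threshold would then count as significant, as in the pairwise comparisons reported in Table 1.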
      <p>The most dramatic diference can be seen in the measure of
novelty, where BOW (div) scored 3.896 and the system’s existing RNN
(non-div) scored 1.824, the lowest rating in the results matrix. The
proportion of each rating level given to the two algorithms on this
question is shown in Figures 3 and 5. Hypothetically, an algorithm
that recommended randomly selected courses would score high
in both novelty and unexpectedness; it is therefore critical to also
weigh an algorithm's ability to recommend courses that are of interest
to students. Figure 4 shows successfulness ratings for each of the
algorithms aggregated by rank of the course result. The non-div
algorithms, shown with dotted lines, always perform as well or
better than the div algorithms at every rank. The more steeply
declining slope of the div algorithms depicts the increasing difficulty
of finding courses of interest across different departments. The
tension between recommending courses that are of interest and courses
that are unexpected is shown in Figure 6, where the most serendipitous
model, BOW (div), recommends a top course of higher
successfulness than unexpectedness, with the two measures intersecting
at rank 2 and diverging afterwards. The best equivalency model,
combining course description tf-idf and course2vec (non-div),
maintains high successfulness but low unexpectedness
across the 10 course recommendation ranks.</p>
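The best equivalency model's combination of course description tf-idf with course2vec can be sketched as cosine similarity over concatenated, L2-normalized representations. The concatenation scheme and toy vectors below are illustrative assumptions, not the exact formulation used in the production system.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def equivalency_score(tfidf_a, tfidf_b, c2v_a, c2v_b):
    """Blend description (tf-idf) and enrollment (course2vec) similarity
    by concatenating the L2-normalized halves of each course's vector."""
    a = np.concatenate([tfidf_a / np.linalg.norm(tfidf_a),
                        c2v_a / np.linalg.norm(c2v_a)])
    b = np.concatenate([tfidf_b / np.linalg.norm(tfidf_b),
                        c2v_b / np.linalg.norm(c2v_b)])
    return cosine(a, b)

# Toy example: identical descriptions but different enrollment
# embeddings yield a score averaging the two component similarities.
tfidf = np.array([0.5, 0.5, 0.0])
c2v_x = np.array([1.0, 0.0])
c2v_y = np.array([0.6, 0.8])
score = equivalency_score(tfidf, tfidf, c2v_x, c2v_y)
```

Because each half is normalized before concatenation, the two signals contribute equally; a weighted concatenation would let one signal dominate.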
      <p>Are more senior students less likely to rate courses as novel or
unexpected, given they have been at the University longer and been
exposed to more courses? Among our sophomore (27), junior (22),
and senior (21) level students, there were no statistically significant
trends among the six measures, except for a marginally significant
trend (p = 0.007, shy of the p &lt; 0.003 threshold given the Bonferroni
correction) of more senior students rating recommendations as less
unexpected (avg = 2.921) than juniors (avg = 3.024), whose ratings
were not statistically separable from sophomores (avg = 3.073).</p>
      <p>In this section, we attempt to synthesize qualitative
characterizations of the different algorithms by looking at the open
responses students gave to the question asking them to describe any
commonalities they saw between the recommendations made by each
algorithm and their favorite course.</p>
      <p>5.3.1 BOW (div). Several students remarked positively about
recommendations matching to the themes of "art, philosophy, and
society" or "design" exhibited in their favorite course. The word
"language" was mentioned by 14 of the 61 respondents answering
the open response question. Most of these comments were negative,
pointing out the limitations of similarity matching based solely
on the literal course description text. The most common critique
given in this category was of the foreign spoken language courses
that showed up at the lower ranks when students specified a
favorite course involving programming languages. Other students
remarked on additional dissimilarity when specifying a favorite
course related to cyber security and receiving financial security courses
in the results.</p>
      <p>5.3.2 Analogy (div). The word "interesting" appeared in seven of
the 54 comments left by students to describe commonalities among
the analogy validation optimized algorithm. This word was not
among the top 10 most frequent words in any of the other four
algorithms. Several students identified broad themes among the
courses that matched to their favorite course, such as "identity" and
"societal development." On the other end of the spectrum, one
student remarked that the results "felt weird" and were only "vaguely
relevant." Another student stated that "the most interesting
suggestion was the Introduction to Embedded Systems [course] which is
just different enough from my favorite course that it’s interesting
but not too different that I am not interested," which poignantly
articulates the crux of the difficulty in striking a balance between
interest and unexpectedness to achieve a serendipitous recommendation.</p>
      <p>5.3.3 Equivalency (div). Many students (seven of the 55) remarked
positively on the commonality of the results with themes of data
exhibited by their favorite course (in most cases STATS C8, an
introductory data science course). They mentioned how the courses
all involved "interacting with data in different social, economic,
and psychological contexts" and "data analysis with different
applications." One student remarked on this algorithm’s tendency to
match at or around the main topic of the favorite course, further
remarking that "they were relevant if looking for a class tangentially
related."</p>
      <p>5.3.4 Equivalency (non-div). This algorithm was the same as the
above, except that it did not limit results to one course per
department. Because of this lack of department filter, 15 of the 68 students
submitting open text responses to the question of commonality
pointed out that the courses returned were all from the same
department. Since this model scored highest on a validation task of
matching to a credit equivalent course pair (almost always in the
same department), it is not surprising that students observed that
results from this algorithm tended to all come from the department
of the favorite course, which also put it close to their nexus of
interest.</p>
      <p>5.3.5 RNN (non-div). The RNN scored lowest in novelty,
significantly lower than the other non-div algorithm, and was not
significantly different from it in successfulness.
In this case, what is the possible utility of the collaborative-based
RNN over the non-div Equivalency model? Many of the 47 (of 50)
student answers to the open response commonality question
identified, as the distinguishing signature of this algorithm, that the
recommendations related to their major (mentioned by 21 students) and
contained courses that fulfilled a requirement (mentioned by
seven). Since the RNN is based on normative next-course enrollment
behavior, it is reasonable that it suggested many courses that satisfy
an unmet requirement. This algorithm’s ability to predict student
enrollments accurately became a detriment to some, as seven
remarked that it was recommending courses that they were currently
enrolled in. Due to the institutional data refresh schedule, student
current enrollments are not known until after the add/drop deadline.
This may be a shortcoming that can be rectified in the future.
</p>
    </sec>
    <sec id="sec-20">
      <title>6 FEATURE RE-DESIGN</title>
      <p>As a result of the feedback received from the user study, we worked
with campus to pull down real-time information on student
requirement satisfaction from the Academic Plan Review module of the
PeopleSoft Student Information System. We re-framed the RNN
feature as a "Requirements" satisfying feature that, upon log-in, shows
students their personalized list of unsatisfied requirements (Figure
8). After selecting a requirement category to satisfy, the system
displays courses which satisfy the selected requirement and are
offered in the target semester. The list of courses is sorted by the RNN
according to the probability that students like them will take the
class. This provides a signal to the student of what the normative
course taking behavior is in the context of requirement satisfaction.
For serendipitous suggestions, we created a separate "Explore" tab
(Figure 7) using the BOW (div) model to surface the top five similar
courses across departments, due to its strong serendipity and
novelty ratings. The Equivalency (non-div) model was used to display
an additional five most similar courses within the same department.
This model was chosen due to its strong successfulness ratings.
</p>
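The cross-department diversification that distinguishes the "(div)" models and drives the Explore tab can be sketched as a post-filter over a similarity-ranked list. The course list and the (course_id, department) field layout below are hypothetical, for illustration only.

```python
def diversify_by_department(ranked_courses, top_n=5):
    """Keep at most one course per department, preserving rank order.

    ranked_courses: similarity-ranked (course_id, department) pairs.
    A non-div variant would simply return the first top_n course ids.
    """
    seen_departments = set()
    results = []
    for course_id, department in ranked_courses:
        if department in seen_departments:
            continue  # skip further courses from an already-used department
        seen_departments.add(department)
        results.append(course_id)
        if len(results) == top_n:
            break
    return results

# Hypothetical ranked results for illustration only.
ranked = [("CS 189", "EECS"), ("CS 188", "EECS"),
          ("STAT 154", "Statistics"), ("DATA 102", "Data Science"),
          ("INFO 251", "Information"), ("IND ENG 142", "IEOR")]
top = diversify_by_department(ranked, top_n=4)
```

Applying the filter trades raw similarity (successfulness) for departmental spread, which is the tension the rating results above illustrate.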
    </sec>
    <sec id="sec-21">
      <title>7 DISCUSSION</title>
      <p>Surfacing courses that are of interest but not known before means
expanding a student’s knowledge and understanding of the
University’s offerings. As students are exposed to courses that veer further
from their home department and nexus of interest and
understanding, recommendations become less familiar, with descriptions that
are harder to connect with. This underscores the difficulty of
producing an unexpected but interesting course suggestion, as it often
must represent a recommendation of uncommon wisdom in order
to extend outside of a student’s zone of familiarity surrounding
their centers of interest. Big data can be a vehicle for, at times,
reaching that wisdom. Are recommendations useful when they
suggest something expected or already known? Two distinct sets of
responses to this question emerged from student answers to the last
open ended feedback question. One representative remark stated,
"The best algorithms were the ones that had more
diverse options, while still staying true to the core
function of the class I was searching. The
algorithms that returned classes that were my major
requirements/in the same department weren’t as
helpful because I already knew of their existence
as electives I could be taking"
A different representative view was expressed with:
"I think the fifth algorithm [RNN] was the best fit
for me because my major is pretty standardized"
These two comments make a case for both capabilities being of
importance. They are also a reminder of the desire among young
adults for the socio-technical systems of the university to offer a
balance of information, exploration and, at times, guidance.</p>
    </sec>
    <sec id="sec-22">
      <title>8 LIMITATIONS</title>
      <p>
        The more distal a course description is, even if conceptually similar,
the less a student may be able to recognize the commonality with
a favorite course. A limitation of our study in demonstrating the
utility of the neural embedding is that students had to rely on the
course description semantics in order to familiarize themselves
with the suggested course and determine if they were interested
in taking it. If a concept was detected by the neural embedding
but not the BOW, this likely meant that the concept was difficult
to pick up from the course description displayed to students. Past
work has shown that, when important aspects of the recommended
content are not described in the recommendation, users evaluate media
recommendations less favorably before taking the recommendation
than after [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Future work could augment recommended
course descriptions with additional information, including latent
semantics inferred from enrollments [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or additional semantics
retrieved from available course syllabi.
      </p>
    </sec>
    <sec id="sec-23">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was partly supported by the United States National
Science Foundation (1547055/1446641) and the National Natural
Science Foundation of China (71772101/71490724).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Sorathan</given-names>
            <surname>Chaturapruek</surname>
          </string-name>
          , Thomas Dee, Ramesh Johari, René Kizilcec, and Mitchell Stevens.
          <year>2018</year>
          .
          <article-title>How a data-driven course planning tool affects college students' GPA: evidence from two field experiments</article-title>
          . (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Hung-Hsuan</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Behavior2Vec: Generating Distributed Representations of Users’ Behaviors on Products for Recommender Systems</article-title>
          .
          <source>ACM Transactions on Knowledge Discovery from Data (TKDD) 12</source>
          ,
          <issue>4</issue>
          (
          <year>2018</year>
          ),
          <fpage>43</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          , Prabhakar Raghavan, and
          <string-name>
            <given-names>Hinrich</given-names>
            <surname>Schütze</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Introduction to information retrieval</article-title>
          .
          <source>An Introduction To Information Retrieval</source>
          <volume>151</volume>
          ,
          <issue>177</issue>
          (
          <year>2008</year>
          ),
          <fpage>5</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Dillon</surname>
          </string-name>
          .
          <year>1983</year>
          .
          <article-title>Introduction to modern information retrieval: G. Salton and M. McGill</article-title>
          .
          McGraw-Hill, New York (
          <year>1983</year>
          ). 448 pp.,
          <source>ISBN 0-07-054484-0.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Matt</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Run</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zach A</given-names>
            <surname>Pardos</surname>
          </string-name>
          . In press.
          <article-title>Design and Deployment of a Better Course Search Tool: Inferring latent keywords from enrollment networks</article-title>
          .
          <source>In Proceedings of the 14th European Conference on Technology Enhanced Learning</source>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Rosta</given-names>
            <surname>Farzan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Brusilovsky</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Encouraging user participation in a course recommender system: An impact on user behavior</article-title>
          .
          <source>Computers in Human Behavior</source>
          <volume>27</volume>
          ,
          <issue>1</issue>
          (
          <year>2011</year>
          ),
          <fpage>276</fpage>
          -
          <lpage>284</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Li</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Rob</given-names>
            <surname>Fergus</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Pietro</given-names>
            <surname>Perona</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>One-shot learning of object categories</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence 28</source>
          ,
          <issue>4</issue>
          (
          <year>2006</year>
          ),
          <fpage>594</fpage>
          -
          <lpage>611</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Balázs</given-names>
            <surname>Hidasi</surname>
          </string-name>
          , Massimo Quadrana, Alexandros Karatzoglou, and
          <string-name>
            <given-names>Domonkos</given-names>
            <surname>Tikk</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Parallel recurrent neural network architectures for feature-rich session-based recommendations</article-title>
          .
          <source>In Proceedings of the 10th ACM Conference on Recommender Systems. ACM</source>
          ,
          <volume>241</volume>
          -
          <fpage>248</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Weijie</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zachary A</given-names>
            <surname>Pardos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Qiang</given-names>
            <surname>Wei</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Goal-based course recommendation</article-title>
          .
          <source>In Proceedings of the 9th International Conference on Learning Analytics &amp; Knowledge. ACM</source>
          ,
          <volume>36</volume>
          -
          <fpage>45</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Judy</given-names>
            <surname>Kay</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Stereotypes, student models and scrutability</article-title>
          .
          <source>In International Conference on Intelligent Tutoring Systems</source>
          . Springer,
          <fpage>19</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Benedikt</given-names>
            <surname>Loepp</surname>
          </string-name>
          , Tim Donkers, Timm Kleemann, and
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Ziegler</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Impact of item consumption on assessment of recommendations in user studies</article-title>
          .
          <source>In Proceedings of the 12th ACM Conference on Recommender Systems. ACM</source>
          ,
          <volume>49</volume>
          -
          <fpage>53</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Tien T</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pik-Mai</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F Maxwell</given-names>
            <surname>Harper</surname>
          </string-name>
          , Loren Terveen, and Joseph A Konstan
          .
          <year>2014</year>
          .
          <article-title>Exploring the filter bubble: the effect of using recommender systems on content diversity</article-title>
          .
          <source>In Proceedings of the 23rd international conference on World wide web. ACM</source>
          ,
          <volume>677</volume>
          -
          <fpage>686</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Aditya</given-names>
            <surname>Parameswaran</surname>
          </string-name>
          , Petros Venetis, and
          <string-name>
            <given-names>Hector</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Recommendation systems with complex constraints: A course recommendation perspective</article-title>
          .
          <source>ACM Transactions on Information Systems (TOIS) 29</source>
          ,
          <issue>4</issue>
          (
          <year>2011</year>
          ),
          <fpage>20</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Zachary A</given-names>
            <surname>Pardos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zihao</given-names>
            <surname>Fan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Weijie</given-names>
            <surname>Jiang</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Connectionist recommendation in the wild: on the utility and scrutability of neural networks for personalized course guidance</article-title>
          .
          <source>User Modeling and User-Adapted Interaction 29</source>
          ,
          <issue>2</issue>
          (
          <year>2019</year>
          ),
          <fpage>487</fpage>
          -
          <lpage>525</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Zachary A</given-names>
            <surname>Pardos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Weijie</given-names>
            <surname>Jiang</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Combating the Filter Bubble: Designing for Serendipity in a University Course Recommendation System</article-title>
          . arXiv preprint arXiv:1907.01591 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Zachary A</given-names>
            <surname>Pardos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew Joo Hun</given-names>
            <surname>Nam</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A Map of Knowledge</article-title>
          . CoRR preprint abs/1811.07974 (
          <year>2018</year>
          ). https://arxiv.org/abs/1811.07974
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Agoritsa</given-names>
            <surname>Polyzou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Athanasios N</given-names>
            <surname>Nikolakopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>George</given-names>
            <surname>Karypis</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Scholars Walk: A Markov Chain Framework for Course Recommendation</article-title>
          .
          <source>In Proceedings of the 12th International Conference on Educational Data Mining</source>
          .
          <fpage>396</fpage>
          -
          <lpage>401</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Zhiyun</given-names>
            <surname>Ren</surname>
          </string-name>
          , Xia Ning, Andrew S Lan, and
          <string-name>
            <given-names>Huzefa</given-names>
            <surname>Rangwala</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Grade Prediction Based on Cumulative Knowledge and Co-taken Courses</article-title>
          .
          <source>In Proceedings of the 12th International Conference on Educational Data Mining</source>
          .
          <fpage>158</fpage>
          -
          <lpage>167</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Leonardo FR</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          , Pedro HP Saverese, and Daniel R Figueiredo.
          <year>2017</year>
          .
          <article-title>struc2vec: Learning node representations from structural identity</article-title>
          .
          <source>In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM</source>
          ,
          <volume>385</volume>
          -
          <fpage>394</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Guy</given-names>
            <surname>Shani</surname>
          </string-name>
          and
          <string-name>
            <given-names>Asela</given-names>
            <surname>Gunawardana</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Evaluating recommendation systems</article-title>
          .
          <source>In Recommender systems handbook</source>
          . Springer,
          <fpage>257</fpage>
          -
          <lpage>297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          .
          <volume>5998</volume>
          -
          <fpage>6008</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , Charles Blundell, Tim Lillicrap,
          <string-name>
            <given-names>Daan</given-names>
            <surname>Wierstra</surname>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>Matching networks for one shot learning</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          .
          <volume>3630</volume>
          -
          <fpage>3638</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Yuan Cao</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Diarmuid Ó Séaghdha, Daniele Quercia, and
          <string-name>
            <given-names>Tamas</given-names>
            <surname>Jambor</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Auralist: introducing serendipity into music recommendation</article-title>
          .
          <source>In Proceedings of the fifth ACM international conference on Web search and data mining . ACM</source>
          ,
          <volume>13</volume>
          -
          <fpage>22</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>