=Paper=
{{Paper
|id=Vol-2450/paper3
|storemode=property
|title=Designing for Serendipity in a University Course Recommendation System
|pdfUrl=https://ceur-ws.org/Vol-2450/paper3.pdf
|volume=Vol-2450
|authors=Zach Pardos,Weijie Jiang
|dblpUrl=https://dblp.org/rec/conf/recsys/PardosJ19
}}
==Designing for Serendipity in a University Course Recommendation System==
Zachary Pardos (University of California, Berkeley, zp@berkeley.edu) and Weijie Jiang (Tsinghua University & University of California, Berkeley, jiangwj.14@sem.tsinghua.edu.cn)

ABSTRACT
Collaborative filtering based algorithms, including Recurrent Neural Networks (RNN), tend towards predicting a perpetuation of past observed behavior. In a recommendation context, this can lead to an overly narrow set of suggestions lacking in serendipity and inadvertently placing the user in what is known as a "filter bubble." In this paper, we grapple with the issue of the filter bubble in the context of a course recommendation system in production at a public university. Our approach is to present course results that are novel or unexpected to the student but still relevant to their interests. We build one set of models based on the course catalog description (BOW) and another set informed by enrollment histories (course2vec). We compare the performance of these models on off-line validation sets and against the system's existing RNN-based recommendation engine in a user study of undergraduates (N = 70) who rated their course recommendations along six characteristics related to serendipity. Results of the user study show a dramatic lack of novelty in RNN recommendations and depict the characteristic trade-offs that make serendipity difficult to achieve. While the machine learned course2vec models performed best on concept generalization tasks (i.e., course analogies), it was the simple bag-of-words based recommendations that students rated as more serendipitous. We discuss the role of the recommendation interface, and the information presented therein, in the student's decision to accept a recommendation from either algorithm.

CCS CONCEPTS
• Applied computing → Education; • Information systems → Recommender systems.

KEYWORDS
Higher education, course guidance, filter bubble, neural networks

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IntRS '19: Joint Workshop on Interfaces and Human Decision Making for Recommender Systems, 19 Sept 2019, Copenhagen, DK.

1 INTRODUCTION
Among the institutional values of a liberal arts university is to expose students to a variety of perspectives expressed in courses across its various physical and intellectual schools of thought. Collaborative filtering based sequence prediction methods, in this environment, can provide personalized course recommendations based on temporal models of normative behavior [14] but are not well suited for surfacing courses a student may find interesting but which have been relatively unexplored by those with similar course selections to them in the past. Therefore, a more diversity oriented model can serve as an appropriate complement to recommendations made from collaborative based methods. This problem of training on the past without necessarily repeating it is an open problem in many collaborative filtering based recommendation contexts, particularly social networks, where, in the degenerate cases, users can get caught in "filter bubbles," or model-based user stereotypes, leading to a narrowing of item recommendation variety [10, 12, 23].

To counteract the filter bubble, we introduce a course2vec variant into a production recommender system at a public university designed to surface serendipitous course suggestions. Course2vec applies a skip-gram to course enrollment histories, instead of natural language, in order to learn course representations. We use the definition of serendipity as user perceived unexpectedness of a result combined with successfulness [20], which we define as a course recommendation the student expresses interest in taking. At many universities, conceptually similar courses exist across departments but use widely differing disciplinary vernacular in their catalog descriptions, making them difficult for learners to search for and to realize their commonality. We propose that by tuning a vector representation of courses learned from nine years of enrollment sequences, we can capture enough implicit semantics of the courses to construe similarity more abstractly, and more accurately. To encourage the embedding to learn features that may generalize across departments, our skip-gram variant simultaneously learns department (and instructor) embeddings. While more advanced attention-based text generation architectures exist [21], we demonstrate that properties of the linear vector space produced by "shallow" networks are of utility to this recommendation task.

Our recommendations are made with only a single explicit course preference given by the user, as opposed to the entire course selection history needed by session-based Recurrent Neural Network approaches [8]. Single example, also known as "one-shot," generalization is common in the vision community, which has pioneered approaches to extrapolating a category from a single labeled example [7, 22]. Other related work applying skip-grams to non-linguistic data includes node embeddings learned from sequences of random walks of graphs [19] and product embeddings learned from e-commerce clickstreams [2]. Our work, methodologically, adds rigor to this approach by tuning the model against validation sets created from institutional knowledge and curated by the university.

We conduct a user study (N = 70) of undergraduates at the University to evaluate their personalized course recommendations made by our models designed for serendipity and by the RNN-based engine which previously drove recommendations in the system. The findings underscore the tension between unexpectedness and successfulness and show the deficiency of RNNs for producing novel recommendations. While our course2vec based model scored 68% above bag-of-words in accuracy on one of our course analogy validation sets, simple bag-of-words scored slightly higher on the main objective of user perceived serendipity. A potential reason for this discrepancy is the nature of the information presented to students in the recommender system interface. Catalog descriptions of recommended courses were shown to students and served as the only source of information they could consult in deciding if they wanted to take the course. A generated explanation, or prioritization of the course2vec recommendations in the interface, may be required to overcome the advantage of the bag-of-words model being based on the same information shown to students in the recommendations.

Recommender systems in higher education contexts have recently focused on prediction of which courses a student will take [14, 17] or the grade they will receive if enrolled [9, 18]. At Stanford, a system called "CARTA" allows students to see grade distributions, course evaluations, and the most common courses taken before a course of interest [1]. At UC Berkeley, the recommender system being modified in this study serves students next-semester course considerations based on their personal course enrollment history [14].
Earlier systems included a focus on requirement satisfaction [13] and career-based relevancy recommendation [6]. No system has yet focused on serendipitous or novel course discovery.

2 MODELS AND METHODOLOGY
This section introduces three competing models used to generate our course representations. The first model uses course2vec [14] to learn course representations from enrollment sequences. Our second model is a variant on course2vec which learns representations of explicitly defined features of a course (e.g., instructor or department) in addition to the course representation. The intuition behind this approach is that the course representation could have, conflated in it, the influence of the multiple instructors that have taught the course over time. We contend that this "deconflation" may increase the fidelity of the course representation and serve as a more accurate representation of the topical essence of the course. The last representation model is a standard bag-of-words vector, constructed for each course strictly from its catalog description. Finally, we explore concatenating a course's course2vec and bag-of-words representation vectors.

2.1 Course2vec
The course2vec model involves learning distributed representations of courses from students' enrollment records throughout semesters by using a notion of an enrollment sequence as a "sentence" and courses within the sequence as "words," borrowing terminology from the linguistic domain. For each student s, a chronological course enrollment sequence is produced by first sorting by semester, then randomly serializing within-semester course order. Each course enrollment sequence is then used in training, similar to a document in a classical skip-gram application.

The training objective of the skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document. Each word in the corpus is used as an input to a log-linear classifier with a continuous projection layer, to predict words within a certain range before and after the current word. The skip-gram model can therefore also be viewed as a classifier with a target course as input and a context course as output. In this section, we consider adding features of a course to the input to enhance the classifier and its representations, as shown in Figure 1. Each course is taught by one or several instructors over the years and is associated with an academic department. The multi-factor course2vec model learns both course and course feature representations by maximizing the objective function over all the students' enrollment sequences and the features of courses. Full technical details can be found in [15].

Figure 1: multi-factor course2vec model

In language models, two word vectors will be cosine similar if they share similar sentence contexts. Likewise, in the university domain, courses that share similar co-enrollments, and similar previous and next semester enrollments, will likely be close to one another in the vector space.
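To make the skip-gram training loop described above concrete, the following is a minimal, pure-Python sketch of a skip-gram with negative sampling run over enrollment "sentences." It is an illustration of the idea only, not the authors' implementation (their multi-factor model is built in pyTorch and detailed in [15]); the course IDs and every hyperparameter below are invented for the example.

```python
import math
import random

def train_course2vec(sequences, dim=16, window=2, epochs=10, lr=0.05, neg=3, seed=0):
    """Minimal skip-gram with negative sampling over enrollment sequences.

    Each sequence is one student's chronologically ordered course list,
    treated like a 'sentence' whose 'words' are course IDs.
    """
    rng = random.Random(seed)
    vocab = sorted({c for seq in sequences for c in seq})
    idx = {c: i for i, c in enumerate(vocab)}
    # input (target) and output (context) embedding tables
    w_in = [[(rng.random() - 0.5) / dim for _ in range(dim)] for _ in vocab]
    w_out = [[0.0] * dim for _ in vocab]

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, x))))

    for _ in range(epochs):
        for seq in sequences:
            for pos, course in enumerate(seq):
                target = idx[course]
                lo, hi = max(0, pos - window), min(len(seq), pos + window + 1)
                for cpos in range(lo, hi):
                    if cpos == pos:
                        continue
                    # one positive context plus `neg` random negatives
                    # (a real sampler would exclude the true context/target)
                    pairs = [(idx[seq[cpos]], 1.0)]
                    pairs += [(rng.randrange(len(vocab)), 0.0) for _ in range(neg)]
                    v = w_in[target]
                    grad_v = [0.0] * dim
                    for out, label in pairs:
                        u = w_out[out]
                        g = lr * (label - sigmoid(sum(a * b for a, b in zip(v, u))))
                        for k in range(dim):
                            grad_v[k] += g * u[k]
                            u[k] += g * v[k]
                    for k in range(dim):
                        v[k] += grad_v[k]
    return {c: w_in[idx[c]] for c in vocab}
```

Courses that co-occur in the same enrollment windows end up with similar input vectors, mirroring how words sharing sentence contexts become cosine similar in language models.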
2.2 Bag-of-Words
A simple but enduring approach to item representation has been to create a vector the length of the number of unique words across all items, with a non-zero value if a vocabulary word appears in the item. Only unigram words are used to create this unordered vector of words representing the document [3].

The basic bag-of-words methodology proposed by IR researchers for text corpora, a methodology successfully deployed in modern Internet search engines, reduces each document in the corpus to a vector of real numbers, each of which represents a term weight. The term weight might be:
• a term frequency value indicating how many times the term occurred in the document;
• a binary value, with 1 indicating that the term occurred in the document and 0 indicating that it did not;
• a tf-idf weight [4], the product of term frequency and inverse document frequency, which increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus, adjusting for the fact that some words appear more frequently in general.
We evaluate all three variants in our quantitative validation testing.

2.3 Surfacing Serendipitous Recommendations from Course Representations
We surface recommendations intended to be interesting but unexpected by finding the course c*_j in each department d_j that is most similar to a student's favorite course c_i, diversifying the results by allowing only one result per department:

    c*_j = argmax_{c : d(c) = d_j} cos(c, c_i)    (1)

where d(c) denotes the department of course c. The counterpart courses c*_j from all the other departments are then ranked according to cos(c*_j, c_i), where j = 1, 2, ..., k. We can apply both neural representations and bag-of-words representations of courses in this method to generate the most similar courses in each department.

3 EXPERIMENTAL ENVIRONMENTS
3.1 Off-line Dataset
We used a dataset containing anonymized student course enrollments at UC Berkeley from Fall 2008 through Fall 2017. The dataset consists of per-semester course enrollment records for 164,196 students (both undergraduates and graduates) with a total of 4.8 million enrollments. A course enrollment record means that the student was still enrolled in the course at the end of the semester. Students at this university, during this period, were allowed to drop courses up until close to the end of the semester without penalty. The median course load during students' active semesters was four. There were 9,478 unique lecture courses from 214 departments (at UC Berkeley, the smallest academic unit is called a "subject"; for the purpose of communicability, we instead refer to subjects as departments) hosted in 17 different Divisions of 6 different Colleges. Course meta-information contains course number, department name, total enrollment, and max capacity. In this paper, we only consider lecture courses with at least 20 enrollments total over the 9-year period, leaving 7,487 courses. Although courses can be categorized as undergraduate courses and graduate courses, undergraduates are permitted to enroll in many graduate courses regardless of their status.

Enrollment data were sourced from the campus enterprise data warehouse, with course descriptions sourced from the official campus course catalog API. We pre-processed the course description data in the following steps: (1) removing generic, often-seen sentences across descriptions, (2) removing stop words, (3) removing punctuation, (4) word lemmatization and stemming, and finally tokenizing the bag-of-words in each course description. We then compiled the term frequency vector, binary value vector, and tf-idf vector for each course.

3.1.1 Semantic Validation Sets. In order to quantitatively evaluate how accurate the vector models are, a source of ground truth on the relationships between courses needed to be brought to bear to see the degree to which the vector representations encoded this information. We used two such sources of ground truth to serve as validation sets, one providing information on similarity, the other on a variety of semantic relationships.
• Equivalency validation set: A set of 1,351 course credit-equivalency pairs maintained by the Office of the Registrar were used for similarity based ground truth. A course is paired with another course in this set if a student can only receive credit for taking one of the courses. For example, an honors and non-honors version of a course will appear as a pair because faculty have deemed that there is too much overlapping material between the two for a student to receive credit for both.
• Analogy validation set: The standard method for validating learned word vectors has been to use analogies to test the degree to which the embedding structure contains semantic and syntactic relationships constructed from prior knowledge. In the domain of university courses, we use course relationship pairs constructed from prior work using first-hand knowledge of the courses [16]. The 77 relationship pairs were in five categories: online, honors, mathematical rigor, 2-department topics, and 3-department topics. An example of an "online" course pair would be Engineering 7 and its online counterpart, Engineering W7, or Education 161 and W161. An analogy involving these two pairs could be calculated as: Engineering W7 − Engineering 7 + Education 161 ≈ Education W161.

3.2 Online Environment (System Overview)
The production recommender system at UC Berkeley uses a student data pipeline with the enterprise data warehouse to keep up-to-date enrollment histories of students. Upon CAS login, these histories are associated with the student and passed through an RNN model, which cross-references the output recommendations with the courses offered in the target semester. Class availability information is retrieved during the previous semester from a campus API once the registrar has released the schedule. The system is written with an AngularJS front-end and a Python back-end service which loads the machine learned models written in pyTorch. These models are version controlled on GitHub and refreshed three times per semester after student enrollment status refreshes from the pipeline. The system receives traffic of around 20% of the undergraduate student body, partly from the UC Berkeley Registrar's website.

4 VECTOR MODEL REFINEMENT EXPERIMENTS
In this section, we first introduce our experiment parameters and the ways we validated the representations quantitatively. Then, we describe the various ways in which we refined the models and the results of these refinements.

4.1 Model Evaluations
We trained the models described in Section 2.1 on the student enrollment records data. Specifically, we added the instructor(s) who teach the course and the course department as two input features of courses in the multi-factor course2vec model.

To evaluate course vectors on the course equivalency validation set, we fixed the first course in each pair and ranked all the other courses according to their cosine similarity to the first course in descending order. We then noted the rank of the expected second course in the pair and described the performance of each model on all validation pairs in terms of mean rank, median rank, and recall@10.

For evaluation on the course analogy validation set, we followed the analogy paradigm of course2 − course1 + course3 ≈ course4. Courses were ranked by their cosine similarity to course2 − course1 + course3. An analogy completion is considered accurate (a hit) if the first ranked course is the expected course4 (excluding the other three from the list).
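The department-diversified retrieval of Equation (1) can be sketched directly: keep only the single most cosine-similar course per department, then rank those per-department winners. This is an illustrative sketch assuming courses are already mapped to vectors and departments; the course and department names below are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def diversified_recommendations(favorite, vectors, dept_of, k=10):
    """Eq. (1): per department, keep the course most similar to the
    favorite course; then rank the per-department winners by similarity."""
    fav_vec = vectors[favorite]
    best_per_dept = {}
    for course, vec in vectors.items():
        if course == favorite:
            continue  # never recommend the favorite course itself
        sim = cosine(fav_vec, vec)
        dept = dept_of[course]
        if dept not in best_per_dept or sim > best_per_dept[dept][1]:
            best_per_dept[dept] = (course, sim)
    ranked = sorted(best_per_dept.values(), key=lambda pair: -pair[1])
    return [course for course, _ in ranked[:k]]
```

Dropping the one-per-department constraint (the "non-div" condition in the user study) would simply mean ranking all courses by similarity without the `best_per_dept` filtering step.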
We calculated the average accuracy (recall@1) and the recall@10 over all the analogies in the analogy validation set.

4.2 Course2vec vs. Multi-factor Course2vec
We compared the pure course2vec model with the course representations from the multi-factor course2vec model using instructor, department, and both as factors. Full results of the evaluation on the equivalency and analogy validation sets are shown in [15]. The multi-factor model outperformed the pure course2vec model in terms of recall@10 in both validation sets, with the combined instructor and department factor model performing the best.

4.3 Bag-of-words vs. Multi-factor Course2vec
Among the three bag-of-words models, tf-idf performed the best in all equivalency set metrics. The median rank (best=4) and recall@10 (best=0.5647) for the bag-of-words models were also substantially better than those of the best course2vec models, which had a best median rank of 15 with a best recall@10 of 0.4485 for the multi-factor instructor and department model. All course2vec models, however, showed better mean rank performance (best=224) compared with bag-of-words (best=566). This suggests that there are many outliers where literal semantic similarity (bag-of-words) is very poor at identifying equivalent pairs, whereas course2vec has far fewer near worst-case examples. This result is consistent with prior work comparing pure course2vec models to binary bag-of-words [14].

When considering performance on the analogy validation set, the roles are reversed, with all course2vec models performing better than the bag-of-words models in both accuracy and recall@10. The difference in recall between bag-of-words and course2vec on analogies is substantial (0.581 vs 0.8557), a considerably larger difference than between bag-of-words and course2vec on equivalencies (0.5647 vs 0.4485). Again, the multi-factor instructor and department model and tf-idf were the best models in their respective classes. These analyses establish that bag-of-words models are moderately superior at capturing course similarity, but are highly inferior to enrollment-based course2vec models in the more complex task of analogy completion.

4.4 Combining Bag-of-words and Course2vec Representations
In light of the strong analogy performance of course2vec and the strong equivalency performance of bag-of-words in the previous section, we concatenated the multi-factor course2vec representations with bag-of-words representations. To address the different magnitudes of the two concatenated representations, we created a normalized version of each vector set for comparison to the non-normalized sets.

We found that the normalized concatenation of tf-idf with multi-factor course2vec performed substantially better on the equivalency test than the previous best model in terms of recall@10 (0.6435 vs. 0.5647). While the median rank of the concatenated model only improved one rank, from 4 to 3, the mean rank improved dramatically (from 566 to 132) and is the best of all models tested in terms of mean rank. Non-normalized vectors did not show improvements over bag-of-words alone in median rank and recall@10. Improvements in the analogy test were more mild, with a recall@10 of 0.8788 for the best concatenated model, combining binary bag-of-words with multi-factor course2vec, compared with 0.8557 for the best course2vec-only model. Normalization in the case of analogies hurt all model performance, the opposite of what was observed in the equivalency test. This suggests that normalization improves local similarity but may act to degrade the more global structure of the vector space.

5 USER STUDY
A user study was conducted to evaluate the quality of recommendations drawn from our different course representations. Users rated each course from each recommendation algorithm along five dimensions of quality. Students were asked to rate course recommendations in terms of their (1) unexpectedness, (2) successfulness / interest in taking the course, (3) novelty, (4) diversity of the results, and (5) identifiable commonality among the results. In Shani and Gunawardana [20], the authors defined serendipity as the combination of "unexpectedness" and "success." In the case of a song recommender, for example, success would be defined as the user listening to the recommendation. In our case, we use a student's expression of interest in taking the course as a proxy for success. The mean of the unexpectedness and successfulness ratings comprises our measure of serendipity. We evaluated three of our developed models, all of which displayed 10 results, only showing one course per department in order to increase diversity (and unexpectedness). The models were (1) the best BOW model (tf-idf), (2) the best Analogy validation model (binary BOW + multi-factor course2vec, normalized), and (3) the best Equivalency validation model (tf-idf + multi-factor course2vec, non-normalized). To measure the impact our department diversification filter would have on serendipity, we added a version of the best Equivalency model that did not impose this filter, allowing multiple courses from the same department to be displayed if they were the most cosine similar to the user's specified favorite course. Our fifth comparison recommendation algorithm was the system's existing collaborative-filtering based Recurrent Neural Network (RNN), which recommends courses based on a prediction of what the student is likely to take next given their personal course history and what other students with a similar history have taken in the past [14]. All five algorithms were integrated into a real-world recommender system for the purpose of this study and evaluated by 70 undergraduates at the University.

5.1 Study Design
Undergraduates were recruited from popular University associated Facebook groups and asked to sign up for a one hour evaluation session. Since they would need to specify a favorite course they had taken, we restricted participants to those who had been at the University at least one full semester and were currently enrolled.
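The combination step in Section 4.4 (concatenating a course's bag-of-words and course2vec vectors, optionally L2-normalizing each part first so that neither representation's magnitude dominates cosine similarities) can be sketched as follows. This is a minimal sketch of the idea, not the authors' code; the toy vectors are invented.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if all-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def concat_representations(bow_vec, c2v_vec, normalize=True):
    """Concatenate a course's bag-of-words and course2vec vectors.

    Normalizing each part first keeps the typically much larger
    bag-of-words magnitudes from dominating the combined vector.
    """
    if normalize:
        bow_vec, c2v_vec = l2_normalize(bow_vec), l2_normalize(c2v_vec)
    return list(bow_vec) + list(c2v_vec)
```

Per the results above, the normalized variant helped on the equivalency (local similarity) test, while the non-normalized variant was preferable for analogies, so the `normalize` flag is worth treating as a tunable choice rather than a default.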
The study was run at the beginning of the Fall semester, while courses could still be added and dropped and some students were still shopping for courses. We used a within-subjects design whereby each volunteer rated ten course recommendations made by each of the five algorithms. Because of the considerable number of ratings expected ((3*10+2)*5 = 160) and the importance of students carefully considering each recommended course, in-person sessions were decided on over asynchronous remote sessions in order to better encourage on-task behavior throughout the session. Student evaluators were compensated with a $40 gift card to attend one of four sessions offered across three days, with a maximum occupancy of 25 in each session. A total of 70 students participated. (Due to an authentication bug during the fourth session, twenty participating students were not able to access the collaborative recommendations of the fifth algorithm. RNN results in the subsequent section are therefore based on the 50 students from the first three sessions. When paired t-tests are conducted between the RNN and the ratings of other algorithms, the tests are between ratings among these 50 students.)

We began the session by introducing the evaluation motivation as a means for students to help inform the choice of algorithm for a future campus-wide deployment of a course exploration tool. Students started the evaluation by visiting a survey URL that asked them to specify a favorite course they had taken at the University. This favorite course was used by the first four algorithms to produce 10 course recommendations each. Each recommended course's department, course number, title, and full catalog description were displayed to the student in the interface.

Figure 2: User survey page

There was a survey page (Figure 2) for each algorithm in which students were asked to read the recommended course descriptions carefully and then rate each of the ten courses individually on a five point Likert scale of agreement with the following statements: (1) This course was unexpected, (2) I am interested in taking this course, (3) I did not know about this course before. These ratings respectively measured unexpectedness, successfulness, and novelty. After rating the individual courses, students were asked to rate their agreement with the following statements pertaining to the 10 results as a whole: (1) Overall, the course results were diverse, (2) The course results shared something in common with my favorite course. These ratings measured dimensions of diversity and commonality. Lastly, students were asked to provide an optional follow-up open text response to the question, "If you identified something in common with your favorite course, please explain it here." On the last page of the survey, students were asked to specify their major and year, and to give optional open response feedback on their experience. Graduate courses were not included in the recommendations, and the recommendations were not limited to courses available in the current semester.

5.2 Results
Results of average student ratings of the five algorithms across the six measurement categories are shown in Table 1. The diversity based algorithms, denoted by "(div)," all scored higher than the non-diversity (non-div) algorithms in unexpectedness, novelty, diversity, and the primary measure of serendipity. The two non-diversity based algorithms, however, both scored higher than the other three algorithms in successfulness and commonality. All pairwise differences between diversity and non-diversity algorithms were statistically significant at the p < 0.001 level after applying a Bonferroni correction for multiple (60) tests. Within the diversity algorithms, there were no statistically significant differences except for BOW scoring higher than Equivalency (div) on unexpectedness and scoring higher than both Equivalency (div) and Analogy (div) on novelty. Among the two non-diversity algorithms, there were no statistically significant differences except for the RNN scoring higher on diversity and Equivalency (non-div) recommendations scoring higher on novelty. With respect to measures of serendipity, the div and non-div algorithms had similar scores among their respective strengths (3.473-3.619); however, the non-div algorithms scored substantially lower in their weak category of unexpectedness (2.091 & 2.184) than the div algorithms did in their weak category of successfulness (2.851-2.999), resulting in statistically significantly higher serendipity scores for the div algorithms.

Table 1: Average student ratings of recommendations from the five algorithms across the six measurement categories.

algorithm              unexpectedness  successfulness  serendipity  novelty  diversity  commonality
BOW (div)              3.550           2.904           3.227        3.896    4.229      3.229
Analogy (div)          3.473           2.851           3.162        3.310    4.286      2.986
Equivalency (div)      3.297           2.999           3.148        3.323    4.214      3.257
Equivalency (non-div)  2.091           3.619           2.855        2.559    2.457      4.500
RNN (non-div)          2.184           3.566           2.875        1.824    3.160      4.140

The most dramatic difference can be seen in the measure of novelty, where BOW (div) scored 3.896 and the system's existing RNN (non-div) scored 1.824, the lowest rating in the results matrix. The proportion of each rating level given to the two algorithms on this question is shown in Figures 3 and 5. Hypothetically, an algorithm that recommended randomly selected courses would score high in both novelty and unexpectedness, and thus it is critical to also weigh an algorithm's ability to recommend courses that are of interest to students. Figure 4 shows successfulness ratings for each of the algorithms aggregated by rank of the course result. The non-div algorithms, shown with dotted lines, always perform as well as or better than the div algorithms at every rank. The more steeply declining slope of the div algorithms depicts the increasing difficulty of finding courses of interest across different departments. The tension between the ability to recommend courses of interest that are also unexpected is shown in Figure 6, where the best serendipitous model, BOW (div), recommends a top course of higher successfulness than unexpectedness, with the two measures intersecting at rank 2 and diverging afterwards. The best equivalency model, combining course description tf-idf and course2vec (non-div), maintains high successfulness but also maintains low unexpectedness across the 10 course recommendation ranks.

Figure 3: Student novelty rating proportions of course recommendations produced by BOW (div)
Figure 4: Successfulness comparison
Figure 5: Student novelty rating proportions of course recommendations produced by RNN (non-div)
Figure 6: BOW (div) vs. Equivalency (non-div) comparison

Are more senior students less likely to rate courses as novel or unexpected, given they have been at the University longer and been exposed to more courses? Among our sophomore (27), junior (22), and senior (21) level students, there were no statistically significant trends among the six measures, except for a marginally significant trend (p = 0.007, shy of the p < 0.003 threshold given the Bonferroni correction) of more senior students rating recommendations as less unexpected (avg = 2.921) than juniors (avg = 3.024), whose ratings were not statistically separable from sophomores (avg = 3.073).

5.3 Qualitative Characterization of Algorithms
In this section, we attempt to synthesize qualitative characterizations of the different algorithms by looking at the open responses students gave to the question asking them to describe any commonalities they saw between recommendations made by each algorithm and their favorite course.

5.3.1 BOW (div). Several students remarked positively about recommendations matching the themes of "art, philosophy, and society" or "design" exhibited in their favorite course. The word "language" was mentioned by 14 of the 61 respondents answering the open response question. Most of these comments were negative, pointing out the limitations of similarity matching based solely on literal course description matching. The most common critique given in this category was of the foreign spoken language courses that showed up at the lower ranks when students specified a favorite course involving programming languages. Other students remarked at additional dissimilarity when specifying a favorite course related to cyber security, receiving financial security courses in the results.

5.3.2 Analogy (div). The word "interesting" appeared in seven of the 54 comments left by students to describe commonalities among the analogy validation optimized algorithm. This word was not among the top 10 most frequent words in any of the other four algorithms. Several students identified broad themes among the courses that matched to their favorite course, such as "identity" and "societal development." On the other end of the spectrum, one student remarked that the results "felt weird" and were only "vaguely relevant." Another student stated that "the most interesting suggestion was the Introduction to Embedded Systems [course] which is just different enough from my favorite course that it's interesting but not too different that I am not interested," which poignantly articulates the crux of the difficulty in striking a balance between interest and unexpectedness to achieve a serendipitous recommendation.

5.3.3 Equivalency (div). Many students (seven of the 55) remarked positively on the commonality of the results with the themes of data exhibited by their favorite course (in most cases STATS C8, an introductory data science course). They mentioned how the courses all involved "interacting with data in different social, economic, and psychological contexts" and "data analysis with different applications." One student remarked on this algorithm's tendency to match at or around the main topic of the favorite course, further remarking that "they were relevant if looking for a class tangentially related."

5.3.4 Equivalency (non-div). This algorithm was the same as the above, except that it did not limit results to one course per department. […] enrolled in. Due to the institutional data refresh schedule, student current enrollments are not known until after the add/drop deadline. This may be a shortcoming that can be rectified in the future.

6 FEATURE RE-DESIGN
As a result of the feedback received from the user study, we worked with campus to pull down real-time information on student requirement satisfaction from the Academic Plan Review module of the PeopleSoft Student Information System. We re-framed the RNN feature as a "Requirements" satisfying feature that, upon log-in, shows students their personalized list of unsatisfied requirements (Figure 8). After selecting a requirement category to satisfy, the system displays courses which satisfy the selected requirement and are offered in the target semester. The list of courses is sorted by the RNN to represent the probability that students like them will take the class. This provides a signal to the student of what the normative course taking behavior is in the context of requirement satisfaction. For serendipitous suggestions, we created a separate "Explore" tab (Figure 7) using the BOW (div) model to surface the top five courses similar across departments, due to its strong serendipity and novelty ratings. The Equivalency (non-div) model was used to display an additional five most-similar courses within the same department. This model was chosen due to its strong successfulness ratings.

7 DISCUSSION
Surfacing courses that are of interest but not known before means expanding a student's knowledge and understanding of the University's offerings. As students are exposed to courses that veer further from their home department and nexus of interest and understanding, recommendations become less familiar, with descriptions that are harder to connect with. This underscores the difficulty of producing an unexpected but interesting course suggestion, as it often
Because of this lack of department filter, 15 of the 68 students must represent a recommendation of uncommon wisdom in order submitting open text responses to the question of commonality to extend outside of a student’s zone of familiarity surrounding pointed out that the courses returned were all from the same de- their centers of interest. Big data can be a vehicle for, at times, partment. Since this model scored highest on a validation task of reaching that wisdom. Are recommendations useful when they matching to a credit equivalent course pair (almost always in the suggest something expected or already known? Two distinct sets of same department), it is not surprising that students observed that responses to this question emerged from student answers to the last results from this algorithm tended to all come from the department open ended feedback question. One representative remark stated, of the favorite course, which also put it close to their nexus of "The best algorithms were the ones that had more interest. diverse options, while still staying true to the core 5.3.5 RNN (non-div). The RNN scored lowest in novelty, signifi- function of the class I was searching. The algo- cantly lower than the other non-div algorithm, and was not signifi- rithms that returned classes that were my major cantly different from the other non-div algorithm in successfulness. requirements/in the same department weren’t as In this case, what is the possible utility of the collaborative-based helpful because I already knew of their existence RNN over the non-div Equivalency model? 
Many of the 47 (of 50) as electives I could be taking" student answers to the open response commonality question re- marked that the recommendations related to their major (mentioned While a different representative view was expressed with, by 21 students) and contained courses that fulfilled a requirement "I think the fifth algorithm [RNN] was the best fit (mentioned by seven) as the distinguishing signature of this algo- for me because my major is pretty standardized" rithm. Since the RNN is based on normative next course enrollment behavior, it is reasonable that it suggested many courses that satisfy These two comments make a case for both capabilities being of an unmet requirement. This algorithm’s ability to predict student importance. They are also a reminder of the desire among young enrollments accurately became a detriment to some, as seven re- adults for the socio-technical systems of the university to offer a marked that it was recommending courses that they were currently balance of information, exploration and, at times, guidance. IntRS Workshop, September 2019, Copenhagen, DK Pardos and Jiang Figure 8: The “Requirements" Interface utility of the neural embedding is that students had to rely on the course description semantics in order to familiarize themselves with the suggested course and determine if they were interested in taking it. If a concept was detected by the neural embedding but not the BOW, this likely meant that the concept was difficult to pick-up from the course description displayed to students. Past work has shown that users evaluate media recommendations less favorably before they take the recommendation than after when im- portant aspects of the recommended content is not described in the recommendation [11]. Future work could augment recommended course descriptions with additional information, including latent semantics inferred from enrollments [5] or additional semantics retrieved from available course syllabi. 
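As a side note for practitioners, the family-wise error handling reported in the results can be sketched as a simple post-hoc threshold adjustment. The snippet below is a minimal illustration, not the study's analysis code, and the raw p-values in it are hypothetical:

```python
# Minimal sketch of a Bonferroni correction over a family of pairwise
# tests, as applied to the 60 comparisons reported in the results.
# The raw p-values below are hypothetical, not the study's data.

def bonferroni_flags(p_values, alpha=0.05):
    """Return (flags, threshold): which raw p-values remain significant
    once the family-wise alpha is divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values], threshold

raw_p = [0.0001, 0.0005, 0.002, 0.0007, 0.04] + [0.5] * 55  # 60 tests
flags, threshold = bonferroni_flags(raw_p)
print(f"corrected threshold: {threshold:.6f}")   # 0.000833, stricter than p < 0.001
print(f"tests surviving correction: {sum(flags)}")  # 3
```

With 60 tests the corrected threshold (0.05 / 60 ≈ 0.000833) is slightly stricter than the p < 0.001 level cited above, so a raw p-value must be very small to survive.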
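The department-diversity ("div") constraint used for the Explore tab can be sketched as a post-ranking filter that keeps at most one course per department. This is an illustrative sketch only: the course names, departments, and ordering below are hypothetical, and in the deployed system the candidate list would be ranked by BOW similarity to the favorite course:

```python
# Illustrative sketch (not the production code) of a department-diversity
# filter: walk a similarity-ranked candidate list, keep at most one course
# per department, and return the top k survivors.

def diversify_top_k(ranked_courses, k=5):
    """ranked_courses: list of (course_id, department) tuples sorted by
    descending similarity. Returns up to k course ids, one per department."""
    seen_departments = set()
    picks = []
    for course_id, dept in ranked_courses:
        if dept in seen_departments:
            continue  # skip further courses from an already-represented department
        seen_departments.add(dept)
        picks.append(course_id)
        if len(picks) == k:
            break
    return picks

# Hypothetical similarity-ranked candidates
ranked = [("STAT 133", "STAT"), ("STAT 134", "STAT"), ("CS 189", "CS"),
          ("INFO 190", "INFO"), ("ECON 140", "ECON"), ("STAT 135", "STAT"),
          ("SOC 5", "SOC"), ("PHIL 12A", "PHIL")]
print(diversify_top_k(ranked))
# ['STAT 133', 'CS 189', 'INFO 190', 'ECON 140', 'SOC 5']
```

The non-div variants simply omit this filter, which is why their results cluster in the favorite course's home department.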
Figure 7: The "Explore" Interface

8 LIMITATIONS

The more distal a course description is, even if conceptually similar, the less a student may be able to recognize the commonality with a favorite course. A limitation of our study in demonstrating the utility of the neural embedding is that students had to rely on the course description semantics in order to familiarize themselves with the suggested course and determine if they were interested in taking it. If a concept was detected by the neural embedding but not the BOW, this likely meant that the concept was difficult to pick up from the course description displayed to students. Past work has shown that users evaluate media recommendations less favorably before they take the recommendation than after, when important aspects of the recommended content are not described in the recommendation [11]. Future work could augment recommended course descriptions with additional information, including latent semantics inferred from enrollments [5] or additional semantics retrieved from available course syllabi.

ACKNOWLEDGMENTS

This work was partly supported by the United States National Science Foundation (1547055/1446641) and the National Natural Science Foundation of China (71772101/71490724).

REFERENCES

[1] Sorathan Chaturapruek, Thomas Dee, Ramesh Johari, René Kizilcec, and Mitchell Stevens. 2018. How a data-driven course planning tool affects college students' GPA: evidence from two field experiments. (2018).
[2] Hung-Hsuan Chen. 2018. Behavior2Vec: Generating Distributed Representations of Users' Behaviors on Products for Recommender Systems. ACM Transactions on Knowledge Discovery from Data (TKDD) 12, 4 (2018), 43.
[3] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
[4] Martin Dillon. 1983. Introduction to modern information retrieval: G. Salton and M. McGill. McGraw-Hill, New York (1983). 448 pp., ISBN 0-07-054484-0.
[5] Matt Dong, Run Yu, and Zach A. Pardos. in press. Design and Deployment of a Better Course Search Tool: Inferring latent keywords from enrollment networks. In Proceedings of the 14th European Conference on Technology Enhanced Learning. Springer.
[6] Rosta Farzan and Peter Brusilovsky. 2011. Encouraging user participation in a course recommender system: An impact on user behavior. Computers in Human Behavior 27, 1 (2011), 276–284.
[7] Li Fei-Fei, Rob Fergus, and Pietro Perona. 2006. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 4 (2006), 594–611.
[8] Balázs Hidasi, Massimo Quadrana, Alexandros Karatzoglou, and Domonkos Tikk. 2016. Parallel recurrent neural network architectures for feature-rich session-based recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 241–248.
[9] Weijie Jiang, Zachary A Pardos, and Qiang Wei. 2019. Goal-based course recommendation. In Proceedings of the 9th International Conference on Learning Analytics & Knowledge. ACM, 36–45.
[10] Judy Kay. 2000. Stereotypes, student models and scrutability. In International Conference on Intelligent Tutoring Systems. Springer, 19–30.
[11] Benedikt Loepp, Tim Donkers, Timm Kleemann, and Jürgen Ziegler. 2018. Impact of item consumption on assessment of recommendations in user studies. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 49–53.
[12] Tien T Nguyen, Pik-Mai Hui, F Maxwell Harper, Loren Terveen, and Joseph A Konstan. 2014. Exploring the filter bubble: the effect of using recommender systems on content diversity. In Proceedings of the 23rd International Conference on World Wide Web. ACM, 677–686.
[13] Aditya Parameswaran, Petros Venetis, and Hector Garcia-Molina. 2011. Recommendation systems with complex constraints: A course recommendation perspective. ACM Transactions on Information Systems (TOIS) 29, 4 (2011), 20.
[14] Zachary A Pardos, Zihao Fan, and Weijie Jiang. 2019. Connectionist recommendation in the wild: on the utility and scrutability of neural networks for personalized course guidance. User Modeling and User-Adapted Interaction 29, 2 (2019), 487–525.
[15] Zachary A Pardos and Weijie Jiang. 2019. Combating the Filter Bubble: Designing for Serendipity in a University Course Recommendation System. arXiv preprint arXiv:1907.01591 (2019).
[16] Zachary A Pardos and Andrew Joo Hun Nam. 2018. A Map of Knowledge. CoRR preprint, abs/1811.07974 (2018). https://arxiv.org/abs/1811.07974
[17] Agoritsa Polyzou, Athanasios N. Nikolakopoulos, and George Karypis. 2019. Scholars Walk: A Markov Chain Framework for Course Recommendation. In Proceedings of the 12th International Conference on Educational Data Mining. 396–401.
[18] Zhiyun Ren, Xia Ning, Andrew S Lan, and Huzefa Rangwala. 2019. Grade Prediction Based on Cumulative Knowledge and Co-taken Courses. In Proceedings of the 12th International Conference on Educational Data Mining. 158–167.
[19] Leonardo FR Ribeiro, Pedro HP Saverese, and Daniel R Figueiredo. 2017. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 385–394.
[20] Guy Shani and Asela Gunawardana. 2011. Evaluating recommendation systems. In Recommender Systems Handbook. Springer, 257–297.
[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[22] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems. 3630–3638.
[23] Yuan Cao Zhang, Diarmuid Ó Séaghdha, Daniele Quercia, and Tamas Jambor. 2012. Auralist: introducing serendipity into music recommendation. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining. ACM, 13–22.