=Paper=
{{Paper
|id=Vol-2482/paper52
|storemode=property
|title=Neural Educational Recommendation Engine (NERE)
|pdfUrl=https://ceur-ws.org/Vol-2482/paper52.pdf
|volume=Vol-2482
|authors=Moin Nadeem,Dustin Stansbury,Shane Mooney
|dblpUrl=https://dblp.org/rec/conf/cikm/NadeemSM18
}}
==Neural Educational Recommendation Engine (NERE)==
Neural Educational Recommendation Engine (NERE)

Moin Nadeem, Dustin Stansbury, Shane Mooney
Quizlet, Inc, 501 2nd St, San Francisco, CA
moin.nadeem@quizlet.com, dustin@quizlet.com, shane@quizlet.com

ABSTRACT
Quizlet is the most popular online learning tool in the United States, and is used by over 2/3 of high school students and 1/2 of college students. With more than 95% of Quizlet users reporting improved grades as a result, the platform has become the de-facto tool used in millions of classrooms.
In this paper, we explore the task of recommending suitable content for a student to study, given their prior interests as well as what their peers are studying. We propose a novel approach, the Neural Educational Recommendation Engine (NERE), to recommend educational content by leveraging student behaviors rather than ratings. We have found that this approach better captures social factors that are more aligned with learning.
NERE is based on a recurrent neural network that includes collaborative and content-based approaches for recommendation, and takes into account any particular student's speed, mastery, and experience to recommend the appropriate task. We train NERE by jointly learning the user embeddings and content embeddings, and attempt to predict the content embedding for the final timestep. We also develop a confidence estimator for our neural network, which is a crucial requirement for productionizing this model.
We apply NERE to Quizlet's proprietary dataset, and present our results. We achieve an R2 score of 0.81 in the content embedding space, and a recall score of 54% on our 100 nearest neighbors. This vastly exceeds the recall@100 score of 12% that a standard matrix-factorization approach provides. We conclude with a discussion of how NERE will be deployed, and position our work as one of the first educational recommender systems for the K-12 space.

Keywords
Recommender Systems, Deep Learning, Education, Quizlet, Recurrent Neural Networks, Attention

Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. INTRODUCTION
Founded in 2005, and used by more than 2/3 of high school students, Quizlet, Inc. is the fastest-growing educational website in the United States [7]. The interactive platform permits students to learn any given "set", or collection of terms and definitions, in a variety of ways. However, with over 30 million monthly active users and 250 million study sets, it has become nearly impossible for users to sift through all of the available content. This motivates the need for a system that will adapt to a user's preferences and make recommendations on what they should study next, given their prior history.
This is motivated not only from a product perspective, but also by the rise of personalized learning. As a result of the rise of personalization in e-commerce [10], social media [4], and dating [1], many in education and research have grown curious about the implications personalized learning may have for students.
Personalized learning can be defined as any functionality which enables a system to uniquely address each individual learner's needs and characteristics. This includes, but isn't limited to, prior knowledge, rate of learning, interests, and preferences. This provides the ability to ensure that each user's experience is optimized for their unique needs, and may save them time that would otherwise be wasted.
For an example that is applicable to Quizlet, one user may prefer to study content suitable for Spell Mode (where students practice spelling by typing the spoken word). Our algorithm would take that into account by biasing recommendations towards sets that are commonly studied in Spell Mode. Similarly, we may expect our algorithm to take user performance into account, and continue to recommend topics that the user hasn't quite mastered yet.
The main contribution of this paper is a deep-learning-based system that provides personalized recommendations to Quizlet users, answering the question "What should I study next?".
The rest of this paper is structured as follows: a summary of previous literature on (educational) recommender systems is provided in Section 2.
Section 3 provides an overview of our system architecture, model architecture, and dataset construction. We continue with a qualitative and quantitative assessment of our system in Section 4. Finally, we conclude our paper and provide directions for future work in Section 5.

2. BACKGROUND
Recommender systems are a widely studied field, with contributions from major players such as Netflix [6], Google [4], and Amazon [10]. The vast majority of these methods use matrix factorization techniques to decompose a user-preferences matrix and an item-ratings matrix into a latent space that represents how a user may rate a new item; this latent space is commonly derived from an Alternating Least Squares (ALS) algorithm.
However, we believe that matrix factorization approaches aren't well suited for educational applications. To begin, the user-set matrix is extremely sparse, which makes standard matrix-factorization-based methods infeasible. These methods are also ill suited to material that is sequenced with temporal dependencies, as is usually the case for educational material.
Instead, we attempt to make the problem computationally tractable with recurrent neural networks and set vectorization, which are able to learn temporal dependencies and a dense representation of our data, respectively. The rest of this section serves to summarize the current state of deep neural networks with respect to both recommender systems and Technology Enabled Learning (TEL). We rely heavily upon previous contributions from the intersection of the two fields: Recommender Systems for Technology Enabled Learning (RecSysTEL).

2.1 Literature Review
Most recently, Tang & Pardos [17] are the only other researchers in the RecSysTEL field who have explored the use of Recurrent Neural Networks (RNNs) for the purposes of personalization in learning. Their work leveraged RNNs to model navigational behaviors throughout Massively Open Online Courses (MOOCs). This research was conducted with the explicit intention of accelerating or decelerating learning as a result of performance in a given subject; the benefit to the user is a reduction in learning time and/or increased performance.
We believe that this work is quite notable due to the level of detail included in the model: interactions as fine-grained as video pauses and changes in video speed are included in the model as a proxy for mastery. However, Tang & Pardos' algorithm was purely collaborative, and never leveraged the content of the MOOC(s) studied. We believe that this is an underexplored area in RecSysTEL, and aim for this to be a major contribution of our work.
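The ALS-style matrix factorization critiqued at the start of this section can be illustrated with a minimal sketch. This is not the paper's baseline implementation; it is a toy NumPy version with made-up data and our own variable names, shown only to make the decomposition concrete:

```python
import numpy as np

# Minimal Alternating Least Squares (ALS) sketch: factor a small
# user-item matrix R (zeros = unobserved) into U @ V.T with k latent factors.
rng = np.random.default_rng(0)
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [0.0, 2.0, 5.0]])
mask = R > 0            # observed entries only
k, lam = 2, 0.1         # latent dimension, L2 regularization
U = rng.normal(size=(3, k))
V = rng.normal(size=(3, k))

for _ in range(50):
    # Fix V, solve a regularized least-squares problem per user
    for u in range(3):
        obs = mask[u]
        A = V[obs].T @ V[obs] + lam * np.eye(k)
        U[u] = np.linalg.solve(A, V[obs].T @ R[u, obs])
    # Fix U, solve per item
    for i in range(3):
        obs = mask[:, i]
        A = U[obs].T @ U[obs] + lam * np.eye(k)
        V[i] = np.linalg.solve(A, U[obs].T @ R[obs, i])

pred = U @ V.T  # predicted ratings, including previously unobserved cells
```

The filled-in cells of `pred` are the recommendations such a baseline would rank; the sparsity objection above is that in the user-set setting almost every cell of R is unobserved.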
Outside the field of education, Covington, Adams, and Sargin [4] at YouTube have developed the first recommendation system used in an industry setting that leverages deep neural networks.
Covington et al.'s paper is interesting for two reasons. First, it demonstrates a successful use of a neural recommendation system at scale, mitigating concerns about scaling such a system in production. Secondly, videos are quite analogous to Quizlet sets: both represent ways to learn about topics, and may be episodic in nature. To provide an example, if a user watched "Full House Episode 1" on YouTube, a good recommendation would be "Full House Episode 2". Likewise, a good recommendation for a user who studied "Hamlet Chapter 1" would be "Hamlet Chapter 2". In order to generate recommendations such as these, Covington et al. added search tokens as a feature to their network.
In order to deal with the vast swaths of YouTube videos, Covington et al. split their network into two sub-networks. One network served to filter a large corpus of videos down to those which the user may be interested in, and the second network (with access to many more features than the first) served to rank these candidates. Finally, their algorithm was both content-based and collaborative, demonstrating the viability of a hybrid approach.
However, one major drawback of their method is the level of compute that Google provides Covington et al. This creates a challenge for us in creating a neural recommendation system while remaining within realistic computational resources.

3. METHODS
In this section, we provide an overview of how we constructed our dataset, what our production system architecture will be, and how NERE is architected in detail.

3.1 Dataset Construction
In order to train NERE, Quizlet, Inc. assembled a proprietary dataset. Internally, we use Google BigQuery [14] for all of our data warehousing needs. From BigQuery, we assembled two datasets from our activity logs: one which detailed our users and their respective metadata, and a second which detailed all sets studied by these users, and their respective metadata.

The users dataset contained the following fields (Table 1):
• User ID — Uniquely maps a row to a user.
• Study Date — Biases the model to recommend newer content.
• Obfuscated IP Address — Geo lookup to derive latitude and longitude for locality.
• Preferred Term Lang — Most common language to study terms in.
• Preferred Def Lang — Most common language to study definitions in.
• Preferred Platform — Most common platform (Web, iOS, etc.) to study on.
• Beginning Timestamp — Timestamp for when the study session started.
• Ending Timestamp — Timestamp for when the study session ended.
• Set ID — The set they studied during their session.
• Session Length — The number of minutes that their study session lasted.

Table 1: Information about all of our users and their metadata.

The sets dataset contained the following fields (Table 2):
• Set ID — Uniquely maps each set to a row.
• Terms — All terms in a set, as a space-delimited string.
• Definitions — All definitions in a set, as a space-delimited string.
• Studier Count — Number of unique users that have studied this set.
• Broad Subject — A high-level subject classification of the set.
• Mean Studier Age — The average age of the users who study the set.
• Term Language — The language that terms are in.
• Definition Language — The language that definitions are in.
• Total Views — The total number of views that this set has received.
• Has Images — Indicates whether this set contains images.
• Has Diagrams — Indicates whether this set contains diagrams.
• Preferred Study Mode — The most common study mode used with this set.
• Preferred Platform — The most common platform (Web, iOS, etc.) used.
• Mean Session Length — The average session length for this set, in minutes.

Table 2: Information about all of the sets and their metadata.

Once the datasets were assembled, we began cleaning the data. Since user privacy is central to Quizlet's values, we removed all users below the age of thirteen, and obfuscated Internet Protocol (IP) addresses by dropping the last octet. We believe that this is an important step towards preserving anonymity while still preserving quality recommendations.
All categorical variables, such as term language, were mapped to integers. All continuous variables were scaled between zero and one (with unit variance) to ensure smooth gradients. We replaced any missing continuous values with the mean of the dataset. Lastly, we mapped all IP addresses to their respective latitude and longitude, with the intuition that students in close proximity may be studying similar sets.
Finally, a preliminary test of NERE with this dataset found it difficult to model students who were studying for multiple classes on Quizlet. Intuitively, this makes sense, as the recurrent neural network is looking for temporal relations in places where these relations were murky at best. We solve this by separating sequences by their Broad Subject column.¹ In practice, this was done by concatenating each User ID with the subject they studied, ensuring each row is unique in both user and subject classification. After cleaning, we were left with 1,616,004 unique user-subject combinations to be fed into our model.
To vectorize our Terms and Definitions, we took the space-delimited strings and removed stopwords and non-ASCII characters. Next, we tokenized them and trained 128-dimensional GloVe embeddings, which effectively creates an implementation of Set2Vec [12]. These embeddings were concatenated along with the preprocessed set metadata to create our set vectors.
Finally, we transformed our dataset into a timeseries format by concatenating all user study sessions into a single axis and sorting by ending timestamp. We chose a session length of 5 timesteps, since 90% of our users have at least five sessions. The dimensions of the resultant datasets are as follows:

• User Metadata: (1616004, 5, 13)
• Set Metadata: (1616004, 5, 12)
• Set Content Vectors: (1616004, 5, 128)

¹The Broad Subject field was of the following enumerated type: Theology, History, Uncommon Languages, Communications, Formal Sciences, Visual Arts, Social Sciences, Applied Sciences, Vocabulary, German, Performing Arts, Sports, French, Reading Vocabulary, Spanish, Natural Sciences, and Geography.

3.2 System Architecture
For deployment purposes, we have the following system architecture. Quizlet uses Apache Airflow [16], the industry standard for Extract-Transform-Load (ETL) pipelines, to schedule jobs. Every week, Apache Airflow reads datasets from BigQuery. Within Airflow, this dataset is preprocessed and sent to TensorFlow. TensorFlow predicts which sets the user should study next, and sends the embedding back to Airflow. Airflow maps the vectors to sets by determining the N nearest neighbors of this embedding, and subsequently caches these recommendations to Spanner. Finally, our web server reads these recommendations from Spanner when serving content. Figure 1 depicts this flow visually.

Figure 1: This figure depicts how our model is used to serve recommendations in production.

Our web server reads from this cache when serving user content. Since the model takes 2 ms to predict on each user with a CPU, we have opted to use a CPU-backed instance rather than a GPU-backed instance due to infrastructure cost.

3.3 Algorithm
In this subsection, we first introduce a formalization of our set-based recommendation task. Then, we describe our proposed NERE model architecture in detail.
Session-based recommendation is the task of predicting what a user would like to study next when their previous history and metadata are provided. We let X = [s_1, s_2, s_3, ..., s_{n-1}, s_n] be a study session, where s_i ∈ S (1 ≤ i ≤ n), n is the input length, and S represents the pool of study sessions. We learn a function f_Ŵ(·) such that for any given set of n prefixes, we get an output Y = f_Ŵ(X).
Since our recommender will need to predict several states [s^0_{n+1}, s^1_{n+1}, ..., s^m_{n+1}] for the (n+1)-th timestep, where m is the number of recommendations desired, we must be able to derive several Quizlet sets from Y. We let Y be a 128-dimensional vector that represents the content of a Quizlet set, and perform NNDescent [5], a fast, approximate m-nearest-neighbors search algorithm, on Y. We find that this provides an efficient manner to recommend multiple sets while maintaining a dense representation for the model to learn.
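The embedding-to-sets step described above can be sketched as follows. This is a minimal stand-in that uses exact brute-force Euclidean search in NumPy instead of the approximate NNDescent index, with made-up data and our own variable names:

```python
import numpy as np

# Catalog of 128-dimensional set content vectors (one row per set), and a
# model output Y that should land near the content vector of the next set.
rng = np.random.default_rng(42)
set_vectors = rng.normal(size=(1000, 128))          # stand-in for the set catalog
Y = set_vectors[17] + 0.01 * rng.normal(size=128)   # prediction close to set 17

def recommend(Y, set_vectors, m=5):
    """Return the indices of the m sets whose content vectors are closest
    to the predicted embedding Y (exact search; NNDescent approximates
    this for large catalogs)."""
    dists = np.linalg.norm(set_vectors - Y, axis=1)
    return np.argsort(dists)[:m]

recs = recommend(Y, set_vectors, m=5)
```

At production scale, an approximate index trades a small amount of recall for a large reduction in query time, which is what makes the weekly batch job above tractable.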
3.4 Model Architecture
Our model consists of 56 layers, 22 of which are inputs to the model. Figure 2 depicts a portion of our model architecture. In our architecture, we employ quite a few non-standard layers popular in Natural Language Processing; the remainder of this subsection explains these layers.

Figure 2: This figure provides a slice of our model architecture; some inputs have been excluded for brevity.

3.4.1 Embedding Layer
In order to provide a dense representation for our categorical variables, we trained an embedding matrix [11]. Each categorical variable C_i ∈ C, where C is the set of categorical variables, was mapped to a 32-dimensional representation. This was done with the explicit intention that the model may learn a spatial relation for some of these variables. Each category c_j ∈ C_i (1 ≤ j ≤ |C_i|) is learned using the following lookup table:

LT_{W^i}(j) = W^i_j    (1)

where W^i ∈ R^{32×|C_i|}, |C_i| represents the number of categories in C_i, and W^i_j is the j-th column of matrix W^i, representing the 32-dimensional vector corresponding to category c_j. It is important to note that the entirety of this matrix is randomly initialized, and the vectors are learned jointly through backpropagation.

3.4.2 Bidirectional Layers
Bidirectional layers [15] are commonly utilized to help models learn sequences. The intuition behind bidirectional layers is that they help recurrent layers learn sequences by making the context more explicit. A bidirectional layer splits a recurrent layer into a part that is responsible for learning the input normally, and another part that is responsible for learning the input backwards; this helps the model understand what may happen in the future. Formally, given some study sequence x_1, x_2, x_3, ..., x_{n-1}, x_n, it would feed [(x_1, x_n), (x_2, x_{n-1}), ..., (x_n, x_1)] as the input. At first sight, one would believe that this leaks information; however, humans do precisely the same by inferring future states from previous experience.

3.4.3 Attention With Context
Based on the work of Yang et al., Attention With Context is a mechanism that helps the model learn which features are important and which ones may be discarded. As the name may imply, it helps the model pay attention.
Formally, we add a new layer that performs the following operations. We assume that i is the i-th timestep in our input, and t is the t-th element of vector i. Lastly, h_it is the output for the t-th element of the i-th timestep in the layer that precedes our attention layer. The following equations describe the operations of the attention layer:

u_it = tanh(W_w h_it + b_w)    (2)
α_it = exp(u_itᵀ u_w) / Σ_{t'} exp(u_{it'}ᵀ u_w)    (3)
s_i = Σ_t α_it h_it    (4)

where u_w is a learned feature-level attention vector, W_w are the weights of the attention layer, and α_it is the weight given to the t-th element of the i-th vector. Intuitively, this implementation makes a lot of sense: the model is computing how important each feature in each timestep is against all other features in the same timestep, and re-weighting the input accordingly. All weights in this layer are randomly initialized and jointly learned throughout the training process.

3.4.4 Miscellaneous Features
While most other works have used Long Short-Term Memory (LSTM) [8] cells for their recurrent unit, we chose to use Gated Recurrent Unit (GRU) [2] cells. As Chung et al. show in [3], GRU cells are commonly more practical for short sequences due to not having an internal memory. We saw a noticeable speed-up of more than 20% when using a GRU cell over an LSTM.
In order for these models (over 5,994,444 learnable parameters) to generalize, we had to apply some strict regularization. We applied 50% dropout on layers following a recurrent cell, and applied 0.001 L2 regularization on the recurrent kernel itself. Furthermore, we used batch normalization to ensure that our inputs are zero-centered with normalized variance. Following the results of Santurkar et al. [13], we also noticed faster training times as a result of these smoother gradients.
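The attention mechanism of Eqs. (2)-(4) above can be sketched in NumPy, following Yang et al.'s formulation with the per-timestep features laid out as rows of a matrix H. The weights here are randomly initialized stand-ins (in the real model they are learned jointly by backpropagation), and the dimension names are ours:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_with_context(H, W_w, b_w, u_w):
    """H: (T, d) outputs of the preceding layer for one timestep.
    Returns the attention-weighted summary s and the weights alpha."""
    U = np.tanh(H @ W_w + b_w)   # Eq. (2), applied row-wise
    scores = U @ u_w             # u_it^T u_w for each element t
    alpha = softmax(scores)      # Eq. (3): normalize over t
    s = alpha @ H                # Eq. (4): weighted sum of the rows of H
    return s, alpha

rng = np.random.default_rng(0)
T, d = 5, 16                     # elements per timestep, feature dimension
H = rng.normal(size=(T, d))
W_w = rng.normal(size=(d, d))    # attention weights (learned in practice)
b_w = np.zeros(d)                # attention bias
u_w = rng.normal(size=(d,))      # feature-level context vector
s, alpha = attention_with_context(H, W_w, b_w, u_w)
```

Because `alpha` is a proper distribution over the elements of the timestep, it can be read off directly for the kind of attention visualization discussed in Section 4.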
4. RESULTS
In this section, we evaluate NERE from a qualitative and quantitative perspective. We compare our model against a baseline matrix factorization approach, and analyze several variations of the model for the purposes of introspection.
Table 3 shows the qualitative results of our recommendation system. The Studied column shows the set that the user studied, while the Recommendation column shows the set that was recommended for the user to study. For this particular recommendation, our system understands that a student had been learning about discussing time (in terms of days of the week) in French, and recommended a corresponding set about months of the year. This shows that the model understands that the user is learning about temporal relations. On a higher level, this demonstrates a level of understanding of both the content that a user desires to learn and the difficulty at which they desire to learn it.

Table 3: Results of our recommendation system.
Studied (Term — Definition): lundi — Monday; mardi — Tuesday; mercredi — Wednesday; jeudi — Thursday; vendredi — Friday; samedi — Saturday; dimanche — Sunday; un an — a year; une année — a year; après — after; avant — before; après-demain — the day after tomorrow; un après-midi — an afternoon; aujourd'hui — today; demain — tomorrow; demain matin — tomorrow morning; demain après-midi — tomorrow afternoon; demain soir — tomorrow night; hier — yesterday.
Recommendation (Term — Definition): au printemps — spring; en été — summer; Les mois — the months; Janvier — January; Février — February; Mars — March; Avril — April; Mai — May; Juin — June; Juillet — July; Août — August; Septembre — September; Octobre — October; Novembre — November; Décembre — December; Quand — When; Où — Where; Comment — How; Avec qui — With whom.

We use two proxies to assess model accuracy: recall@100 and R². To compute recall@100, we take the 100 nearest neighbors of our output embedding, and check whether the set that the learner studied at timestep T_{n+1} is among those 100 nearest neighbors. If it is, we mark that recommendation as correct; otherwise, it is incorrect. We use the 100 nearest neighbors due to the density of our embedding space, as well as the fact that many of the sets in our embedding space are near-duplicates due to a lack of canonicalization. We use R² to assess whether the predictions in the embedding space match the actual distribution; this serves as a sanity check to ensure that our model's output distribution is correlated with the expected distribution.

4.0.1 Comparison Against Matrix Factorization
We compare the performance of NERE against that of TensorRec [9], a library written by James Kirk that uses the TensorFlow API. TensorRec accepts a user matrix, an item matrix, and an interactions matrix as inputs, and produces a predictions matrix as output. For the user matrix, we provide the same user metadata matrix that NERE is given. We concatenate the set vectors and set metadata to form the item matrix. Lastly, we create an interactions matrix of dimensions (|USERS|, |SETS|), where entry (i, j) = 1 if user i studied set j.
We trained TensorRec on this dataset, and it obtained a recall@100 of 0.12 after convergence. We believe this validates our belief in a core difference between a matrix factorization approach and our approach: even after extensive customization, an approach based on temporal data is much more likely to provide quality recommendations for educational content.

4.0.2 Input Sequence Length
Our NERE model is based on the assumption that a user is purposefully selecting sets to study that are topically related to a greater theme. This permits us to also believe that the sets are temporally related, and therefore enables us to use a recurrent neural network.
Figure 3 validates this assumption by comparing model performance against the input sequence length. We see that the R² score slowly converges, but that the recall@100 metric steadily increases until our fourth input sequence. This implies that there may be performance advantages to be obtained by increasing the length of the input sequence past four. However, since we begin to lose a significant number of users in our dataset if we extend beyond five timesteps, we risk creating a model that will not generalize to our entire userbase. As a result, we believe that five timesteps is a good balance between desired accuracy and generalizability.

Figure 3: This figure visualizes how the length of the input may affect model performance.

4.0.3 Where's the Attention?
One popular use of attention in deep neural networks is to visualize the model's understanding of the input. Figure 4 visualizes how the model pays attention to the input, as well as how it learns the attention vector over time. Brighter rectangles indicate that more attention is being placed on those blocks.

Figure 4: This figure visualizes the model's internal attention vector.

These results give incredible insight into the decision process of the model. We can see that at the beginning of the input, the model focuses on the metadata; aspects such as term and definition language are deemed incredibly important. However, as time goes on, the attention shifts from set and user metadata towards content-based features. The attention in the very last timestep shifts towards the content, which aligns with our expectations.

4.0.4 A Purely Content/Collaborative Approach
Next, we try to understand how important our features are to the model. We train and test two variations, with and without the 128-dimensional content vectors, to see how important a content-based approach is for NERE. The impacts of these variations are demonstrated in Table 4.

Table 4: The importance of our content vectors.
            Both    Content   Metadata
R²          0.81    0.78      0.55
Recall@100  0.54    0.38      0.001

This shows that a hybrid (both collaborative and content-based) approach is clearly superior to either one independently. It is important to notice that a content-based approach will obtain a high R² score, since it is easy for the model to learn the underlying distribution, but will not recommend the appropriate set. This demonstrates the importance of the various collaborative features that we explicitly include.
For example, the nearest neighbor for a set whose term and definition languages are in Spanish is actually a set whose term and definition languages are in German. However, the model will continue to recommend sets with term and definition languages in German, since it has learned this from the user's prior history. This speaks to the importance of collaborative features in NERE.
On the whole, we have shown that NERE provides quality recommendations with which we can provide a deeply personalized experience for learning, and believe these results exceed expectations for our application.
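The recall@100 evaluation described above can be sketched as follows. This is an illustrative NumPy version with exact nearest-neighbor search and synthetic data; the function and variable names are ours:

```python
import numpy as np

def recall_at_k(pred_embeddings, true_set_ids, set_vectors, k=100):
    """Fraction of predictions whose true next set appears among the k
    nearest set vectors to the predicted embedding (exact search)."""
    hits = 0
    for pred, true_id in zip(pred_embeddings, true_set_ids):
        dists = np.linalg.norm(set_vectors - pred, axis=1)
        top_k = np.argsort(dists)[:k]
        hits += int(true_id in top_k)
    return hits / len(true_set_ids)

rng = np.random.default_rng(1)
set_vectors = rng.normal(size=(500, 128))        # synthetic set catalog
true_ids = np.arange(10)                         # the sets actually studied next
# Predicted embeddings placed near the true sets' content vectors
preds = set_vectors[true_ids] + 0.05 * rng.normal(size=(10, 128))
score = recall_at_k(preds, true_ids, set_vectors, k=100)
```

A marking is counted correct whenever the studied set falls anywhere in the top k, which is why the metric is forgiving of the near-duplicate sets mentioned above.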
5. CONCLUSION & FUTURE WORK
In this work, we have proposed the Neural Educational Recommendation Engine (NERE) to address the problem of personalized sequential recommendation in the Technology Enabled Learning (TEL) domain. By leveraging both content-based and collaborative features, our model can capture temporal trends in a user's history, and provide recommendations as to what they should learn next. By incorporating features such as attention and bidirectionality into our model, we were able to achieve a state-of-the-art recall@100 score of 0.54. Moreover, we have performed an analysis of our model and have shown that it outperforms both a standalone content-based and a standalone collaborative approach. Lastly, we have shown that our model is learning from both the user and set metadata, in addition to content, by visualizing the attention mechanism.
As for future work, we believe there is significant work left to be done in ranking the suggestions; there are significantly better ways to choose sets from a candidate pool than to recommend the N closest neighbors. Furthermore, we believe that an attempt at canonicalizing similar sets would increase the recall@100 metric, and should be explored.

6. ACKNOWLEDGEMENTS
First and foremost, I would like to thank my mentors Dustin Stansbury and Shane Mooney for the exceptional support and mentorship throughout this project. Both of them were supportive, answered my many questions, and were quite open to letting me explore. Shane, thank you for providing much-needed practical wisdom, for reviewing countless pull requests, and for providing much-needed commentary on this paper. Dustin, thank you for the incredible knowledge about all things machine learning. This project wouldn't have been possible without you two.
I would also like to acknowledge Alex Pinchuk and Shaun Mitschrich for providing endless platform support throughout this project, including honoring my numerous requests for more compute. Lastly, I would like to acknowledge the fabulous Quizlet team who provided incredible companionship throughout this summer, as well as my parents for supporting me throughout this process. Keep on learning!

7. REFERENCES
[1] L. Brozovsky and V. Petricek. Recommender System for Online Dating Service. 2007.
[2] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. 2014.
[3] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. pages 1–9, 2014.
[4] P. Covington, J. Adams, and E. Sargin. Deep Neural Networks for YouTube Recommendations. Proceedings of the 10th ACM Conference on Recommender Systems - RecSys '16, pages 191–198, 2016.
[5] W. Dong, C. Moses, and K. Li. Efficient k-nearest neighbor graph construction for generic similarity measures. Proceedings of the 20th International Conference on World Wide Web - WWW '11, page 577, 2011.
[6] C. A. Gomez-Uribe and N. Hunt. The Netflix Recommender System. ACM Transactions on Management Information Systems, 6(4):1–19, 2015.
[7] Hillá Meller. SimilarWeb Digital Visionary Awards: 2015, 2015.
[8] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
[9] J. Kirk. TensorRec: A Recommendation Engine Framework in TensorFlow, 2017.
[10] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, 2003.
[11] D. López-Sánchez, J. R. Herrero, A. G. Arrieta, and J. M. Corchado. Hybridizing metric learning and case-based reasoning for adaptable clickbait detection, 2017.
[12] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[13] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift). 2018.
[14] K. Sato. An Inside Look at Google BigQuery. White Paper, Google Inc, 2012.
[15] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997.
[16] D. P. Takamori. Apache Airflow, 2016.
[17] S. Tang and Z. A. Pardos. Personalized Behavior Recommendation. Adjunct Publication of the 25th Conference on User Modeling, Adaptation and Personalization - UMAP '17, (July):165–170, 2017.