       Neural Educational Recommendation Engine (NERE)

                 Moin Nadeem, Dustin Stansbury, Shane Mooney
                    Quizlet, Inc, 501 2nd St, San Francisco, CA
     moin.nadeem@quizlet.com, dustin@quizlet.com, shane@quizlet.com


ABSTRACT
Quizlet is the most popular online learning tool in the United States, and is used by over 2/3 of high school students and 1/2 of college students. With more than 95% of Quizlet users reporting improved grades as a result, the platform has become the de-facto tool used in millions of classrooms.

In this paper, we explore the task of recommending suitable content for a student to study, given their prior interests as well as what their peers are studying. We propose a novel approach, the Neural Educational Recommendation Engine (NERE), to recommend educational content by leveraging student behaviors rather than ratings. We have found that this approach better captures social factors that are more aligned with learning.

NERE is based on a recurrent neural network that includes collaborative and content-based approaches for recommendation, and takes into account any particular student's speed, mastery, and experience to recommend the appropriate task. We train NERE by jointly learning the user embeddings and content embeddings, and attempt to predict the content embedding for the final timestep. We also develop a confidence estimator for our neural network, which is a crucial requirement for productionizing this model.

We apply NERE to Quizlet's proprietary dataset and present our results. We achieved an R^2 score of 0.81 in the content embedding space, and a recall score of 54% on our 100 nearest neighbors. This vastly exceeds the recall@100 score of 12% that a standard matrix-factorization approach provides. We conclude with a discussion of how NERE will be deployed, and position our work as one of the first educational recommender systems for the K-12 space.

Keywords
Recommender Systems, Deep Learning, Education, Quizlet, Recurrent Neural Networks, Attention

1. INTRODUCTION
Founded in 2005, and used by more than 2/3 of high school students, Quizlet, Inc. is the fastest-growing educational website in the United States [7]. The interactive platform permits students to learn any given "set", or collection of terms and definitions, in a variety of ways. However, with over 30 million monthly active users and 250 million study sets, it has become nearly impossible for users to sift through all of the available content. This motivates a need for a system that will adapt to a user's preferences and make recommendations on what they should study next, given their prior history.

This is motivated not only from a product perspective, but also by the rise of personalized learning. As a result of the rise of personalization in e-commerce [10], social media [4], and dating [1], many in education and research have grown curious about the implications personalized learning may have for students.

Personalized learning can be defined as any functionality which enables a system to uniquely address each individual learner's needs and characteristics. This includes, but isn't limited to, prior knowledge, rate of learning, interests, and preferences. It provides the ability to ensure that each user's experience is optimized for their unique needs, and may save them time that would otherwise be wasted.

For an example that is applicable to Quizlet, one user may prefer to study content suited to Spell Mode (where students practice spelling by typing the spoken word). Our algorithm would take that into account by biasing recommendations towards sets that are commonly studied in Spell Mode. Similarly, we may expect our algorithm to take user performance into account, and continue to recommend topics that the user hasn't quite mastered yet.

The main contribution of this paper is a deep learning based system that provides personalized recommendations to Quizlet users, answering the question "What should I study next?".

The rest of this paper is structured as follows: a summary of previous literature on (educational) recommender systems is provided in Section 2. Section 3 describes our dataset construction, system architecture, and model architecture. We continue with a qualitative and quantitative assessment of our system in Section 4. Finally, we conclude our paper and provide a direction for future work in Section 5.

2. BACKGROUND
Recommender Systems are a widely studied field, with contributions from major players such as Netflix [6], Google [4], and Amazon [10]. The vast majority of these methods use matrix factorization techniques to decompose a user preferences matrix and an item ratings matrix into a latent space that represents how a user may rate a new item; this latent space is commonly derived from an Alternating Least Squares (ALS) algorithm.
However, we believe that matrix factorization approaches aren't well suited for educational applications. To begin, the user-set matrix is extremely sparse, which makes standard matrix factorization methods infeasible. These methods are also ill suited to material that is sequenced with temporal dependencies, as is usually the case for educational material.

Instead, we attempt to make the problem computationally tractable with recurrent neural networks and set vectorization, which learn temporal dependencies and a dense representation of our data, respectively. The rest of this section summarizes the current state of deep neural networks with respect to both recommender systems and Technology Enabled Learning (TEL). We rely heavily upon previous contributions from the intersection of the two fields: Recommender Systems for Technology Enabled Learning (RecSysTEL).

2.1 Literature Review
Most recently, Tang & Pardos [17] are the only other researchers in the RecSysTEL field who have explored the use of Recurrent Neural Networks (RNNs) for the purposes of personalization in learning. Their work leveraged RNNs to model navigational behaviors throughout Massively Open Online Courses (MOOCs). This research was conducted with the explicit intention of accelerating or decelerating learning as a result of performance in a given subject; the benefit to the user is a reduction in learning time and/or increased performance.

We believe that this work is quite notable due to the level of detail included in the model. Interactions as fine-grained as video pauses and changes to video speed are included in the model as a proxy for mastery. However, Tang & Pardos' algorithm was purely collaborative, and never leveraged the content of the MOOCs studied. We believe that this is an underexplored area in RecSysTEL, and aim for this to be a major contribution of our work.

Outside of the field of education, Covington, Adams, and Sargin [4] at YouTube have developed the first recommendation system used in an industry setting that leverages deep neural networks.

Covington et al.'s paper is interesting for two reasons. First, it demonstrates a successful use of a neural recommendation system at scale, thus mitigating any concerns about scaling such a system in production. Secondly, videos are quite analogous to Quizlet sets: both represent ways to learn about topics, and may be episodic in nature.

To provide an example, if a user watched "Full House Episode 1" on YouTube, a good recommendation would be "Full House Episode 2". Likewise, a good recommendation for a user who studied "Hamlet Chapter 1" would be "Hamlet Chapter 2". In order to generate recommendations such as these, Covington et al. added search tokens as a feature to their network.

In order to deal with the vast swaths of YouTube videos, Covington et al. split their network into two sub-networks. One network served to filter a large corpus of videos down to those which the user may be interested in, and the second network (with access to many more features than the first) served to rank these candidates. Finally, their algorithm was both content-based and collaborative, demonstrating the viability of a hybrid approach.

However, one major drawback of their method is the level of compute that Google provides Covington et al. This creates a challenge for us in building a neural recommendation system while remaining within realistic computational resources.

3. METHODS
In this section, we provide an overview of how we constructed our dataset, what our production system architecture will be, and how NERE is architected in detail.

3.1 Dataset Construction
In order to train NERE, Quizlet, Inc. assembled a proprietary dataset. Internally, we use Google BigQuery [14] for all of our data warehousing needs. From BigQuery, we assembled two datasets from our activity logs: one which detailed our users and their respective metadata, and a second which detailed all sets studied by these users, and their respective metadata.

The users dataset contained the following fields:

Field                   Purpose
User ID                 Uniquely maps a row to a user.
Study Date              Biases the model to recommend newer content.
Obfuscated IP Address   Geo lookup to derive latitude and longitude for locality.
Preferred Term Lang     Most common language to study terms in.
Preferred Def Lang      Most common language to study definitions in.
Preferred Platform      Most common platform (Web, iOS, etc.) to study on.
Beginning Timestamp     Timestamp for when the study session started.
Ending Timestamp        Timestamp for when the study session ended.
Set ID                  The set studied during the session.
Session Length          The number of minutes that the study session lasted.

Table 1: Fields describing our users and their metadata.

The sets dataset contained the following fields:

Field                   Purpose
Set ID                  Uniquely maps each set to a row.
Terms                   All terms in a set as a space-delimited string.
Definitions             All definitions in a set as a space-delimited string.
Studier Count           Number of unique users that have studied this set.
Broad Subject           A high-level subject classification of the set.
Mean Studier Age        The average age of the users who study the set.
Term Language           The language that terms are in.
Definition Language     The language that definitions are in.
Total Views             The total number of views that this set has received.
Has Images              Indicates whether this set contains images.
Has Diagrams            Indicates whether this set contains diagrams.
Preferred Study Mode    The most common study mode used with this set.
Preferred Platform      The most common platform (Web, iOS, etc.) used.
Mean Session Length     The average session length for this set, in minutes.

Table 2: Fields describing the sets and their metadata.

Once the datasets were assembled, we began cleaning the data. Since user privacy is quite important to Quizlet's values, we removed all users below the age of thirteen, and obfuscated Internet Protocol (IP) addresses by dropping the last octet. We believe that this is an important step towards preserving anonymity while still preserving quality recommendations.

All categorical variables, such as term language, were mapped to integers. All continuous variables were scaled between zero and one (with unit variance) to ensure smooth gradients. We replaced any missing continuous values with the mean of the dataset. Lastly, we mapped all IP addresses to their respective latitude and longitude, with the intuition that students in close proximity may be studying similar sets.
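As a rough sketch of this cleaning step, assuming a pandas DataFrame and hypothetical column names (the actual pipeline is internal to Quizlet's data warehouse):

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    def clean(df: pd.DataFrame, categorical_cols, continuous_cols) -> pd.DataFrame:
        # Privacy: drop users below thirteen ('age' is a hypothetical column).
        df = df[df["age"] >= 13].copy()
        # Map each categorical variable, such as term language, to an integer code.
        for col in categorical_cols:
            df[col] = df[col].astype("category").cat.codes
        # Fill missing continuous values with the dataset mean, then scale to [0, 1].
        df[continuous_cols] = df[continuous_cols].fillna(df[continuous_cols].mean())
        df[continuous_cols] = MinMaxScaler().fit_transform(df[continuous_cols])
        return df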
Finally, a preliminary test of NERE with this dataset found it difficult to model students who were studying for multiple classes on Quizlet. Intuitively, this makes sense, as the recurrent neural network is looking for temporal relations in places where these relations were murky at best. We solve this by separating sequences by their Broad Subject column (an enumerated type covering Theology, History, Uncommon Languages, Communications, Formal Sciences, Visual Arts, Social Sciences, Applied Sciences, Vocabulary, German, Performing Arts, Sports, French, Reading Vocabulary, Spanish, Natural Sciences, and Geography). In practice, this was done by concatenating each User ID with the subject they studied, ensuring each row is unique in both user and subject classification. After cleaning, we were left with 1,616,004 unique user-subject combinations to be fed into our model.

To vectorize our Terms and Definitions, we took the space-delimited strings and removed stopwords and non-ASCII characters. Next, we tokenized the text and trained 128-dimensional GloVe embeddings, which effectively creates an implementation of Set2Vec [12]. These embeddings were concatenated with the preprocessed set metadata to create our set vectors.
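A minimal sketch of this vectorization, under the assumption that the trained 128-dimensional GloVe vectors are available as a Python dict and that token vectors are averaged into one set vector (the paper does not specify the pooling operation):

    import re
    import numpy as np

    STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}  # abbreviated for illustration

    def set2vec(terms: str, definitions: str, glove: dict, dim: int = 128) -> np.ndarray:
        """Pool the GloVe vectors of a set's tokens into a single 128-d set vector."""
        # Strip non-ASCII characters, lowercase, and tokenize.
        text = f"{terms} {definitions}".encode("ascii", errors="ignore").decode()
        tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]
        vectors = [glove[t] for t in tokens if t in glove]
        return np.mean(vectors, axis=0) if vectors else np.zeros(dim)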
Finally, we transformed our dataset into a timeseries format by concatenating all user study sessions along a single axis and sorting by ending timestamp. We chose a session length of 5 timesteps, since 90% of our users have at least five sessions. The dimensions of the resultant datasets are as follows:

   • User Metadata: (1616004, 5, 13)

   • Set Metadata: (1616004, 5, 12)

   • Set Content Vectors: (1616004, 5, 128)
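The windowing that produces these tensors can be sketched as follows; the grouping key and the choice to keep each user-subject's five most recent sessions are our assumptions:

    import numpy as np

    def build_sequences(sessions_by_key: dict, timesteps: int = 5) -> np.ndarray:
        """Stack each user-subject's last `timesteps` sessions into one training row.

        sessions_by_key maps a (user_id, broad_subject) key to a list of
        d-dimensional session feature vectors, sorted by ending timestamp.
        """
        rows = []
        for key, sessions in sessions_by_key.items():
            if len(sessions) < timesteps:
                continue  # we assume users with fewer than five sessions are dropped
            rows.append(np.stack(sessions[-timesteps:]))
        return np.stack(rows)  # shape: (num_rows, timesteps, d)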
3.2 System Architecture
For deployment purposes, we have the following system architecture.

Figure 1: This figure depicts how our model is used to serve recommendations in production.

Quizlet uses Apache Airflow [16], the industry standard for Extract-Transform-Load (ETL) pipelines, to schedule jobs. Every week, Airflow reads the datasets from BigQuery, preprocesses them, and sends them to TensorFlow. TensorFlow predicts which sets each user should study next, and sends the embeddings back to Airflow. Airflow maps the vectors to sets by determining the N nearest neighbors of each embedding, and subsequently caches these recommendations to Spanner. Finally, our web server reads these recommendations from Spanner when serving content. Figure 1 depicts this flow visually.

Since the model takes 2ms to predict on each user with a CPU, we have opted to use a CPU-backed instance rather than a GPU-backed instance due to infrastructure cost.
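A hedged sketch of what such a weekly DAG could look like in Airflow's Python API; the task names and callables are hypothetical stand-ins for Quizlet's internal steps, not the production code:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    # Hypothetical stand-ins for the real pipeline steps described above.
    def read_bigquery():       ...  # export the users and sets datasets
    def predict_embeddings():  ...  # preprocess and run the TensorFlow model
    def map_and_cache():       ...  # N-nearest-neighbor lookup, write to Spanner

    with DAG("nere_weekly", start_date=datetime(2018, 1, 1),
             schedule_interval="@weekly") as dag:
        extract = PythonOperator(task_id="read_bigquery", python_callable=read_bigquery)
        predict = PythonOperator(task_id="predict", python_callable=predict_embeddings)
        cache = PythonOperator(task_id="cache_to_spanner", python_callable=map_and_cache)
        extract >> predict >> cache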
3.3 Algorithm
In this subsection, we first introduce a formalization of our set-based recommendation task. Then, we describe our proposed NERE model architecture in detail.

Session-based recommendation is the task of predicting what a user would like to study next when their previous history and metadata are provided.

We let $X = [s_1, s_2, s_3, \ldots, s_{n-1}, s_n]$ be a study session, where $s_i \in S$ ($1 \le i \le n$), $n$ is the input length, and $S$ represents the pool of study sessions. We learn a function $f_{\hat{W}}(\cdot)$ such that for any given set of $n$ prefixes, we get an output $Y = f_{\hat{W}}(X)$.

Since our recommender will need to predict several states $[s^0_{n+1}, s^1_{n+1}, \ldots, s^m_{n+1}]$ for the $(n+1)^{th}$ timestep, where $m$ is the number of recommendations desired, we must be able to derive several Quizlet sets from $Y$. We let $Y$ be a 128-dimensional vector that represents the content of a Quizlet set, and perform NNDescent [5], a fast, approximate $m$-nearest-neighbors search algorithm, on $Y$. We find that this provides an efficient manner to recommend multiple sets while maintaining a dense representation for the model to learn.
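A sketch of this lookup using the pynndescent implementation of NNDescent [5]; the library choice and the cosine metric are our assumptions:

    import numpy as np
    from pynndescent import NNDescent

    def recommend(set_vectors: np.ndarray, y_pred: np.ndarray, m: int = 10) -> np.ndarray:
        """Return indices of the m sets whose 128-d vectors are closest to Y."""
        index = NNDescent(set_vectors, metric="cosine")  # one-time index build
        neighbor_ids, _distances = index.query(y_pred.reshape(1, -1), k=m)
        return neighbor_ids[0]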
3.4 Model Architecture
Our model consists of 56 layers, 22 of which are inputs to the model. Figure 2 depicts a portion of our model architecture.

Figure 2: This figure provides a slice of our model architecture; some inputs have been excluded for brevity.

In our architecture, we employ quite a few non-standard layers popular in Natural Language Processing. The remainder of this subsection explains these layers.

3.4.1 Embedding Layer
In order to provide a dense representation for our categorical variables, we trained an embedding matrix [11]. Each categorical variable $C_i \in C$, where $C$ is the set of categorical variables, was mapped to a 32-dimensional representation. This was done with the explicit intention that the model may learn a spatial relation for some of these variables.

Each category $c_j \in C_i$ ($1 \le j \le |C_i|$) is learned using the following lookup table:

$$LT_{W^i}(j) = W^i_j \qquad (1)$$

where $W^i \in \mathbb{R}^{32 \times |C_i|}$, $|C_i|$ represents the number of categories in $C_i$, and $W^i_j$ is the $j^{th}$ column of matrix $W^i$, i.e. the 32-dimensional vector corresponding to category $c_j$. It is important to note that the entirety of this matrix is randomly initialized, and the vectors are learned jointly through backpropagation.
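In Keras terms, one such lookup table could be sketched as follows (the 32-dimensional width follows Eq. (1); names are illustrative):

    from tensorflow.keras import layers

    def categorical_embedding(num_categories: int, timesteps: int = 5):
        """Map integer category codes to jointly learned 32-d vectors (Eq. 1)."""
        codes = layers.Input(shape=(timesteps,), dtype="int32")
        # W^i is a randomly initialized |C_i| x 32 matrix, learned by backprop.
        vectors = layers.Embedding(input_dim=num_categories, output_dim=32)(codes)
        return codes, vectors  # vectors shape: (batch, timesteps, 32)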

3.4.2 Bidirectional Layers
Bidirectional Layers [15] are commonly utilized to help models learn sequences.

The intuition behind bidirectional layers is that they help recurrent layers learn sequences by making the context more explicit. A bidirectional layer splits a recurrent layer into a part that is responsible for learning the input normally, and another part that is responsible for learning the input backwards; this helps the model understand what may happen in the future.

Formally, given some study sequence $x_1, x_2, x_3, \ldots, x_{n-1}, x_n$, it would feed $[(x_1, x_n), (x_2, x_{n-1}), \ldots, (x_n, x_1)]$ as the input. At first sight, one might believe that this leaks information; however, humans do precisely the same by inferring future states from previous experience.
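A minimal Keras sketch of such a layer; the unit count and merge mode are illustrative assumptions, not values from the paper:

    from tensorflow.keras import layers

    # One part of the layer reads the sequence forward, the other reads it
    # backward; their outputs are concatenated at every timestep.
    bidirectional_gru = layers.Bidirectional(
        layers.GRU(64, return_sequences=True), merge_mode="concat"
    )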
3.4.3 Attention With Context
Based on the work of Yang et al., Attention With Context is a mechanism that helps the model learn which features are important, and which ones may be discarded. As the name may imply, it helps the model pay attention.

Formally, we add a new layer that performs the following operation. We assume that $i$ is the $i^{th}$ timestep in our input, and $t$ is the $t^{th}$ element of vector $i$. Lastly, $h_{it}$ is the output of the $i^{th}$ element of the $t^{th}$ timestep in the layer that precedes our attention layer. The following equations describe the operations of the Attention layer:

$$u_{it} = \tanh(W_w h_{it} + b_w) \qquad (2)$$

$$\alpha_{it} = \frac{\exp(u_{it}^\top u_w)}{\sum_t \exp(u_{it}^\top u_w)} \qquad (3)$$

$$s_i = \sum_t \alpha_{it} h_{it} \qquad (4)$$

where $u_w$ is a learned feature-level attention vector, $W_w$ holds the weights of the attention layer, and $\alpha_{it}$ is the weight given to the $t^{th}$ element of the $i^{th}$ vector. Intuitively, this implementation makes a lot of sense: the model is computing how important each feature in each timestep is against all other features in the same timestep, and re-weighing the input accordingly. All weights in this layer are randomly initialized and jointly learned throughout the training process.
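A sketch of Eqs. (2)-(4) as a custom Keras layer, normalizing over timesteps as in Eq. (3); this is our reading of the equations, not Quizlet's production code:

    import tensorflow as tf
    from tensorflow.keras import layers

    class AttentionWithContext(layers.Layer):
        """Score each timestep against a learned context vector u_w and
        return the attention-weighted sum of the inputs (Eqs. 2-4)."""

        def build(self, input_shape):
            d = int(input_shape[-1])
            self.W_w = self.add_weight("W_w", shape=(d, d), initializer="glorot_uniform")
            self.b_w = self.add_weight("b_w", shape=(d,), initializer="zeros")
            self.u_w = self.add_weight("u_w", shape=(d,), initializer="glorot_uniform")

        def call(self, h):  # h: (batch, timesteps, d)
            u = tf.tanh(tf.tensordot(h, self.W_w, axes=1) + self.b_w)         # Eq. (2)
            alpha = tf.nn.softmax(tf.tensordot(u, self.u_w, axes=1), axis=1)  # Eq. (3)
            return tf.reduce_sum(h * tf.expand_dims(alpha, -1), axis=1)       # Eq. (4)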
3.4.4 Miscellaneous Features
While most other works have used Long Short-Term Memory (LSTM) [8] cells for their recurrent unit, we chose to use Gated Recurrent Unit (GRU) [2] cells. As Chung et al. show in [3], GRU cells are commonly more practical for short sequences due to not having an internal memory. We saw a noticeable speedup of more than 20% when using a GRU cell over an LSTM.

In order for these models (over 5,994,444 learnable parameters) to generalize, we had to apply some strict regularization. We applied 50% dropout on layers following a recurrent cell, and applied 0.001 L2 regularization on the recurrent kernel itself. Furthermore, we used batch normalization to ensure that our inputs are zero-centered with normalized variance. Following the results of Santurkar et al. [13], we also noticed faster training times as a result of these smoother gradients.
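Putting these pieces together, a condensed sketch of the trunk of such a model might look as follows. The layer widths are illustrative; the 153-feature input assumes concatenating the 13 user metadata, 12 set metadata, and 128 content features from Section 3.1, and AttentionWithContext is the sketch from Section 3.4.3:

    from tensorflow.keras import layers, models, regularizers

    def build_nere_trunk(timesteps: int = 5, features: int = 153) -> models.Model:
        x_in = layers.Input(shape=(timesteps, features))
        x = layers.BatchNormalization()(x_in)            # zero-centered, normalized inputs
        x = layers.Bidirectional(layers.GRU(
            128, return_sequences=True,
            recurrent_regularizer=regularizers.l2(0.001)))(x)  # L2 on the recurrent kernel
        x = layers.Dropout(0.5)(x)                       # 50% dropout after the recurrent cell
        x = AttentionWithContext()(x)                    # sketch from Section 3.4.3
        y = layers.Dense(128)(x)                         # predicted 128-d set content embedding
        return models.Model(x_in, y)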
4. RESULTS
In this section, we evaluate NERE from a qualitative and quantitative perspective. We compare our model against a baseline matrix factorization approach, and analyze several variations of the model for the purposes of introspection.

Table 3 shows the qualitative results of our recommendation system. The Studied column shows the set that the user studied, while the Recommendation column shows the set that was recommended for the user to study. For this particular recommendation, our system understands that a student had been learning about discussing time (in terms of days of the week) in French, and recommended a corresponding set about months of the year. This shows that the model understands that the user is learning about temporal relations. On a higher level, this demonstrates a level of understanding of both the content that a user desires to learn and the difficulty at which they desire to learn it.

                        Recommendation Results
            Studied                             Recommendation
 Term                Definition                 Term          Definition
 lundi               Monday                     au printemps  spring
 mardi               Tuesday                    en été        summer
 mercredi            Wednesday                  Les mois      the months
 jeudi               Thursday                   Janvier       January
 vendredi            Friday                     Février       February
 samedi              Saturday                   Mars          March
 dimanche            Sunday                     Avril         April
 un an               a year                     Mai           May
 une année           a year                     Juin          June
 après               after                      Juillet       July
 avant               before                     Août          August
 après-demain        the day after tomorrow     Septembre     September
 un après-midi       an afternoon               Octobre       October
 aujourd'hui         today                      Novembre      November
 demain              tomorrow                   Décembre      December
 demain matin        tomorrow morning           Quand         When
 demain après-midi   tomorrow afternoon         Où            Where
 demain soir         tomorrow night             Comment       How
 hier                yesterday                  Avec qui      With whom

Table 3: The results of our recommendation system.

We use two proxies to assess model accuracy: recall@100 and R^2. In order to compute recall@100, we take the 100 nearest neighbors of our output embedding, and check whether the set that the learner actually studied at timestep $T_{n+1}$ is among them. If it is, we mark that recommendation as correct; otherwise, it is incorrect. We use the 100 nearest neighbors due to the density of our embedding space, as well as the fact that many of the sets in our embedding space are near-duplicates due to a lack of canonicalization.

We use R^2 to assess whether the predictions in the embedding space match the actual distribution; this serves as a sanity check to ensure that our model's output distribution is correlated to the expected distribution.
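A sketch of this metric, assuming a queryable approximate-nearest-neighbor index over all set vectors (as in Section 3.3):

    import numpy as np

    def recall_at_100(index, y_pred: np.ndarray, true_set_ids: np.ndarray) -> float:
        """Fraction of predictions whose true next set is among the 100 nearest
        neighbors of the predicted embedding. `index` is an ANN index whose
        query() returns (neighbor_ids, distances), as in pynndescent."""
        neighbor_ids, _ = index.query(y_pred, k=100)   # shape: (num_examples, 100)
        hits = [true_id in row for true_id, row in zip(true_set_ids, neighbor_ids)]
        return float(np.mean(hits))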
4.0.1 Comparison Against Matrix Factorization
We compare the performance of NERE against that of TensorRec [9], a library written by James Kirk that uses the TensorFlow API. TensorRec accepts a user matrix, an item matrix, and an interactions matrix as inputs, and formulates a predictions matrix as an output. For the user matrix, we provide the same user metadata matrix that NERE is provided. We concatenate the set vectors and set metadata, and this represents the item matrix. Lastly, we create an interactions matrix of dimensions $(|USERS|, |SETS|)$, where entry $(i, j) = 1$ if user $i$ studied set $j$.

We trained TensorRec on this dataset, and it obtained a recall@100 of 0.12 after convergence. We believe this validates our belief in a core difference between a matrix factorization approach and our approach: even after extensive customization, an approach based on temporal data is much more likely to provide quality recommendations for educational content.
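For reference, this interactions matrix can be sketched as a sparse binary matrix; the precomputed user and set index arrays are assumed:

    import numpy as np
    from scipy.sparse import csr_matrix

    def build_interactions(user_idx: np.ndarray, set_idx: np.ndarray,
                           n_users: int, n_sets: int) -> csr_matrix:
        """(i, j) = 1 if user i studied set j; all other entries are zero."""
        data = np.ones(len(user_idx))
        return csr_matrix((data, (user_idx, set_idx)), shape=(n_users, n_sets))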
4.0.2 Input Sequence Length
Our NERE model is based on the assumption that a user is purposefully selecting sets to study, and that those sets are topically related to a greater theme. This permits us to also believe that the sets are temporally related, and therefore enables us to use a recurrent neural network.

Figure 3 validates this assumption by comparing model performance against the input sequence length. We see that the R^2 score slowly converges, but that the recall@100 metric steadily increases until our fourth input sequence. This implies that there may be performance advantages to be gained by increasing the length of the input sequence past four. However, since we begin to lose a significant number of users from our dataset if we extend beyond five timesteps, we risk creating a model that will not generalize to our entire userbase. As a result, we believe that five timesteps is a good balance between desired accuracy and generalizability.

Figure 3: This figure visualizes how the length of the input may affect model performance.

4.0.3 Where's the Attention?
One popular use of attention in deep neural networks is to visualize the model's understanding of the input. Figure 4 visualizes how the model pays attention to the input, as well as how it learns the attention vector over time. Brighter rectangles indicate that more attention is being placed on those blocks.

Figure 4: This figure visualizes the model's internal attention vector.

These results show incredible insight into the decision process of the model. We can see that at the beginning of the input, the model focuses on the metadata; aspects such as term and definition language are deemed incredibly important. However, as time goes on, the attention shifts from set and user metadata towards content-based features. The attention in the very last timestep rests on the content, which aligns with our expectations.

4.0.4 A Purely Content/Collaborative Approach
Next, we try to understand how important our features are to the model. We train and test two variations, with and without the 128-dimensional content vectors, to see how important a content-based approach is for NERE. The impacts of these variations are demonstrated in Table 4.

                 Both    Content    Metadata
    R^2          0.81    0.78       0.55
    Recall@100   0.54    0.38       0.001

Table 4: The importance of our content vectors.

This shows that a hybrid approach (both collaborative and content-based) is clearly superior to either one independently. It is important to notice that a content-based approach will obtain a high R^2 score, since it is easy for the model to learn the underlying distribution, but it will not recommend the appropriate set. This demonstrates the importance of the various collaborative features that we explicitly include.

For example, the nearest neighbor for a set whose term and definition languages are Spanish is actually a set whose term and definition languages are German. However, the model will continue to recommend sets with terms and definitions in German, since it has learned this preference from the user's prior history. This speaks to the importance of collaborative features in NERE.

On the whole, we have shown that NERE provides quality recommendations with which we can provide a deeply personalized experience for learning, and we believe these results exceed expectations for our application.
5. CONCLUSION & FUTURE WORK
In this work, we have proposed the Neural Educational Recommendation Engine (NERE) to address the problem of personalized sequential recommendation in the Technology Enabled Learning (TEL) domain. By leveraging both content-based and collaborative features, our model can capture temporal trends in a user's history, and provide recommendations as to what they should learn next. By incorporating features such as attention and bidirectionality into our model, we were able to achieve a state-of-the-art recall@100 score of 0.54. Moreover, we have performed an analysis of our model and have shown that it outperforms both a standalone content-based and a standalone collaborative approach. Lastly, we have shown that our model learns from both the user and set metadata, in addition to content, by visualizing the attention mechanism.

As for future work, we believe there is significant work left to be done in ranking the suggestions; there are significantly better ways to choose sets from a candidate pool than to recommend the N closest neighbors. Furthermore, we believe that an attempt at canonicalizing similar sets would increase the recall@100 metric, and should be explored.
6. ACKNOWLEDGEMENTS
First and foremost, I would like to thank my mentors Dustin Stansbury and Shane Mooney for the exceptional support and mentorship throughout this project. Both of them were supportive, answered my many questions, and were quite open to letting me explore. Shane, thank you for providing much needed practical wisdom, for reviewing countless pull requests, and for providing much needed commentary on this paper. Dustin, thank you for the incredible knowledge about all things machine learning. This project wouldn't have been possible without you two.

I would also like to acknowledge Alex Pinchuk and Shaun Mitschrich for providing endless platform support throughout this project, including honoring my numerous requests for more compute.

Lastly, I would like to acknowledge the fabulous Quizlet team who provided incredible companionship throughout this summer, as well as my parents for supporting me throughout this process.

Keep on learning!

7. REFERENCES
[1] L. Brozovsky and V. Petricek. Recommender System for Online Dating Service. 2007.
[2] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. 2014.
[3] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. pages 1-9, 2014.
[4] P. Covington, J. Adams, and E. Sargin. Deep Neural Networks for YouTube Recommendations. Proceedings of the 10th ACM Conference on Recommender Systems - RecSys '16, pages 191-198, 2016.
[5] W. Dong, C. Moses, and K. Li. Efficient k-nearest neighbor graph construction for generic similarity measures. Proceedings of the 20th International Conference on World Wide Web - WWW '11, page 577, 2011.
[6] C. A. Gomez-Uribe and N. Hunt. The Netflix Recommender System. ACM Transactions on Management Information Systems, 6(4):1-19, 2015.
[7] Hillá Meller. SimilarWeb Digital Visionary Awards: 2015, 2015.
[8] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997.
[9] J. Kirk. TensorRec: A Recommendation Engine Framework in TensorFlow, 2017.
[10] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76-80, 2003.
[11] D. López-Sánchez, J. R. Herrero, A. G. Arrieta, and J. M. Corchado. Hybridizing metric learning and case-based reasoning for adaptable clickbait detection, 2017.
[12] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[13] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift). 2018.
[14] K. Sato. An Inside Look at Google BigQuery. White Paper, Google Inc, 2012.
[15] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997.
[16] D. P. Takamori. Apache Airflow, 2016.
[17] S. Tang and Z. A. Pardos. Personalized Behavior Recommendation. Adjunct Publication of the 25th Conference on User Modeling, Adaptation and Personalization - UMAP '17, pages 165-170, 2017.