<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Neural Educational Recommendation Engine (NERE)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Moin Nadeem</string-name>
          <email>moin.nadeem@quizlet.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dustin Stansbury</string-name>
          <email>dustin@quizlet.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shane Mooney</string-name>
          <email>shane@quizlet.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Quizlet, Inc</institution>
          ,
          <addr-line>501 2nd St, San Francisco, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Quizlet is the most popular online learning tool in the United States, and is used by over 2/3 of high school students and 1/2 of college students. With more than 95% of Quizlet users reporting improved grades as a result, the platform has become the de facto tool used in millions of classrooms. In this paper, we explore the task of recommending suitable content for a student to study, given their prior interests as well as what their peers are studying. We propose a novel approach, the Neural Educational Recommendation Engine (NERE), which recommends educational content by leveraging student behaviors rather than ratings. We have found that this approach better captures social factors that are more aligned with learning. NERE is based on a recurrent neural network that includes collaborative and content-based approaches for recommendation, and takes into account any particular student's speed, mastery, and experience to recommend the appropriate task. We train NERE by jointly learning the user embeddings and content embeddings, and attempt to predict the content embedding for the final timestamp. We also develop a confidence estimator for our neural network, which is a crucial requirement for productionizing this model. We apply NERE to Quizlet's proprietary dataset and present our results. We achieved an R<sup>2</sup> score of 0.81 in the content embedding space, and a recall score of 54% on our 100 nearest neighbors. This vastly exceeds the recall@100 score of 12% that a standard matrix-factorization approach provides. We conclude with a discussion of how NERE will be deployed, and position our work as one of the first educational recommender systems for the K-12 space.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Founded in 2005, and used by more than 2/3 of high school
students, Quizlet, Inc. is the fastest-growing educational
website in the United States [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The interactive platform
permits students to learn any given "set", or collection of
terms and definitions, in a variety of ways. However, with
over 30 million monthly active users and 250 million study
sets, it has become nearly impossible for users to sift through
all of the available content. This motivates the need for a
system that will adapt to a user's preferences and make
recommendations on what they should study next, given their
prior history.
      </p>
      <p>Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under</p>
      <p>
        This is motivated not only from a product perspective, but
also by the rise of personalized learning. As a result of the
rise of personalization in e-commerce [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], social media
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and dating [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], many in education and research have
grown curious about the implications personalized learning
may have upon students.
      </p>
      <p>Personalized learning can be defined as any functionality
which enables a system to uniquely address each individual
learner's needs and characteristics. This includes, but isn't
limited to, prior knowledge, rate of learning, interests, and
preferences. This provides the ability to ensure that each
user's experience is best optimized for their unique needs
and may save them time that would be otherwise wasted.</p>
      <p>For an example that is applicable to Quizlet, one user
may prefer content suited to Spell
Mode (where students practice spelling by typing the
spoken word). Our algorithm would take that into account
by biasing recommendations toward sets commonly studied in
Spell Mode. Similarly, we may expect our algorithm to take
user performance into account, and continue to recommend
topics that the user hasn't quite mastered yet.</p>
      <p>The main contribution of this paper is a deep learning
based system that provides personalized recommendations
to Quizlet users, answering the question "What should I
study next?".</p>
      <p>The rest of this paper is structured as follows: a
summary of previous literature on (educational)
recommender systems is provided in Section 2. Section 3 provides
an overview of our system architecture, model architecture,
and dataset construction. We continue with a qualitative
and quantitative assessment of our system in Section 4.
Finally, we conclude our paper and provide a direction for
future work in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. BACKGROUND</title>
      <p>
        Recommender Systems are a widely studied field, with
contributions from major players such as Netflix [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Google
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and Amazon [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The vast majority of these methods
use matrix factorization techniques to decompose a user
preferences matrix and an item ratings matrix into a latent
space that represents how a user may rate a new item; this
latent space is commonly derived from an Alternating Least
Squares (ALS) algorithm.
      </p>
      <p>However, we believe that matrix factorization approaches
aren't well suited for educational applications. To begin,
the user-set matrix is extremely sparse. This makes
standard matrix-factorization-based methods infeasible. These
methods are also ill-suited to material that is sequenced with
temporal dependencies, as is usually the case for educational
material.</p>
      <p>Instead, we attempt to make the problem computationally
tractable by using recurrent neural networks and set vectorization,
which learn temporal dependencies and a
dense representation of our data, respectively. The rest of
this section serves to summarize the current state of deep
neural networks with respect to both
recommender systems and Technology Enabled
Learning (TEL). We rely heavily upon previous contributions from
the intersection of the two fields: Recommender Systems for
Technology Enabled Learning (RecSysTEL).</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Literature Review</title>
      <p>
        Most recently, Tang &amp; Pardos [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] are the only other
researchers in the RecSysTEL field who have explored the use
of Recurrent Neural Networks (RNNs) for the purposes of
personalization in learning. Their work leveraged RNNs to
model navigational behaviors throughout Massive Open
Online Courses (MOOCs). This research was conducted
with the explicit intention of accelerating or decelerating
learning as a result of performance in a given subject; the
benefit to the user is a reduction in learning time and/or
increased performance.
      </p>
      <p>We believe that this work is quite notable due to the level
of detail included in the model. Interactions as fine-grained
as video pauses and changing video speed are included in
the model as a proxy for mastery. However, Tang &amp; Pardos'
algorithm was purely collaborative, and never leveraged the
content of the MOOC(s) studied. We believe that this is an
underexplored area in RecSysTEL, and aim for this to be a
major contribution of our work.</p>
      <p>
        Outside of the field of education, Covington, Adams, and
Sargin [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] at YouTube have developed the first
recommendation system used in an industry setting that leverages deep
neural networks.
      </p>
      <p>Covington et al.'s paper is interesting for two reasons.
First, it demonstrates a successful use of a neural
recommendation system at scale, thus mitigating any concerns
about scaling such a system in production. Secondly, videos
are quite analogous to Quizlet sets: both videos and sets
represent ways to learn about topics, and may be episodic
in nature.</p>
      <p>To provide an example, if a user watched "Full House
Episode 1" on YouTube, a good recommendation would be
"Full House Episode 2". Likewise, a good recommendation
for a user who studied "Hamlet Chapter 1" would be
"Hamlet Chapter 2". In order to generate recommendations such
as these, Covington et al. added search tokens as a feature
to their network.</p>
      <p>In order to deal with the vast swaths of YouTube videos,
Covington et al. split their network into two sub-networks.
One network served to filter a large corpus of videos down to
those which the user may be interested in, and the second
network (with access to many more features than the first)
served to rank these candidates. Finally, their algorithm was
both content-based and collaborative, demonstrating the
viability of a hybrid approach.</p>
      <p>However, one major drawback of their method is the level
of compute that Google provides Covington et al.
This creates a challenge for us in building a neural
recommendation system while remaining within realistic
computational resources.</p>
    </sec>
    <sec id="sec-4">
      <title>3. METHODS</title>
      <p>In this section, we provide an overview of how we
constructed our dataset, what our production system
architecture will be, as well as how NERE is architected in detail.</p>
    </sec>
    <sec id="sec-5">
      <title>3.1 Dataset Construction</title>
      <p>
        In order to train NERE, Quizlet, Inc. assembled a
proprietary dataset. Internally, we use Google BigQuery [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
for all of our data warehousing needs. From BigQuery, we
assembled two datasets from our activity logs: one which
detailed our users and their respective metadata, and the
second which detailed all sets studied by these users, and
their respective metadata.
      </p>
      <p>The users dataset contained the following fields:</p>
      <table-wrap>
        <table>
          <thead>
            <tr><th>Field</th><th>Purpose</th></tr>
          </thead>
          <tbody>
            <tr><td>User ID</td><td>Uniquely mapping a row to a user.</td></tr>
            <tr><td>Study Date</td><td>Bias the model to recommend newer content.</td></tr>
            <tr><td>Obfuscated IP Address</td><td>Geo lookup to derive latitude and longitude for locality.</td></tr>
            <tr><td>Preferred Term Lang</td><td>Most common language to study terms in.</td></tr>
            <tr><td>Preferred Def Lang</td><td>Most common language to study definitions in.</td></tr>
            <tr><td>Preferred Platform</td><td>Most common platform (Web, iOS, etc.) to study on.</td></tr>
            <tr><td>Beginning Timestamp</td><td>Timestamp for when the study session started.</td></tr>
            <tr><td>Ending Timestamp</td><td>Timestamp for when the study session ended.</td></tr>
            <tr><td>Set ID</td><td>The set they studied during their session.</td></tr>
            <tr><td>Session Length</td><td>The number of minutes that their study session lasted.</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Once the datasets were assembled, we began cleaning the
data. Since user privacy is quite important to Quizlet's
values, we removed all users below the age of thirteen, and
obfuscated Internet Protocol (IP) addresses by dropping the
last octet. We believe that this is an important step
towards preserving anonymity while still preserving quality
recommendations.</p>
      <p>All categorical variables, such as term language, were mapped
to integers. All continuous variables were scaled between
zero and one (with unit variance) to ensure smooth
gradients. We replaced any missing continuous values with the
mean of the dataset. Lastly, we mapped all IP addresses
to their respective latitude and longitude, with the intuition
that students in close proximity may be studying similar
sets.</p>
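      <p>The cleaning steps above (integer-coding categoricals, scaling continuous values, mean imputation) can be sketched as follows. This is an illustrative Python sketch, not Quizlet's production pipeline; the column contents and the exact scaler are assumptions.</p>
      <preformat>
```python
import numpy as np

def encode_categorical(col):
    """Map each categorical value (e.g. a term language) to an integer."""
    mapping = {v: i for i, v in enumerate(sorted(set(col)))}
    return [mapping[v] for v in col], mapping

def preprocess_continuous(col):
    """Impute missing values with the column mean, then scale to [0, 1]."""
    col = np.asarray(col, dtype=float)
    col = np.where(np.isnan(col), np.nanmean(col), col)  # mean imputation
    lo, hi = col.min(), col.max()
    return (col - lo) / (hi - lo) if hi > lo else np.zeros_like(col)
```
      </preformat>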
      <p>Finally, a preliminary test of NERE with this dataset
found it difficult to model students who were studying for
multiple classes on Quizlet. Intuitively, this makes sense,
as the recurrent neural network is looking for temporal
relations in places where these relations were murky at best.
We solve this by separating sequences by their broad
subject column. This was done in practice by concatenating
each User ID with the subject they studied, ensuring each
row is unique in both user and subject classification. After
cleaning, we were left with 1,616,004 unique user-subject
combinations to be fed into our model.</p>
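      <p>The user-subject separation described above can be sketched as a simple grouping step, assuming each session row carries a (user_id, subject, set_id) triple (field names are illustrative):</p>
      <preformat>
```python
from collections import defaultdict

def split_sessions_by_subject(sessions):
    """Group sessions into per-(user, subject) sequences by concatenating
    the User ID with the studied subject, so each key is unique in both
    user and subject. `sessions` is an iterable of
    (user_id, subject, set_id) tuples, ordered chronologically."""
    sequences = defaultdict(list)
    for user_id, subject, set_id in sessions:
        key = "{}_{}".format(user_id, subject)  # one sequence per pair
        sequences[key].append(set_id)
    return dict(sequences)
```
      </preformat>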
      <p>
        To vectorize our words and definitions, we took the
space-delimited string and removed stopwords and non-ASCII
characters. Next, we tokenized it and trained 128-dimensional
GloVe embeddings, which effectively creates an
implementation of Set2Vec [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. These embeddings were concatenated
along with the preprocessed set metadata to create our set
vectors.
      </p>
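      <p>A minimal sketch of the token cleaning, with set content pooled by averaging word vectors. The stopword list is an illustrative subset, and averaging is our own assumption for the pooling step; the paper trains its own 128-dimensional GloVe embeddings rather than using a fixed lookup.</p>
      <preformat>
```python
STOPWORDS = {"the", "a", "an", "of", "and"}  # illustrative subset

def tokenize_set(text):
    """Drop non-ASCII characters and stopwords, then split into tokens."""
    ascii_text = text.encode("ascii", errors="ignore").decode("ascii")
    return [t for t in ascii_text.lower().split() if t not in STOPWORDS]

def set_vector(texts, embeddings, dim=128):
    """Average the word vectors of a set's terms and definitions.
    `embeddings` maps token to a list of `dim` floats (hypothetical)."""
    total, count = [0.0] * dim, 0
    for text in texts:
        for tok in tokenize_set(text):
            if tok in embeddings:
                total = [a + b for a, b in zip(total, embeddings[tok])]
                count += 1
    return [x / count for x in total] if count else total
```
      </preformat>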
      <p>Finally, we transformed our dataset into a time-series
format by concatenating all user study sessions into a single
axis and sorting by ending timestamp. We chose a session
length of 5 timesteps, since 90% of our users have at least
five sessions. The dimensions of the resultant datasets are
as follows:</p>
      <p>User Metadata: (1616004, 5, 13)</p>
      <p>Set Metadata: (1616004, 5, 12)</p>
      <p>Set Content Vectors: (1616004, 5, 128)</p>
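      <p>The fixed five-timestep windowing can be sketched as below; keeping the most recent sessions and zero-padding shorter histories are our own illustrative choices (the paper keeps users with at least five sessions):</p>
      <preformat>
```python
import numpy as np

def to_fixed_length(session_vectors, n_steps=5):
    """Trim a user's chronologically sorted session vectors to the last
    `n_steps`, or left-pad with zeros, yielding one (n_steps, dim) slice
    of the (N, 5, dim) tensors above."""
    arr = np.asarray(session_vectors, dtype=float)
    if len(arr) >= n_steps:
        return arr[-n_steps:]  # keep the most recent sessions
    pad = np.zeros((n_steps - len(arr), arr.shape[1]))
    return np.vstack([pad, arr])
```
      </preformat>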
    </sec>
    <sec id="sec-6">
      <title>3.2 System Architecture</title>
      <p>For deployment purposes, we have the following system
architecture.</p>
      <p>
        Quizlet uses Apache Airflow [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], the industry standard
for Extract-Transform-Load (ETL) pipelines, to schedule
jobs. Every week, Apache Airflow reads datasets from
BigQuery. Within Airflow, this dataset is preprocessed and
sent to TensorFlow. TensorFlow predicts which sets the user
should study next, and sends the embedding back to Airflow.
Airflow maps the vectors to sets by determining the N
nearest neighbors of this embedding, and subsequently caches
these recommendations to Spanner. Finally, our web server
reads these recommendations from Spanner when serving
content. Figure 1 depicts this flow visually.
      </p>
      <p>(The broad subject field was of the following enumerated
type: Theology, History, Uncommon Languages,
Communications, Formal Sciences, Visual Arts, Social Sciences,
Applied Sciences, Vocabulary, German, Performing Arts,
Sports, French, Reading Vocabulary, Spanish, Natural
Sciences, and Geography.)</p>
      <p>Our web server reads from this cache when serving user
content. Since the model takes 2 ms to predict on each user
with a CPU, we have opted to use a CPU-backed instance
rather than a GPU-backed instance due to infrastructure
cost.</p>
    </sec>
    <sec id="sec-7">
      <title>3.3 Algorithm</title>
      <p>In this subsection, we rst introduce a formalization of
our set-based recommendation task. Then, we describe our
proposed NERE model architecture in detail.</p>
      <p>Session-based recommendation is the task of predicting
what a user would like to study next when their previous
history and metadata are provided.</p>
      <p>We let X = [s_1, s_2, s_3, ..., s_{n-1}, s_n] be a study session,
where s_i ∈ S (1 ≤ i ≤ n), n is the input length, and S
represents the pool of study sessions. We learn a function
f̂_W(·) such that, for any given set of n prefixes, we get an
output Y = f̂_W(X).</p>
      <p>
        Since our recommender will need to predict several states
[s_{n+1}^0, s_{n+1}^1, ..., s_{n+1}^m] for the (n + 1)th timestep, where m is
the number of recommendations desired, we must be able
to derive several Quizlet sets from Y. We let Y be a
128-dimensional vector that represents the content for a
Quizlet set and perform NNDescent [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for a fast, approximate
m-nearest neighbors search algorithm on Y. We find that
this provides an efficient manner to recommend multiple sets
while maintaining a dense representation for the model to
learn.
      </p>
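      <p>Mapping the predicted 128-dimensional embedding Y back to concrete sets can be sketched with an exact m-nearest-neighbors search; the paper uses approximate NNDescent for speed, so the exhaustive Euclidean search below is only a stand-in:</p>
      <preformat>
```python
import numpy as np

def recommend_sets(predicted_embedding, set_embeddings, m=5):
    """Return the indices of the m sets whose content embeddings lie
    closest (in Euclidean distance) to the predicted embedding."""
    dists = np.linalg.norm(set_embeddings - predicted_embedding, axis=1)
    return np.argsort(dists)[:m]  # indices of the m closest sets
```
      </preformat>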
    </sec>
    <sec id="sec-8">
      <title>3.4 Model Architecture</title>
      <p>Our model consists of 56 layers, 22 of which are inputs to
the model. Figure 2 depicts a portion of our model
architecture.</p>
      <p>In our architecture, we employ quite a few non-standard
layers popular in Natural Language Processing. The
remainder of this subsection explains these layers.</p>
      <sec id="sec-8-1">
        <title>3.4.1 Embedding Layer</title>
        <p>
          In order to provide a dense representation for our
categorical variables, we trained an embedding matrix [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>Each categorical variable C_i ∈ C, where C is the set of
categorical variables, was mapped to a 32-dimensional
representation. This was done with the explicit intention that
the model may learn a spatial relation for some of these
variables.</p>
        <p>Each category c_j ∈ C_i (1 ≤ j ≤ |C_i|) is learned using the
following lookup table:</p>
        <p>LT_{W^i}(j) = W_j^i
(1)</p>
        <p>Where W^i ∈ R^{32 × |C_i|}, |C_i| represents the number of
categories in C_i, and W_j^i is the jth column of matrix W^i that
represents the 32-dimensional vector corresponding to
category c_j. It is important to note that the entirety of this
matrix is randomly initialized, and the vectors are learned
jointly through backpropagation.</p>
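        <p>A minimal sketch of the lookup table in Eq. (1). In NERE the matrix is learned jointly by backpropagation; here it stays at its random initialization for illustration.</p>
        <preformat>
```python
import numpy as np

def make_embedding_table(num_categories, dim=32, seed=0):
    """Randomly initialised W of shape (dim, |C_i|); LT_W(j) = W[:, j]."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(dim, num_categories))

def lookup(W, j):
    """Return the dense vector for category j (the jth column of W)."""
    return W[:, j]
```
        </preformat>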
      </sec>
      <sec id="sec-8-2">
        <title>3.4.2 Bidirectional Layers</title>
        <p>
          Bidirectional Layers [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] are commonly utilized to help
models learn sequences.
        </p>
        <p>The intuition behind bidirectional layers is that they help
recurrent layers learn sequences by making the context more
explicit. A bidirectional layer splits a recurrent layer into a part that is
responsible for learning the input normally, and another part that
is responsible for learning the input backwards; this helps
the model understand what may happen in the future.</p>
        <p>Formally, given some study sequence x_1, x_2, x_3, ..., x_{n-1}, x_n,
it would feed [(x_1, x_n), (x_2, x_{n-1}), ..., (x_n, x_1)] as the input.
At first sight, one would believe that this leaks information;
however, humans do precisely the same by inferring future
states from previous experience.</p>
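        <p>The forward/backward pairing described above can be sketched directly:</p>
        <preformat>
```python
def bidirectional_inputs(xs):
    """Pair each forward-direction element with its backward-direction
    counterpart: [(x1, xn), (x2, xn-1), ..., (xn, x1)]."""
    return list(zip(xs, reversed(xs)))
```
        </preformat>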
      </sec>
      <sec id="sec-8-3">
        <title>3.4.3 Attention With Context</title>
        <p>Based on the work of Yang et al., Attention With
Context is a mechanism that helps the model learn which
features are important, and which ones may be discarded. As
the name may imply, it helps the model pay attention.</p>
        <p>Formally, we add a new layer that performs the following
operations. We assume that i is the ith timestep in our
input, and t is the tth element of vector i. Lastly, h_it
is the output for the tth element of the ith timestep in
the layer that precedes our attention layer. The following
equations describe the operations of the Attention layer:</p>
        <p>u_it = tanh(W_w h_it + b_w)
(2)</p>
        <p>α_it = exp(u_it · u_w) / Σ_t exp(u_it · u_w)
(3)</p>
        <p>s_i = Σ_t α_it h_it
(4)</p>
        <p>Where u_w is a learned feature-level attention vector, W_w
are the weights of the attention layer, and α_it is the weight of the
tth element of the ith vector. Intuitively, this
implementation makes a lot of sense: the model is computing how
important each feature in each timestep is against all other
features in the same timestep, and re-weighting the input
accordingly. All weights in this layer are randomly initialized
and jointly learned throughout the training process.</p>
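        <p>A plain-NumPy sketch of the attention computation for one input sequence; the shapes and initialization are illustrative, and in NERE the weights W_w, b_w, and u_w are learned during training:</p>
        <preformat>
```python
import numpy as np

def attention_with_context(h, W_w, b_w, u_w):
    """Feature-level attention over one sequence:
    u_t = tanh(h_t W_w + b_w), softmax scores against the context
    vector u_w, then a weighted sum of the hidden states h."""
    u = np.tanh(h @ W_w + b_w)    # (T, d)
    scores = u @ u_w              # (T,)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()   # softmax attention weights
    return (alpha[:, None] * h).sum(axis=0)  # re-weighted input
```
        </preformat>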
      </sec>
      <sec id="sec-8-4">
        <title>3.4.4 Miscellaneous Features</title>
        <p>
          While most other works have used Long Short-Term
Memory (LSTM) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] cells for their recurrent unit, we chose to
use Gated Recurrent Unit (GRU) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] cells. As Chung et al.
show in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], for short sequences, GRU cells are commonly
more practical because they lack a separate internal memory cell. We
saw a noticeable speedup of more than 20% when using
GRU cells over LSTMs.
        </p>
        <p>
          In order for these models (with over 5,994,444 learnable
parameters) to generalize, we had to apply some strict
regularization. We applied 50% dropout on layers following a
recurrent cell, and applied 0.001 L2 regularization on the
recurrent kernel itself. Furthermore, we used batch
normalization to ensure that our inputs are zero-centered with
normalized variance. Following the results of Santurkar et
al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], we also noticed faster training times as a result of
these smoother gradients.
        </p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>4. RESULTS</title>
      <p>In this section, we evaluate NERE from a qualitative and
quantitative perspective. We compare our model against a
baseline matrix factorization approach, and analyze several
variations of the model for the purposes of introspection.</p>
      <p>Table 3 shows the qualitative results of our
recommendation system. The studied column shows the set that the
user studied, while the recommendation column shows the
set that was recommended for the user to study. For this
particular recommendation, our system understands that a
student had been learning about discussing time (in terms
of days of the week) in French, and recommended a
corresponding set about months of the year. This shows that the
model understands that the user is learning about temporal
relations. On a higher level, this demonstrates a level of
understanding of both the content that a user desires to learn
and the difficulty at which they desire to learn it.</p>
      <p>We use two proxies to assess model accuracy: recall@100
and R<sup>2</sup>. In order to compute recall@100, we take the 100
nearest neighbors of our output embedding, and check if the
set that the learner studied at timestep Tn+1 is in the set
of nearest 100 neighbors. If it is, we mark that
recommendation as correct; otherwise, it is incorrect. We use the 100
nearest neighbors due to the density of our embedding space,
as well as the fact that many of the sets in our embedding
space are near-duplicates due to a lack of canonicalization.</p>
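      <p>The recall@100 proxy just described can be sketched as follows (an exact nearest-neighbor search stands in for the approximate search used in practice):</p>
      <preformat>
```python
import numpy as np

def recall_at_k(predicted, set_embeddings, true_indices, k=100):
    """Fraction of predictions whose true next-studied set falls among
    the k nearest neighbors of the predicted embedding."""
    hits = 0
    for pred, true_idx in zip(predicted, true_indices):
        dists = np.linalg.norm(set_embeddings - pred, axis=1)
        hits += int(true_idx in np.argsort(dists)[:k])
    return hits / len(true_indices)
```
      </preformat>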
      <p>We use R<sup>2</sup> to assess whether the predictions in the
embedding space match the actual distribution; this serves as a
sanity check to ensure that our model's output distribution
is correlated with the expected distribution.</p>
      <p>
        We compare the performance of NERE against that of
TensorRec [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], a library written by James Kirk that uses
the TensorFlow API. TensorRec accepts a user matrix, item
matrix, and interactions matrix as inputs, and produces
a predictions matrix as an output. For the user matrix,
we provide the user metadata matrix that NERE is
given. We concatenate the set vectors and set metadata,
and this represents the item matrix. Lastly, we create an
interactions matrix of dimensions (|USERS|, |SETS|), where
entry (i, j) = 1 if user i studied set j.
      </p>
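      <p>The interactions matrix fed to the TensorRec baseline can be sketched as a binary matrix build (indices and density are illustrative):</p>
      <preformat>
```python
import numpy as np

def interactions_matrix(study_events, num_users, num_sets):
    """Binary matrix of shape (|USERS|, |SETS|) with entry (i, j) = 1
    when user i studied set j."""
    M = np.zeros((num_users, num_sets), dtype=np.int8)
    for i, j in study_events:
        M[i, j] = 1
    return M
```
      </preformat>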
      <p>We trained TensorRec on this dataset, and it obtained
a recall@100 of 0.12 after convergence. We believe this
confirms a core difference between a matrix
factorization approach and our approach: even after
extensive customization, an approach based on temporal data
is much more likely to provide quality recommendations for
educational content.</p>
      <sec id="sec-9-1">
        <title>4.0.2 Input Sequence Length</title>
        <p>Our NERE model is based on the assumption that a
user is purposefully selecting sets to study that are topically
related to a greater theme. This permits us to also believe
that the sets are temporally related, and therefore enables
us to use a recurrent neural network.</p>
        <p>Figure 3 validates this assumption by comparing model
performance against the input sequence length. We see that
the R<sup>2</sup> score slowly converges, but that the recall@100
metric steadily increases until our fourth input sequence. This
implies that there may be performance advantages to be
obtained by increasing the length of the input sequence past
four. However, since we begin to lose a significant number
of users in our dataset if we extend beyond five timesteps,
we risk creating a model that will not generalize to our
entire userbase. As a result, we believe that five timesteps is a
good balance between desired accuracy and generalizability.</p>
      </sec>
      <sec id="sec-9-2">
        <title>4.0.3 Where’s the Attention</title>
        <p>One popular use of attention in deep neural networks is
to visualize the model's understanding of the input. Figure
3 visualizes how the model pays attention to the input, as
well as how it learns the attention vector over time. Brighter
rectangles indicate that more attention is being placed on
those blocks.</p>
        <p>These results give incredible insight into the decision
process of the model. We can see that at the beginning of the
input, the model focuses on the metadata; aspects such as
term and definition language are deemed incredibly
important. However, as time goes on, the attention shifts from set
and user metadata towards content-based features. We see
that the attention in the very last timestep shifts towards
the content, which aligns with our expectations.</p>
      </sec>
      <sec id="sec-9-3">
        <title>4.0.4 A Purely Content/Collaborative Approach</title>
        <p>Next, we try to understand how important our features
are to the model.</p>
        <p>We train and test two variations, with and without the
128-dimensional content vectors, to see how important a
content-based approach is for NERE. The impacts of these
variations are demonstrated in Table 4.</p>
        <table-wrap>
          <label>Table 4</label>
          <table>
            <thead>
              <tr><th/><th>Both</th><th>Content</th><th>Metadata</th></tr>
            </thead>
            <tbody>
              <tr><td>R<sup>2</sup></td><td>0.81</td><td>0.78</td><td>0.55</td></tr>
              <tr><td>Recall@100</td><td>0.54</td><td>0.38</td><td>0.001</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-9-3-4">
          <p>This shows that a hybrid (both collaborative and
content-based) approach is clearly superior to either one independently. It
is important to notice that a content-based approach will
obtain a high R<sup>2</sup> score, since it is easy for the model to
learn the underlying distribution, but will not recommend
the appropriate set. This demonstrates the importance of
the various collaborative features that we explicitly include.</p>
          <p>For example, the nearest neighbor of a set whose term
and definition languages are Spanish is actually a set
whose term and definition languages are German.
However, the model will continue to recommend sets with term
and definition languages in Spanish, since it has learned this
from the user's prior history. This speaks to the importance
of collaborative features in NERE.</p>
          <p>On the whole, we have shown that NERE provides
quality recommendations with which we can provide a deeply
personalized learning experience, and believe these results
exceed expectations for our application.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>5. CONCLUSION &amp; FUTURE WORK</title>
      <p>In this work, we have proposed Neural Educational
Recommendation Engine (NERE) to address the problem of
personalized sequential recommendation in the Technology
Enabled Learning (TEL) domain. By leveraging both
content-based and collaborative features, our model can capture
temporal trends in a user's history, and provide
recommendations as to what they should learn next. By
incorporating features such as attention and bidirectionality into our
model, we were able to achieve a state of the art recall@100
score of 0.54. Moreover, we have performed an analysis of
our model and have shown that it outperforms both a
standalone content-based and collaborative approach. Lastly, we
have shown that our model is learning from both the user
and set metadata, in addition to content, by visualizing the
attention mechanism.</p>
      <p>As to future work, we believe there is significant work left
to be done in ranking the suggestions; there are significantly
better ways to choose sets from a candidate pool than to
recommend the N closest neighbors. Furthermore, we believe
that an attempt at canonicalizing similar sets would increase
the recall@100 metric, and should be explored.</p>
    </sec>
    <sec id="sec-11">
      <title>ACKNOWLEDGEMENTS</title>
      <p>First and foremost, I would like to thank my mentors
Dustin Stansbury and Shane Mooney for the exceptional
support and mentorship throughout this project. Both of
them were supportive, answered my many questions, and
were quite open to letting me explore. Shane, thank you
for providing much needed practical wisdom, for reviewing
countless pull requests, and for providing much needed
commentary on this paper. Dustin, thank you for the incredible
knowledge about all things machine learning. This project
wouldn't have been possible without you two.</p>
      <p>I would also like to acknowledge Alex Pinchuk and Shaun
Mitschrich for providing endless platform support
throughout this project, including honoring my numerous requests
for more compute.</p>
      <p>Lastly, I would like to acknowledge the fabulous
Quizlet team who provided incredible companionship
throughout this summer, as well as my parents for supporting me
throughout this process.</p>
      <p>Keep on learning!</p>
    </sec>
    <sec id="sec-12">
      <title>REFERENCES</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Brozovsky</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Petricek</surname>
          </string-name>
          .
          <source>Recommender System for Online Dating Service</source>
          .
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>van Merrienboer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <source>On the Properties of Neural Machine Translation: Encoder-Decoder Approaches</source>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <source>Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling</source>
          , pages 1-9,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Covington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Adams</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Sargin</surname>
          </string-name>
          .
          <article-title>Deep Neural Networks for YouTube Recommendations</article-title>
          .
          <source>Proceedings of the 10th ACM Conference on Recommender Systems - RecSys '16</source>
          , pages
          <fpage>191</fpage>
          -
          <lpage>198</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Moses</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Efficient k-nearest neighbor graph construction for generic similarity measures</article-title>
          .
          <source>Proceedings of the 20th international conference on World wide web - WWW '11</source>
          , page 577,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Gomez-Uribe</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Hunt</surname>
          </string-name>
          .
          <article-title>The Netflix Recommender System</article-title>
          .
          <source>ACM Transactions on Management Information Systems</source>
          ,
          <volume>6</volume>
          (
          <issue>4</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Meller</surname>
          </string-name>
          .
          <article-title>SimilarWeb Digital Visionary Awards: 2015</article-title>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>Long Short-Term Memory</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kirk</surname>
          </string-name>
          .
          <article-title>TensorRec: A Recommendation Engine Framework in TensorFlow</article-title>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Linden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Smith</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>York</surname>
          </string-name>
          .
          <article-title>Amazon.com recommendations: Item-to-item collaborative filtering</article-title>
          .
          <source>IEEE Internet Computing</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ):
          <fpage>76</fpage>
          -
          <lpage>80</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lopez-Sanchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Herrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Arrieta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Corchado</surname>
          </string-name>
          .
          <article-title>Hybridizing metric learning and case-based reasoning for adaptable clickbait detection</article-title>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          In
          <source>NIPS</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Santurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsipras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Madry</surname>
          </string-name>
          .
          <article-title>How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift)</article-title>
          .
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sato</surname>
          </string-name>
          .
          <article-title>An Inside Look at Google BigQuery</article-title>
          . White Paper, Google Inc,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuster</surname>
          </string-name>
          and
          <string-name>
            <given-names>K. K.</given-names>
            <surname>Paliwal</surname>
          </string-name>
          .
          <article-title>Bidirectional recurrent neural networks</article-title>
          .
          <source>IEEE Transactions on Signal Processing</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Takamori</surname>
          </string-name>
          .
          <article-title>Apache Airflow</article-title>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Z. A.</given-names>
            <surname>Pardos</surname>
          </string-name>
          .
          <article-title>Personalized Behavior Recommendation</article-title>
          .
          <source>Adjunct Publication of the 25th Conference on User Modeling, Adaptation and Personalization - UMAP '17</source>
          , (July):
          <fpage>165</fpage>
          -
          <lpage>170</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>