Privacy-Aware Personalized Entity Representations for
                        Improved User Understanding
                   Levi Melnick Hussein Elmessilhy Vassilis Polychronopoulos Gilsinia Lopez
                            Yuancheng Tu Omar Zia Khan Ye-Yi Wang Chris Quirk
                                                              Microsoft
                           {lemeln,huahme,vapolych,gilopez,yuantu,omkhan,yeyiwang,chrisq}@microsoft.com

ABSTRACT
Representation learning has transformed the field of machine learning. Advances like ImageNet, word2vec, and BERT demonstrate the power of pre-trained representations to accelerate model training. The effectiveness of these techniques derives from their ability to represent words, sentences, and images in context. Other entity types, such as people and topics, are crucial sources of context in enterprise use-cases, including organization, recommendation, and discovery of vast streams of information. But learning representations for these entities from private data aggregated across user shards carries the risk of privacy breaches. Personalizing representations by conditioning them on a single user’s content eliminates privacy risks while providing a rich source of context that can change the interpretation of words, people, documents, groups, and other entities commonly encountered in workplace data. In this paper, we explore methods that embed user-conditioned representations of people, key phrases, and emails into a shared vector space based on an individual user’s emails. We evaluate these representations on a suite of representative communication inference tasks using both a public email repository and live user data from an enterprise. We demonstrate that our privacy-preserving light-weight unsupervised representations rival supervised approaches. When used to augment supervised approaches, these representations are competitive with deep-learned multi-task models based on pre-trained representations.

1    INTRODUCTION
Pre-trained embeddings are a crucial technique in machine learning applications, especially when task-specific training data is scarce. For instance, groundbreaking work in image captioning was enabled by reusing the penultimate layer of an object recognition system to summarize the content of an image [24]. More recently, contextualized embeddings are setting the state of the art in a range of natural language processing tasks [12]. Training models to extract reusable representations from data is now an obvious investment. The next key research question is which context to leverage.

Our research is situated in the area of User Understanding: organizing the information, documents, and communications that are available to each user within an organization. Users now commonly retain huge mailboxes of written communication; members of larger organizations also have access to large repositories of access-controlled documents that are not publicly available. Our goals are to help each user find, classify, and act upon these growing information stores, then to acquire and organize information, including facts and relationships among these entities. A crucial enabling step is to build reusable representations of this information.

Most representation learning uses large, publicly available document stores to build generic embeddings. We believe there is also great value in user-conditioned representations: representations of phrases and contacts for each user learned on the information uniquely available to that user. First, building user-conditioned representations provides a huge amount of context. Often when there are ambiguous or overloaded concepts, the key people surrounding their usage can disambiguate. Furthermore, a given user may extend the meanings of a given concept as they document and communicate new ideas. Perhaps most importantly, training a model based on only the communications and documents available to a given user provides a clear and intuitive notion of privacy. Whenever we train on data beyond any user’s normal visibility, there is some potential for capturing and surfacing information outside their view. Differential Privacy helps limit the exposure of any individual user, but preventing leakage across groups is more difficult. For instance, certain privileged information may be discussed heavily by many members of an administrative board, yet this information should not be shared broadly across the whole organization. When training a user’s model on only data that the user can see, the possibility for leaking information is removed. From the perspectives of both leveraging a crucial signal and maintaining user privacy and trust, user-conditioned representations hold great promise.

User-conditioned learning comes at a cost. Data density decreases dramatically. State-of-the-art deep learned representations typically train on billions of tokens [12], whereas an individual user’s inbox may only have a few thousand emails. Thus, we explore shallower personalized approaches with lower sample complexity (though shallow models can be mixed with deep generic models for empirical gains [10]). Furthermore, training must be performed for every user separately within the organization; in our case, this entails separate training runs for hundreds of millions of users. Because the information available to the user is constantly changing, maintaining fresh representations is also a challenge.

Even as computation and storage become cheaper, the overhead of maintaining user-conditioned models remains tractable only if the models are light-weight. Furthermore, we focus on task-agnostic representations that benefit a range of scenarios, amortizing the cost of computation. Finally, using models trained only on one user’s data benefits privacy, which is an increasing concern for organizations and individuals.

PrivateNLP 2020, Feb 7, 2020, Houston, Texas
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1.1    Contributions
We present the first efforts in building per-user representations: content-based representations that embed disparate entities, including contacts and key phrases, into a common vector space. These entity representations are different for each user: the same key phrase and contact may have very different representations across two users depending on their context. We focus on slow-changing entities like contacts and key phrases to minimize the impact of delayed retraining: although one’s impression of their collaborators may shift over years, months, or perhaps weeks, a representation that is a few days old is still useful. To embed rapidly arriving and changing items such as documents and emails, we present approaches that assemble representations of rapidly changing entities from their content, including related contacts and key phrases.

We evaluate these representations on a range of downstream tasks, including action-prediction and content-prediction. Simple, unsupervised approaches, especially non-negative matrix factorization (NMF) [25], produce substantial improvements in accuracy, outperforming task-specific and multi-task neural network approaches.

We compare user-conditioned representations to representations learned at the organization-level, where data from multiple users are combined together into a larger undifferentiated store. User-conditioned representations mostly outperform the organization-level approaches, despite decreased data density, presumably because the additional context provides helpful signals to models. Furthermore, user-conditioned representations sidestep issues related to privacy preservation by not mixing data across user mailboxes.

2     RELATED WORK
Email mining [35] has been widely studied from different angles both for content and action classification. Spam detection has received considerable attention both from a content identification and filtering view [8] as well as from a process perspective [16]. Folder prediction is another task that can help better organize incoming emails [22]. Email content has also been used for social graph analysis to learn associations between people, both for recipient prediction [4] and sender prediction [17]. Action prediction tasks that have been considered in the context of emails include reply prediction [40], attachment prediction [15, 37], and generic email action prediction [6]. In this paper, we use recipient prediction, sender prediction, and reply prediction as representative tasks to evaluate the quality of our learned representations. All prior work on these and similar tasks has relied on per-task feature engineering and supervised model training. We show how entity representations generated in a task-agnostic manner can be used both in an unsupervised and in a supervised setting for these tasks.

Entity representations have been used extensively to provide personalized recommendations. Most such models build global representations of users and items in the same latent space and then determine the similarity between the user and item through cosine similarity. The user embeddings can be built in a collaborative filtering setting by leveraging a user’s past actions such as clicks [30], structured attributes items interacted with, ratings offered on past items [5], or even past search queries [2]. An extension to such approaches is to combine embeddings of words or phrases with other types of data [33], such as embeddings of users [27] or their previously favored items [1]. Our entity representations also embed phrases with contacts, but they are task-agnostic.

Personalized language models have shown benefits in speech recognition [26], dialog [23], information retrieval [38], and collaborative filtering [19]. These approaches model the user, but the representations are not generated on a per-user level; instead, all users share the same representation for an item or phrase. In our approach, the same item (phrase or contact) will have a different representation according to each user, since the entity representations are generated per user only considering the data available to that user. Nawaz et al. [29] present a technique to perform social network analysis to identify similar communities of contacts within a user’s own data, but their approach is limited to a single task. Yoo et al. [41] describe an approach for obtaining representations of emails using personalized network connectivity features and contacts. Their representations are used as inputs to machine learning models to predict the importance of emails. In our approach, key phrases, contacts, and emails are all embedded in the same space. Thus, we can use the task-agnostic representations for a variety of tasks and obtain similarity scores for any pair of entities.

Starspace [39] introduces a neural-based embedding method that maps different types of entities (graphs, words, sentences, documents, etc.) to the same space. While the entities are embedded in the same space, like our work, their training uses global information, and the loss function on which the network is trained is task-dependent. Our approach allows reusing the same representations obtained using local data across a variety of tasks.

Our evaluation tasks bear some similarity to the evaluation of knowledge base completion through embeddings of text and entities [36]. However, our entities are not curated knowledge base entities; they are phrases and people known to a particular user.

In query expansion, locally trained embeddings can outperform global or pre-trained embeddings [13, 32] by incorporating more relevant local context. We exploit a similar insight in training only on the user’s own data and directly incorporating their context. Amer et al. [3] present an approach that trains a per-user representation, though their per-user embeddings perform worse than global or pre-trained embeddings. In this paper, we demonstrate methods that train per-user representations in a task-agnostic manner and that can also outperform global representations.

3    ENTITY REPRESENTATIONS
We developed representations for three entity types: key phrases, contacts, and emails. Key phrases (typically noun phrases) [18] can appear anywhere in the body or subject of an email. Restricting to extracted key phrases limits the total number of entities for which representations must be learned. By “contacts,” we refer to individual email addresses that appear in the From, To, or CC fields of an email. For slow-changing entities like key phrases and contacts, we can periodically regenerate a stored representation. We represent fast-changing entities such as emails with light-weight compositions of pre-trained entity embeddings (contact and key phrase embeddings) to minimize computational expense.

Since the median inbox size is small (in both our data sets it is around seven thousand emails), there is not enough data per user to train useful deep learned representations. Indeed, our early attempts to train user-conditioned word2vec [28] embeddings yielded poor results and are not reported here. Therefore we only considered approaches that were likely to perform well at low data density.
[Figure 1 diagram: (1) the original corpus as an emails × (words + contacts + key phrases) TF-IDF matrix; (2) aggregation of information involving all contacts and key phrases into a (contacts + key phrases) × (words + contacts + key phrases) concatenated document matrix; (3) dimensionality reduction of the concatenated document matrix into factors W (entity representation) and H; (4) computation of the pseudo-inverse H+, a map from documents into the embedding space.]

Figure 1: Process of creating concatenated documents to represent contacts and key phrases given the count matrix from an email corpus. We also demonstrate how this matrix can be factorized into a low rank approximation to encourage inference over sparse items. The left matrix $W$ can be interpreted as entity representations. Furthermore, we can derive a mapping from the words, phrases, and contacts in an email into this representation space using the pseudoinverse $H^+$ of the right matrix $H$. Other low rank approximation and composition approaches are explored as well.

3.1    Key Phrase and Contact Representations
We compute unsupervised entity representations for contacts and key phrases by associating them with concatenated documents. These concatenated documents are assembled from the user’s original documents; the emails in our experiments have From, To, CC, Body, and Subject fields. Given a particular entity $e$, its concatenated document $d_e$ is the concatenation of every email $m$ in a user’s inbox such that $e$ appears in any of those fields. This concatenation is done on each field $f$ independently: every new concatenated document $d_e$ will have a corresponding field $d_{e,f}$ for each field $f$ in the original document. We use $\bigoplus$ to denote concatenation.

\[ d_{e,f} = \bigoplus_{m \in M_e} m_f, \quad \text{where } M_e = \{ m : \exists f \text{ such that } e \in m_f \} \qquad (1) \]

Stop words on the Scikit-learn stopword list are removed, as are terms that appear in more than 30% or fewer than 0.25% of training emails. We then generate a sparse numerical entity-by-term matrix, using TF-IDF for most methods or just a TF matrix in the case of LDA. Initially one matrix is computed for each field $f$ using the relevant portion of the concatenated document collection $D_f = \{d_{e,f}\}_e$. Each matrix is scaled according to a weighting factor $w_f$ to balance its contributions, and finally these matrices are concatenated to form a single matrix $T$.

\[ T = \bigoplus_{f} \sqrt{w_f} \cdot \text{term-matrix}(D_f), \quad \text{where } \sum_{f} w_f = 1 \qquad (2) \]

The weights of the different email fields are treated as hyperparameters and tuned empirically to perform well on the evaluation tasks. We found that weights of 0.4, 0.3, 0.2, 0.05, and 0.05 for the Body, Subject, From, To, and CC fields worked well. The rows of $T$ are sparse representations of entities, which serve as a simple and safe baseline.
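As a rough illustration of Equations 1 and 2, the sketch below builds per-field concatenated documents for a handful of entities and assembles one weighted entity-by-term matrix. It is a minimal standard-library sketch, not the implementation used in the experiments. The toy mailbox, the substring test for "entity appears in field", the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for illustration. Stop-word removal and the document-frequency cutoffs described above are omitted. Only the field weights come from the text.

```python
import math
from collections import Counter

# Hypothetical toy mailbox: the field names follow the text, the contents are invented.
EMAILS = [
    {"From": "a@x.com", "To": "me@x.com", "CC": "",
     "Subject": "budget review", "Body": "please send the budget review notes"},
    {"From": "b@x.com", "To": "me@x.com", "CC": "a@x.com",
     "Subject": "launch plan", "Body": "the launch plan needs a budget review"},
    {"From": "b@x.com", "To": "a@x.com", "CC": "me@x.com",
     "Subject": "launch party", "Body": "party after the launch"},
]
# Field weights reported in the text; they sum to 1.
FIELD_WEIGHTS = {"Body": 0.4, "Subject": 0.3, "From": 0.2, "To": 0.05, "CC": 0.05}
# Hypothetical entities: two contacts and two key phrases.
ENTITIES = ["a@x.com", "b@x.com", "budget review", "launch plan"]


def concatenated_documents(emails, entities, fields):
    """Equation 1: d_{e,f} joins field f of every email that mentions e in any field."""
    docs = {}
    for e in entities:
        # Assumption: a plain substring test stands in for entity matching.
        matching = [m for m in emails if any(e in m[f] for f in fields)]
        docs[e] = {f: " ".join(m[f] for m in matching) for f in fields}
    return docs


def tfidf_rows(texts):
    """TF-IDF with a smoothed IDF over whitespace tokens (an illustrative choice)."""
    counts = [Counter(t.split()) for t in texts]
    df = Counter(term for c in counts for term in c)
    n = len(texts)
    return [
        {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in c.items()}
        for c in counts
    ]


def entity_term_matrix(emails, entities, weights):
    """Equation 2: scale each per-field matrix by sqrt(w_f), then concatenate columns."""
    fields = list(weights)
    docs = concatenated_documents(emails, entities, fields)
    rows = {e: {} for e in entities}
    for f in fields:
        field_rows = tfidf_rows([docs[e][f] for e in entities])
        scale = math.sqrt(weights[f])
        for e, row in zip(entities, field_rows):
            for term, value in row.items():
                # A column is identified by (field, term), so fields stay side by side.
                rows[e][(f, term)] = scale * value
    return rows


# Sparse rows of T, one dictionary per entity.
T = entity_term_matrix(EMAILS, ENTITIES, FIELD_WEIGHTS)
```

Scaling by the square root of each weight means that a dot product between two rows of `T` weights each field's contribution by `w_f` itself. That is one plausible reason for the square root in Equation 2.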
We explored LDA, LSA, and NMF as a means of encouraging softer
matching through dimensionality reduction.                              constraining the low-rank matrices to be positive and adding a reg-
   3.1.1 TF-IDF. Our baseline representation technique is sparse        ularization term [25]. Specifically, given an input matrix T ∈ Rm×n ,
unigram TF-IDF vectors produced from the concatenated docu-             we try to find matrices W ∈ Rm×d and H ∈ Rd ×n to minimize
ments.
   3.1.2 Latent Dirichlet Allocation (LDA). Latent topic models us-                                                        |T − W H | + λ (|W | + |H |)                              (3)
ing LDA [7] over the term frequency matrix of the concatenated
documents (not the TF-IDF matrix) learn a mixture of topics for         where λ is a regularization weight and | · | is the Frobenius norm.
each document. These learned vectors can act as entity represen-        The W matrix serves as a representation for the entities. We ef-
tations. We can vary the number of latent topics to determine the       ficiently compute NMF through the Hierarchical Least Squares
dimensionality of the resulting embeddings.                             algorithm [21].
   3.1.3 Latent Semantic Analysis (LSA). A classic method for re-
ducing sparsity, LSA [11] builds a low-rank approximation of a TF-      3.2                           Email Representations
IDF matrix T using the singular value decomposition: T = U ΣV T .       The vocabulary of key phrases and contacts in one’s mailbox is
   3.1.4 Non-negative Matrix Factorization (NMF). The SVD re-           likely to grow slowly, and their meanings and relationships will also
construction has a few problems: the values of the matrix may be        evolve gradually. By comparison, many new emails arrive every
positive or negative, and there is no explicit regularization term.     day, so the “vocabulary” of email entities is constantly increasing.
Together, these issues may lead to strange or divergent weights,        Thus, while it is possible to train representations for email in the
especially when the data is difficult to model with lower rank.         same way that we do for key phrases and contacts, updating email
Non-negative matrix factorization (NMF) addresses these issues by       representations on an ongoing basis would imply vast storage and
PrivateNLP 2020, Feb 7, 2020, Houston, Texas                                               Melnick, Elmessilhy, Polychronopoulos, Lopez, Tu, Khan, Wang, and Quirk


   From:            terrie.james@enron.com                                                                                  |V|
   Subject:         Enron Prize                            Replied to: False                                                                   300
   To:              kenneth.lay@enron.com                                                                                                                      100
   Cc:              cindy.olson@enron.com, kelly.kimberly@enron.com,                                                                                                               |Y|


                                                                                                                                                     Batchnorm
                                                                                                                                       Embed
                    karen.denne@enron.com, christie.patrick@enron.com,


                                                                                                                                                       RELU
                                                                                                                   Email
                    rosalee.fleming@enron.com                                                        (a)                                                               Dropout,σ
   Date:            Mon, 12 Nov 2001 16:51:33 -0800 (PST)

   Ken,
                                                                                                                            500
   I wanted to let you know that I spoke with W.O. King at the Baker Institute earlier
   today. Everything is ready for tomorrow's activities.


                                                                                                           representation
                                                                                                               Email
   There will be a great deal of media coverage for the event. Many media outlets will
   be accessing the live video feed of Chairman Greenspan's speech. However, the
                                                                                                                                               500               300
   media will not be able to interview any participants directly. Chairman Greenspan


                                                                                                                                  Batchnorm
   will only answer written questions submitted by the audience (including media), and
                                                                                                     (b)                                                                             1


                                                                                                                                                     Dropout
                                                                                                                                    RELU


                                                                                                                                                      RELU
   those questions will be vetted twice. I will be receiving the media advisory that has                                    500
   been distributed and the most current media attendee list from the Baker Institute                                                                                   Dropout,σ
   tomorrow. I will provide that to you as early as possible.


                                                                                                           Target Entity
                                                                                                            Candidate
   You probably know that, because of his position, Chairman Greenspan can not
  accept 2:
Figure   theExample
            prize itself, but only the
                           email       "honor"
                                    with       of being, named
                                           sender              an Enron, Prize
                                                          recipients       keyrecipient.
                                                                                phrases ,
   For that reason, the Enron Prize will not be present on stage during the ceremony.
and replied-to annotations. Each task is constructed by obscuring
   Karen Denne, Christie Patrick and I will be attending. Please let me know if you
a relevant   entity, then reconstructing it given the remaining context.
   have any questions or need additional information prior to the event.                                                    |V|
                                                                                                                                               300                            σ          |Y1|
   See you there,                                                                                                                                              100


computation requirements. So we handle emails differently, computing representations on demand through compositions of other entity representations. In this paper, we explored four different email composition models: Centroid, Pointwise Max, Pseudoinverse, and the combination of Centroid and Pseudoinverse.

Figure 3: Task-specific neural network architectures: (a) multiclass model for predicting which entity is present in a given email; (b) binary matching model for predicting whether a given entity is present in a given email; (c) multi-task multiclass model jointly trained on all evaluation tasks.

   3.2.1 Centroid. One simple email representation is the average of the representations of all key phrases and contacts in an email.
   3.2.2 Pointwise Max. Another commonly used pooling operation is max: we retain the largest value along each dimension. This approach increases the sensitivity to strongly-weighted features in the underlying key phrase and contact representations.
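
As a minimal sketch of the two pooling compositions in Sections 3.2.1 and 3.2.2, assume each key phrase or contact representation is a plain list of floats of equal length. The function names and the toy vectors below are illustrative assumptions, not the implementation used in our experiments.

```python
def centroid(entity_vectors):
    """Average the entity vectors dimension by dimension (Section 3.2.1)."""
    count = len(entity_vectors)
    return [sum(column) / count for column in zip(*entity_vectors)]


def pointwise_max(entity_vectors):
    """Keep the largest value along each dimension (Section 3.2.2)."""
    return [max(column) for column in zip(*entity_vectors)]


# Hypothetical email containing three entities in a 3-dimensional space.
entities = [
    [1.0, 0.0, 4.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 2.0],
]

email_centroid = centroid(entities)   # [2.0, 2.0, 2.0]
email_max = pointwise_max(entities)   # [3.0, 4.0, 4.0]
```

In this toy case the max keeps the strongest value per dimension (4.0 in the second and third dimensions), while the centroid smooths those peaks away, which illustrates the increased sensitivity to strongly-weighted features.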
   3.2.3 Pseudoinverse. The H matrix from Equation 3 can serve as a map from the low-rank concept space into the word/entity space. Although H is not a square matrix and hence not invertible, the Moore-Penrose pseudoinverse of H, namely H⁺, can act as a map from email content into the entity representation space. We multiply the TF-IDF vector associated with a given email by H⁺ to project into the entity representation space. Unlike the previous two models, this has the benefit of including information from non-key phrase unigrams from the email.
   3.2.4 Centroid + Pseudoinverse. Centroid and pseudoinverse representations are summed to combine the benefits of each.

4 EVALUATION METHODOLOGY

4.1 Evaluation Tasks
We evaluate entity representations according to their performance on four email mining tasks: sender prediction, recipient prediction, related key phrase prediction, and reply prediction. The first three tasks are content prediction tasks, whereas in reply prediction we use the email content to predict a user action.
   Content prediction tasks are formulated as association tasks. We remove a target entity from an email and randomly select nineteen distractor entities from the user’s inbox not already present in the email. We use the cosine similarity between the email representation and the twenty candidate target entity representations to predict the true target. Reply prediction is treated as a binary classification problem, using email representations as input features. Entity representation methods that yield more accurate predictions are considered superior.
   These tasks readily suggest real-life applications. Recipient recommendation is already a standard feature in many email clients. Similarly, an email client may predict whether an email from an unfamiliar address comes from a known sender and prompt the user to add the new address to that sender’s contact information. Predicting latent associations between emails and key phrases enables automatic topic tagging and foldering. Finally, an email client may use reply prediction to identify important emails to which an inbox owner has not yet responded and remind the user to reply.

   4.1.1 Task-Specific Model Architectures. We aim to construct task-agnostic user-conditioned representations: they should be useful across a variety of tasks without having to be tuned to each one separately. While this makes the representations reusable and reduces computational expense, separate models trained on each specific task often perform better. To evaluate this tradeoff, we compare the unsupervised similarity-based method described above to supervised task-specific baseline models trained on each of the
Privacy-Aware Personalized Entity Representations for Improved User Understanding                  PrivateNLP 2020, Feb 7, 2020, Houston, Texas


association tasks. We also evaluate how well the user-conditioned representations perform as feature inputs to task-specific models, since their utility as feature representations is a key consideration.

                    Avocado (55 users)           Enterprise (53 users)
                    Max      Min     Average     Max      Min     Average
Emails/User         19,000   3,561   7,887       17,490   2,872   8,451
Phrases/User        9,632    3,324   5,308       8,137    3,433   6,772
Contacts/User       376      95      210         2,375    357     1,431
Reply Rate          0.34     0.01    0.14        0.60     0.01    0.19
Table 1: Email statistics for Avocado and enterprise users.

    To train a task-specific model for sender, recipient, or key phrase prediction, we reformulate these association tasks as classification problems. In each case, we train the classifier to predict the target entity using its email representation. As above, we remove a target entity from an email and select nineteen distractors. Instead of cosine similarity, we use the trained classifier to score the twenty candidate entities and predict the one with the highest score.
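
The association protocol above can be sketched as follows: sample nineteen distractors, score the twenty candidates, and predict the highest-scoring one. This is a minimal sketch under stated assumptions. Representations are plain lists of floats, the helper names (`sample_candidates`, `predict_target`, `accuracy_and_average_recall`) are hypothetical, and the `score` argument stands in for either cosine similarity or a trained classifier. The last helper computes the two association metrics defined in Section 4.2.

```python
import math
import random
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sample_candidates(target_id, entity_vecs, present_ids, n_distractors=19, rng=random):
    """Return the target plus distractors that are not already in the email."""
    pool = [e for e in entity_vecs if e not in present_ids and e != target_id]
    chosen = rng.sample(pool, n_distractors) + [target_id]
    return {e: entity_vecs[e] for e in chosen}


def predict_target(email_vec, candidates, score=cosine):
    """Predict the candidate with the highest score against the email vector."""
    return max(candidates, key=lambda cid: score(email_vec, candidates[cid]))


def accuracy_and_average_recall(targets, predictions):
    """Accuracy over all examples, and recall averaged over distinct targets
    without weighting by target frequency."""
    hits, totals = defaultdict(int), defaultdict(int)
    for target, predicted in zip(targets, predictions):
        totals[target] += 1
        hits[target] += int(target == predicted)
    accuracy = sum(hits.values()) / len(targets)
    average_recall = sum(hits[t] / totals[t] for t in totals) / len(totals)
    return accuracy, average_recall


# A frequent target "a" is always found; a rare target "b" is missed.
acc, rec = accuracy_and_average_recall(["a", "a", "a", "b"], ["a", "a", "a", "c"])
# acc == 0.75, rec == 0.5
```

The toy call at the end shows why both metrics are reported: a method that only learns the frequent target still reaches 0.75 accuracy, but its average recall drops to 0.5.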
    We experimented with a variety of modeling techniques for both task-specific baseline models and task-specific models trained on entity representations. The best results consistently came from simple two-layer feed-forward neural classifiers with ReLU activations, a sigmoid output layer, batch normalization, and dropout [34], trained using cross-entropy loss and Adam [14]. However, each scenario achieved best results using slightly different task formulations and architectures.
    The baseline models were formulated as multiclass classifiers, as depicted in Figure 3a. Emails are represented as binary vectors with each element representing the presence or absence of a unigram or contact. These vectors index into a 300-dimensional embedding layer initialized with pre-trained GloVe vectors [31]; out-of-vocabulary items received random initializers. The embedding layer was also trained, allowing the model to learn representations for out-of-vocabulary terms. We experimented with two variants: one in which contacts were included as features (“Pre-trained + Contacts”) and one in which they were not (“Pre-trained”).
    For models trained on entity representations, shown in Figure 3b, we found the best results by treating the candidate target entity representations and the email representations as separate inputs. These 500-dimensional representations are passed through two dense layers of width 500 and 300, respectively, and a sigmoid output layer, which returns a score representing the likelihood that the input entity is indeed present in the input email. In Tables 4 and 5, “TF-IDF + NMF Centroid” and “TF-IDF + NMF Centroid + Pseudoinverse” model variants both share this architecture.
    We also considered a multitask model jointly trained on all four evaluation tasks. This model, shown in Figure 3c, is identical in its architecture and training to the task-specific baseline model in Figure 3a except that instead of one output layer it has |N| output layers corresponding to the |N| tasks. Relative loss weights were used to balance the training impact from each task since the tasks had varying numbers of training examples.

4.2 Evaluation Metrics
We measure our performance on the association tasks through accuracy (percentage of successful predictions) and average recall. A successful prediction is one where the target entity is scored highest among all candidates. Since there is one target and nineteen distractors, random guessing achieves an accuracy of 0.05.
   Accuracy can allow a small number of frequently occurring entities to have a disproportionate effect. For instance, in sender prediction the majority of emails may be from a small set of senders: performance on these senders will skew the results. Thus, we also report the average recall, an efficient measure for skewed distributions. To obtain the average recall, we calculate the recall for each possible target: the percentage of times it was successfully predicted. We then report the average recall over all targets without weighting by the frequency of the target. Together, accuracy and average recall provide a reliable measure of the association. If one method boosts accuracy by only learning about frequent targets, the average recall will be impacted negatively. Similarly, a reduced recall of the frequent targets will impact the accuracy.
   For reply prediction, we report the area under the precision-recall curve (PR-AUC), which is useful even when classes are imbalanced.

4.3 Evaluation Corpora
We evaluate our techniques on two separate repositories of emails, Avocado emails and live user emails from a large enterprise. The properties for each corpus are listed in Table 1. For the first repository, we use mailboxes from the Avocado Research Email Collection.¹ For the second dataset, we use live user email data from a real-world enterprise with thousands of users (called enterprise users from here on for brevity). These emails are encrypted and off-limits to human inspection. We randomly select a set of users who are related to each other by sampling from the same department. This increases the possibility of overlap between users and allows some shared context. This property will be helpful when we want to compare a global model versus user-conditioned representations.
¹ https://catalog.ldc.upenn.edu/LDC2015T03
   For both datasets, we filter out users with fewer than 3,500 or more than 20,000 emails. Users with more than 20,000 emails were outliers and, in the enterprise dataset, were likely to have many machine-generated emails, which can make the evaluation tasks easier. We set the minimum number of emails to 3,500 somewhat arbitrarily because in our enterprise scenario it is almost always possible to obtain this many for a given user by extending the date range. We plan to investigate the performance of user-conditioned representations produced from smaller inboxes in future work.

5 EXPERIMENTS
We show that user-conditioned entity representations outperform strong global model baselines. NMF applied to our version of TF-IDF matrices proves most effective among the methods surveyed for representing key phrases and contacts. The combination of centroid and pseudoinverse methods detailed in Section 3.2.4 works best for composing email representations. While on some tasks supervised task-specific baseline models achieved higher accuracy than entity representation similarity-based methods, the latter were competitive and had significantly better recall. Task-specific models trained


on entity representations also outperformed task-specific models trained on baseline features, demonstrating the entity representations’ value as feature inputs. Our results here also show that entity representations are competitive with multitask learning despite the fact that they are trained without knowledge of the downstream tasks. We discuss these results in the following subsections.

5.1 User-Conditioned Representations
Slow Changing Entities: Key Phrases and Contacts. We compare unsupervised methods for producing key phrase and contact representations in Table 2. For LDA, LSA, and NMF, we perform hyperparameter tuning on a single enterprise user and report results for all techniques with their best settings. Since the evaluation tasks require representations for email as well as key phrases and contacts, we use the Centroid email representation in each case to ensure a fair comparison. Predictions are based on cosine similarity.² NMF with regularization outperformed all other methods. Regularization leads to more effective representations for NMF; comparing unregularized NMF to LSA suggests that non-negativity is also a helpful bias. Some of the most substantial gains are in recall, especially when compared to sparse TF-IDF baselines.
² Reply prediction is difficult to evaluate in an unsupervised setting; hence, it is not reported here.

Method                    Sender        Recipient     Rel. Phrase
                          Acc   Rec     Acc   Rec     Acc   Rec
TF-IDF                    0.59  0.28    0.59  0.31    0.60  0.41
LDA                       0.53  0.37    0.51  0.41    0.49  0.42
LSA                       0.59  0.29    0.59  0.32    0.60  0.42
NMF unreg. (λ = 0)        0.61  0.37    0.59  0.40    0.60  0.46
NMF (λ = 0.0001)          0.62  0.40    0.62  0.44    0.66  0.53
Table 2: Evaluation task performance of key phrase and contact representation methods. In every case, the tasks use the Centroid method for composing email representations. Avocado data set.

    Composition for Fast Changing Entities: Email. Different compositional operations for representing email are explored in Table 3. Because NMF performed best across all tasks, we restrict our attention to these representations. The centroid method outperforms others on accuracy, though the pseudoinverse approach is the best for recall, presumably because it can incorporate information from unigrams in the represented email and not just the key phrases and contacts. A linear combination of centroid and pseudoinverse representations provides the best results for accuracy and almost matches pseudoinverse for recall.

Method                    Sender        Recipient     Rel. Phrase
                          Acc   Rec     Acc   Rec     Acc   Rec
Centroid                  0.62  0.40    0.62  0.44    0.66  0.53
Pointwise max             0.59  0.30    0.59  0.34    0.61  0.42
Pseudoinverse             0.49  0.56    0.47  0.56    0.55  0.58
Centroid + Pseudoinverse  0.64  0.53    0.62  0.54    0.66  0.53
Table 3: Evaluation task performance of email representation methods. In every case, the tasks use regularized NMF to produce key phrase and contact representations. Avocado data set.

5.2 Task-Specific Models
Task-Agnostic vs. Task-Specific. Unsupervised, task-agnostic approaches are versatile and reusable, but they may underperform relative to supervised models tuned to specific tasks. As described in Section 4.1.1, we explore this tradeoff by comparing the performance of entity representation similarity-based methods against task-specific baseline models trained on the evaluation tasks. For Avocado, we see that the accuracy is indeed better for the task-specific Pre-trained and Pre-trained + Contacts models than for the best representation methods (TF-IDF + NMF Centroid and TF-IDF + NMF Centroid + Pseudoinverse),³ as shown in Table 4. However, the TF-IDF + NMF Centroid + Pseudoinverse representations achieved significantly better recall for all three content prediction tasks and better accuracy in key phrase prediction, again indicating their ability to avoid over-optimizing for frequently occurring entities. This model produces even better results on the enterprise data set, where its accuracy is competitive with both of the Pre-trained models and its improvement in recall is even more dramatic. The higher number of contacts in the enterprise set enables better joint modeling with the content, allowing the entity representations to perform better in this setting. We can see that unsupervised entity representations are competitive with supervised baselines.
³ Our results for sender and recipient prediction through an unsupervised task-agnostic representation are in the same range as those reported by Graus et al. [17] (0.66).
   Entity Representations as Input Features. As our results suggest, user-conditioned entity representations are useful as input features to supervised models. To assess their value as feature representations, we compare task-specific models trained on entity representations with task-specific baselines, as described in Section 4.1.1. On Avocado, the entity representation-based task-specific models, TF-IDF + NMF Centroid and TF-IDF + NMF Centroid + Pseudoinverse, outperform (or in a few cases match) the baselines on every task and metric. We see similar results on enterprise data, except a marginally lower reply prediction PR-AUC with entity-based task-specific models. Comparing the Avocado and enterprise results, we can see that the performance on all tasks is much better on enterprise users. Our hypothesis is that the larger contact vocabulary in enterprise (1,431 contacts per user on average) compared to Avocado (average 210 contacts per user) makes sender and recipient tasks easier: the distractors are sampled from a larger pool of contacts, and are therefore less likely to be related and easier to screen out. In the case of reply prediction, we believe the higher PR-AUC stems from enterprise users that receive a higher volume of machine-generated emails, which have more predictable reply behavior.

5.3 User-Conditioned vs. Global Models
Each set of user-conditioned representations is trained on far less data than most representation learning techniques, but personalization is a powerful source of context. While our primary reason for focusing on user-conditioned entity representations is to avoid privacy leaks, we want to know how they compare against


                                                  Sender             Recipient          Related Phrase     Reply
Method                                            Accuracy  Recall   Accuracy  Recall   Accuracy  Recall   PR-AUC
Avocado users
  Unsupervised Similarity-Based Methods
    TF-IDF + NMF Centroid                         0.62      0.40     0.62      0.44     0.66      0.53     N/A
    TF-IDF + NMF Centroid + Pseudoinverse         0.64      0.53     0.62      0.54     0.67      0.60     N/A
  Supervised Task-Specific Models
    Pre-trained                                   0.72      0.38     0.67      0.31     0.59      0.36     0.21
    Pre-trained + Contacts                        0.74      0.42     0.71      0.35     0.60      0.37     0.24
    TF-IDF + NMF Centroid                         0.74      0.48     0.72      0.47     0.64      0.49     0.28
    TF-IDF + NMF Centroid + Pseudoinverse         0.74      0.49     0.73      0.47     0.67      0.52     0.28
  Supervised Multi-Task Models
    Pre-trained                                   0.73      0.47     0.69      0.42     0.59      0.35     0.28
    Pre-trained + Contacts                        0.78      0.51     0.75      0.46     0.59      0.35     0.30
Enterprise users
  Unsupervised Similarity-Based Methods
    TF-IDF + NMF Centroid                         0.81      0.73     0.86      0.79     0.69      0.60     N/A
    TF-IDF + NMF Centroid + Pseudoinverse         0.81      0.77     0.86      0.81     0.70      0.65     N/A
  Supervised Task-Specific Models
    Pre-trained + Contacts                        0.83      0.54     0.87      0.50     0.70      0.44     0.71
    TF-IDF + NMF Centroid                         0.87      0.68     0.91      0.71     0.72      0.56     0.69
    TF-IDF + NMF Centroid + Pseudoinverse         0.87      0.70     0.91      0.72     0.74      0.59     0.65
  Supervised Multi-Task Models
    Pre-trained + Contacts                        0.85      0.58     0.88      0.54     0.70      0.43     0.72
Table 4: Task-specific models trained using representations as features, for both enterprise and Avocado users.


non-privacy-aware “global” representations trained on data from every user in an organization. In Table 5 we see that user-conditioned representations are significantly better on all tasks across all metrics compared to the global versions of those representations. This indicates that, for these models, the local context of a user is more important than training on a larger data set. We see a similar trend with the Pre-trained + Contacts and Global Pre-trained + Contacts models, though the global variant outperforms the user-conditioned one in sender prediction on Avocado. On reply prediction, global models trained using representations perform similarly to Yang et al. [40] without any task-specific feature engineering.

Figure 4: Sender prediction accuracy vs. number of training emails for TF-IDF + NMF Centroid on Avocado.

5.4 Unsupervised vs. Multi-Task Approaches
Our primary focus has been unsupervised entity representation computation. An alternative approach is to induce representations in a multitask learning setting [9]. Multitask models often achieve better performance than separate models trained on the same tasks and, indeed, as seen in Table 4, the multitask model described in Section 4.1.1 outperforms task-specific models trained on the same Pre-trained + Contacts feature representation.
   On Avocado, the best multitask model achieves significantly better accuracy in sender and recipient prediction than Pre-trained and Pre-trained + Contacts methods; entity-based task-specific methods are still competitive on recall. We observe the same trend with enterprise, where multi-task models outperform task-specific Pre-trained + Contacts, though entity-based task-specific models outperform multi-task on all tasks and metrics except reply prediction PR-AUC.
   Thus the unsupervised methods presented here are competitive with multitask learning on recall despite the fact that they are trained without knowledge of the downstream tasks, and the task-specific entity-based models are competitive with the multi-task method on accuracy and better on recall.

5.5 The Effect of Data Size and Dimension
To explore the impact of data density, Figure 4 plots sender prediction accuracy using TF-IDF + NMF Centroid representations against the number of emails in a user’s mailbox. Accuracy does not vary substantially, though average recall improves: additional data benefits representing entities on average. Similar trends for other tasks and other models were observed.
   We plot the effect of varying dimension sizes for all tasks using the TF-IDF + NMF Centroid approach in Figure 5 for Avocado users. Representations of dimension 400 and 500 consistently achieve best results for both accuracy and recall.


                                                  Sender             Recipient          Related Phrase     Reply
Method                                            Accuracy  Recall   Accuracy  Recall   Accuracy  Recall   PR-AUC
Avocado users
  Unsupervised Similarity-Based Methods
    TF-IDF + NMF Centroid                         0.62      0.40     0.62      0.44     0.66      0.53     N/A
    TF-IDF + NMF Centroid + Pseudoinverse         0.64      0.53     0.62      0.54     0.67      0.60     N/A
    Global TF-IDF + NMF                           0.50      0.29     0.41      0.27     0.45      0.30     N/A
    Global TF-IDF + NMF Centroid + Pseudoinverse  0.55      0.40     0.40      0.34     0.43      0.37     N/A
  Supervised Task-Specific Models
    Pre-trained + Contacts                        0.74      0.42     0.71      0.35     0.60      0.37     0.24
    TF-IDF + NMF Centroid                         0.74      0.48     0.72      0.47     0.64      0.49     0.28
    TF-IDF + NMF Centroid + Pseudoinverse         0.74      0.49     0.73      0.47     0.67      0.52     0.28
    Global Pre-trained + Contacts                 0.77      0.63     0.65      0.48     0.58      0.34     0.21
    Global TF-IDF + NMF Centroid                  0.70      0.50     0.58      0.36     0.52      0.29     0.25
    Global TF-IDF + NMF Centroid + Pseudoinverse  0.71      0.52     0.57      0.37     0.53      0.30     0.19
Enterprise users
  Unsupervised Similarity-Based Methods
    TF-IDF + NMF Centroid                         0.81      0.73     0.86      0.79     0.69      0.60     N/A
    TF-IDF + NMF Centroid + Pseudoinverse         0.81      0.77     0.86      0.81     0.70      0.65     N/A
    Global TF-IDF + NMF                           0.50      0.49     0.45      0.30     0.45      0.41     N/A
    Global TF-IDF + NMF Centroid + Pseudoinverse  0.48      0.51     0.43      0.34     0.45      0.43     N/A
  Supervised Task-Specific Models
    Pre-trained + Contacts                        0.83      0.54     0.87      0.50     0.70      0.44     0.71
    TF-IDF + NMF Centroid                         0.87      0.68     0.91      0.71     0.72      0.56     0.69
    TF-IDF + NMF Centroid + Pseudoinverse         0.87      0.70     0.91      0.72     0.74      0.59     0.65
    Global Pre-trained + Contacts                 0.80      0.61     0.77      0.44     0.63      0.46     0.65
    Global TF-IDF + NMF Centroid                  0.61      0.49     0.47      0.25     0.49      0.34     0.67
    Global TF-IDF + NMF Centroid + Pseudoinverse  0.60      0.51     0.48      0.30     0.50      0.36     0.56
Table 5: Individual vs. global models on Avocado and enterprise users.


Figure 5: Effect of dimensionality on entity representations.

5.6 Practical Implications
Our current implementation has several optimizations intended for a production environment.
We maintain updates to the TF-IDF matrix in a streaming manner upon receipt of each email.
A periodic task, run every few days to every week, computes a fresh NMF representation,
using approximately one minute of computation time per user with an optimized implementation
based on Sparse BLAS operations in Intel MKL [20]. The process runs continuously for
thousands of users, scaling up to hundreds of thousands of users.

6 CONCLUSIONS
We have demonstrated approaches for learning task-agnostic user-conditioned embeddings that
outperform strong baselines and provide value in a range of downstream tasks.
User-conditioned approaches are privacy preserving and show substantial benefits over global
models, despite their lower data density. These promising results suggest a range of future
directions to explore. One clear next step is to extend our approach to include documents,
meetings, and other enterprise entities. Beyond that, embedding relationships between
entities could help in predicting more complex connections between them. Next, our
explorations in multitask modeling suggest that generalization across tasks also has value.
Evaluating the impact of multitask representations on new tasks through leave-one-out
experiments may help quantify this.

REFERENCES
 [1] Q. Ai, V. Azizi, X. Chen, and Y. Zhang. Learning heterogeneous knowledge base
     embeddings for explainable recommendation. Algorithms, 11(9):137, 2018.
 [2] Q. Ai, Y. Zhang, K. Bi, X. Chen, and W. B. Croft. Learning a hierarchical embedding
     model for personalized product search. In Proceedings of the 40th International
     ACM SIGIR Conference on Research and Development in Information Retrieval,
     pages 645–654, 2017.
 [3] N. O. Amer, P. Mulhem, and M. Géry. Toward word embedding for personalized
     information retrieval. CoRR, abs/1606.06991, 2016.
 [4] R. Balasubramanyan, V. R. Carvalho, and W. Cohen. Cutonce-recipient recommendation
     and leak detection in action. In AAAI, Workshop on Enhanced Messaging, 2008.
 [5] Y. Bao, H. Fang, and J. Zhang. TopicMF: Simultaneously exploiting ratings and
     reviews for recommendation. In Twenty-Eighth AAAI Conference on Artificial
     Intelligence, 2014.
 [6] P. N. Bennett and J. G. Carbonell. Combining probability-based rankers for action-item
     detection. In Human Language Technologies: The Conference of the North
     American Chapter of the Association for Computational Linguistics; Proceedings of
Privacy-Aware Personalized Entity Representations for Improved User Understanding                                        PrivateNLP 2020, Feb 7, 2020, Houston, Texas


     the Main Conference, pages 324–331, 2007.
 [7] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn.
     Res., 3:993–1022, Mar. 2003.
 [8] A. Bratko, B. Filipič, G. V. Cormack, T. R. Lynam, and B. Zupan. Spam filtering
     using statistical data compression models. JMLR, 7:2673–2698, Dec. 2006.
 [9] R. Caruana. Multitask learning. Mach. Learn., 28(1):41–75, July 1997.
[10] M. X. Chen, B. N. Lee, G. Bansal, Y. Cao, S. Zhang, J. Lu, J. Tsay, Y. Wang, A. M.
     Dai, Z. Chen, T. Sohn, and Y. Wu. Gmail smart compose: Real-time assisted
     writing. In Proceedings of the 25th ACM SIGKDD International Conference on
     Knowledge Discovery & Data Mining, KDD ’19, pages 2287–2295, New York,
     NY, USA, 2019. ACM.
[11] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing
     by latent semantic analysis. Journal of the American Society for Information
     Science, 41(6):391–407, 1990.
[12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of
     deep bidirectional transformers for language understanding. arXiv preprint
     arXiv:1810.04805, 2018.
[13] F. Diaz, B. Mitra, and N. Craswell. Query expansion with locally-trained word
     embeddings. CoRR, abs/1605.07891, 2016.
[14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint
     arXiv:1412.6980, 2014.
[15] M. Dredze, T. Brooks, J. Carroll, J. Magarick, J. Blitzer, and F. Pereira. Intelligent
     email: Reply and attachment prediction. In Proceedings of the 13th International
     ACM Conference on Intelligent user interfaces, pages 321–324, 2008.
[16] J. Goodman, G. V. Cormack, and D. Heckerman. Spam and the ongoing battle for
     the inbox. Communications of the ACM, 50(2):24–33, 2007.
[17] D. Graus, D. Van Dijk, M. Tsagkias, W. Weerkamp, and M. De Rijke. Recipient
     recommendation in enterprises using communication graphs and email content.
     In Proceedings of the 37th international ACM SIGIR conference on Research &
     development in information retrieval, pages 1079–1082, 2014.
[18] K. S. Hasan and V. Ng. Automatic keyphrase extraction: A survey of the state
     of the art. In Proceedings of the 52nd Annual Meeting of the Association for
     Computational Linguistics (Volume 1: Long Papers), pages 1262–1273, Baltimore,
     Maryland, June 2014. Association for Computational Linguistics.
[19] G. Hu. Personalized neural embeddings for collaborative filtering with text. arXiv
     preprint arXiv:1903.07860, 2019.
[20] Intel. Intel Math Kernel Library. Reference Manual. Intel Corporation, Santa Clara,
     USA, 2009. ISBN 630813-054US.
[21] J. Kim, Y. He, and H. Park. Algorithms for nonnegative matrix and tensor
     factorizations: a unified view based on block coordinate descent framework.
     Journal of Global Optimization, 58(2):285–319, Feb 2014.
[22] B. Klimt and Y. Yang. The enron corpus: A new dataset for email classification
     research. In European Conference on Machine Learning. Springer, 2004.
[23] S. Kottur, X. Wang, and V. Carvalho. Exploring personalized neural conversational
     models. In IJCAI, pages 3728–3734, 2017.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep
     convolutional neural networks. In Advances in neural information processing
     systems, pages 1097–1105, 2012.
[25] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix
     factorization. Nature, 401:788–791, 1999.
[26] M. Levit, A. Stolcke, R. Subba, S. Parthasarathy, S. Chang, S. Xie, T. Anastasakos,
     and B. Dumoulin. Personalization of word-phrase-entity language models. In
     Proc. Interspeech, pages 448–452. ISCA - International Speech Communication
     Association, September 2015.
[27] S. Liang, X. Zhang, Z. Ren, and E. Kanoulas. Dynamic embeddings for user profiling
     in twitter. In Proceedings of the 24th ACM SIGKDD International Conference
     on Knowledge Discovery & Data Mining, pages 1764–1773, 2018.
[28] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations
     of words and phrases and their compositionality. In Advances in Neural
     Information Processing Systems.
[29] W. Nawaz, Y. Han, K.-U. Khan, and Y.-K. Lee. Personalized email community
     detection using collaborative similarity measure. arXiv preprint arXiv:1306.1300, 2013.
[30] T. Nguyen and A. Takasu. Npe: neural personalized embedding for collaborative
     filtering. arXiv preprint arXiv:1805.06563, 2018.
[31] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation.
     In Proceedings of the 2014 Conference on Empirical Methods in Natural
     Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, Oct. 2014. Association
     for Computational Linguistics.
[32] A. Rattinger, J.-M. L. Goff, and C. Gütl. Local word embeddings for query expansion
     based on co-authorship and citations. In BIR@ECIR, 2018.
[33] M. Rudolph, F. Ruiz, S. Mandt, and D. Blei. Exponential family embeddings. In
     Advances in Neural Information Processing Systems, pages 478–486, 2016.
[34] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov.
     Dropout: A simple way to prevent neural networks from overfitting. Journal of
     Machine Learning Research, 15:1929–1958, 2014.
[35] G. Tang, J. Pei, and W.-S. Luk. Email mining: tasks, common techniques, and
     tools. Knowledge and Information Systems, 41(1):1–31, 2014.
[36] K. Toutanova, D. Chen, P. Pantel, H. Poon, P. Choudhury, and M. Gamon. Representing
     text for joint embedding of text and knowledge bases. In EMNLP, 2015.
[37] C. Van Gysel, B. Mitra, M. Venanzi, R. Rosemarin, G. Kukla, P. Grudzien, and
     N. Cancedda. Reply with: Proactive recommendation of email attachments.
     In Proceedings of the 2017 ACM on Conference on Information and Knowledge
     Management, CIKM, pages 327–336, New York, NY, USA, 2017.
[38] J. B. P. Vuurens, M. Larson, and A. P. de Vries. Exploring deep space: Learning
     personalized ranking in a semantic space. In Proceedings of the 1st Workshop on
     Deep Learning for Recommender Systems, DLRS 2016, 2016.
[39] L. Y. Wu, A. Fisch, S. Chopra, K. Adams, A. Bordes, and J. Weston. Starspace:
     Embed all the things! In AAAI, 2018.
[40] L. Yang, S. T. Dumais, P. N. Bennett, and A. H. Awadallah. Characterizing and
     predicting enterprise email reply behavior. In Proceedings of the 40th International
     ACM SIGIR Conference on Research and Development in Information Retrieval,
     pages 235–244, 2017.
[41] S. Yoo, Y. Yang, F. Lin, and I.-C. Moon. Mining social networks for personalized
     email prioritization. In Proceedings of the 15th ACM SIGKDD international
     conference on Knowledge discovery and data mining, pages 967–976, 2009.