<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Feb</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Privacy-Aware Personalized Entity Representations for Improved User Understanding</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>lemeln</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>huahme</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>vapolych</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>gilopez</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>yuantu</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>omkhan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>yeyiwang</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>chrisq</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>7</volume>
      <issue>2020</issue>
      <abstract>
        <p>Representation learning has transformed the field of machine learning. Advances like ImageNet, word2vec, and BERT demonstrate the power of pre-trained representations to accelerate model training. The effectiveness of these techniques derives from their ability to represent words, sentences, and images in context. Other entity types, such as people and topics, are crucial sources of context in enterprise use-cases, including organization, recommendation, and discovery of vast streams of information. But learning representations for these entities from private data aggregated across user shards carries the risk of privacy breaches. Personalizing representations by conditioning them on a single user's content eliminates privacy risks while providing a rich source of context that can change the interpretation of words, people, documents, groups, and other entities commonly encountered in workplace data. In this paper, we explore methods that embed user-conditioned representations of people, key phrases, and emails into a shared vector space based on an individual user's emails. We evaluate these representations on a suite of representative communication inference tasks using both a public email repository and live user data from an enterprise. We demonstrate that our privacy-preserving lightweight unsupervised representations rival supervised approaches. When used to augment supervised approaches, these representations are competitive with deep-learned multi-task models based on pre-trained representations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Pre-trained embeddings are a crucial technique in machine
learning applications, especially when task-specific training data is
scarce. For instance, groundbreaking work in image captioning
was enabled by reusing the penultimate layer of an object
recognition system to summarize the content of an image[
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. More
recently, contextualized embeddings are setting the state-of-the-art
in a range of natural language processing tasks [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Training
models to extract reusable representations from data is now an obvious
investment. The next key research question is which context to
leverage.
      </p>
      <p>Our research is situated in the area of User Understanding:
organizing the information, documents, and communications that
are available to each user within an organization. Users now
commonly retain huge mailboxes of written communication; members
of larger organizations also have access to large repositories of
access-controlled documents that are not publicly available. Our
goals are to help each user find, classify, and act upon these
growing information stores, then to acquire and organize information,
including facts and relationships among these entities. A crucial
enabling step is to build reusable representations of this information.</p>
      <p>Most representation learning uses large, publicly-available
document stores to build generic embeddings. We believe there is also
great value in user-conditioned representations: representations of
phrases and contacts for each user learned on the information
uniquely available to that user. First, building user-conditioned
representations provides a huge amount of context. Often when there
are ambiguous or overloaded concepts, the key people surrounding
their usage can disambiguate. Furthermore, a given user may extend
the meanings of a given concept as they document and
communicate new ideas. Perhaps most importantly, training a model based
on only the communications and documents available to a given
user provides a clear and intuitive notion of privacy. Whenever we
train on data beyond any user’s normal visibility, there is some
potential for capturing and surfacing information outside their view.
Differential Privacy helps limit the exposure of any individual user,
but preventing leakage across groups is more difficult. For instance,
certain privileged information may be discussed heavily by many
members of an administrative board, yet this information should
not be shared broadly across the whole organization. When training
a user’s model on only data that that user can see, the possibility
for leaking information is removed. From the perspectives of both
leveraging a crucial signal as well as maintaining user privacy and
trust, user-conditioned representations hold great promise.</p>
      <p>
        User-conditioned learning comes at a cost. Data density
decreases dramatically. State-of-the-art deep learned representations
typically train on billions of tokens [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], whereas an individual
user’s inbox may only have a few thousand emails. Thus, we
explore shallower personalized approaches with lower sample
complexity (though shallow models can be mixed with deep generic
models for empirical gains [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). Furthermore, training must be
performed for every user separately within the organization; in our
case, this entails separate training runs for hundreds of millions of
users. Because the information available to the user is constantly
changing, maintaining fresh representations is also a challenge.
      </p>
      <p>As computation and storage become cheaper, the overhead of
maintaining user-conditioned models is tractable only if the models
are light-weight. Furthermore, we focus on task-agnostic
representations that benefit a range of scenarios, amortizing the cost of
computation. Finally, using models trained only on one user’s data
benefits privacy, which is an increasing concern for organizations
and individuals.</p>
    </sec>
    <sec id="sec-2">
      <title>Contributions</title>
      <p>We present the first efforts in building per-user representations:
content-based representations that embed disparate entities,
including contacts and key phrases, into a common vector space. These
entity representations are different for each user: the same key
phrase and contact may have very different representations across
two users depending on their context. We focus on slow-changing
entities like contacts and key phrases to minimize the impact of
delayed retraining: although one’s impression of their collaborators
may shift over years, months, or perhaps weeks, a representation
that is a few days old is still useful. To embed rapidly arriving
and changing items such as documents and emails, we present
approaches that assemble representations of rapidly changing entities
from their content, including related contacts and key phrases.</p>
      <p>
        We evaluate these representations on a range of downstream
tasks, including action-prediction and content-prediction. Simple,
unsupervised approaches, especially non-negative matrix
factorization (NMF) [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], produce substantial improvements in accuracy,
outperforming task-specific and multi-task neural network approaches.
      </p>
      <p>We compare user-conditioned representations to representations
learned at the organization level, where data from multiple users
are combined together into a larger undifferentiated store.
User-conditioned representations mostly outperform the
organization-level approaches, despite decreased data density, presumably
because the additional context provides helpful signals to models.
Furthermore, user-conditioned representations sidestep issues related
to privacy preservation by not mixing data across user mailboxes.</p>
    </sec>
    <sec id="sec-3">
      <title>RELATED WORK</title>
      <p>
        Email mining [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ] has been widely studied from different angles,
both for content and action classification. Spam detection has
received considerable attention both from a content identification
and filtering view [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] as well as from a process perspective [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
Folder prediction is another task that can help better organize
incoming emails [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Email content has also been used for social
graph analysis to learn associations between people, both for
recipient prediction [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and sender prediction [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Action prediction
tasks that have been considered in the context of emails include
reply prediction [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ], attachment prediction [
        <xref ref-type="bibr" rid="ref15 ref37">15, 37</xref>
        ], and generic
email action prediction [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In this paper, we use recipient
prediction, sender prediction, and reply prediction as representative
tasks to evaluate the quality of our learned representations. All
prior work on these and similar tasks has relied on per-task feature
engineering and supervised model training. We show how entity
representations generated in a task-agnostic manner can be used
both in an unsupervised and in a supervised setting for these tasks.
      </p>
      <p>
        Entity representations have been used extensively to provide
personalized recommendations. Most such models build global
representations of users and items in the same latent space and then
determine the similarity between the user and item through cosine
similarity. The user embeddings can be built in a collaborative
filtering setting by leveraging a user’s past actions such as clicks [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ],
structured attributes of items interacted with, ratings offered on past
items [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], or even past search queries [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. An extension to such
approaches is to combine embeddings of words or phrases with
other types of data [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], such as embeddings of users [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] or their
previously favored items [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Our entity representations also embed
phrases with contacts, but they are task-agnostic.
      </p>
      <p>
        Personalized language models have shown benefits in speech
recognition [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], dialog [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], information retrieval [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ], and
collaborative filtering [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. These approaches model the user, but the
representations are not generated on a per-user level; instead, all
users share the same representation for an item or phrase. In our
approach, the same item (phrase or contact) will have a different
representation according to each user, since the entity
representations are generated per user only considering the data available to
that user. Nawaz et al. [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] present a technique to perform social
network analysis to identify similar communities of contacts within
a user’s own data, but their approach is limited to a single task. Yoo
et al. [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ] describe an approach for obtaining representations of
emails using personalized network connectivity features and
contacts. Their representations are used as inputs to machine learning
models to predict the importance of emails. In our approach, key
phrases, contacts, and emails are all embedded in the same space.
Thus, we can use the task-agnostic representations for a variety of
tasks and obtain similarity scores for any pair of entities.
      </p>
      <p>
        Starspace [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ] introduces a neural-based embedding method that
maps different types of entities (graphs, words, sentences,
documents, etc.) to the same space. While the entities are embedded
in the same space, like our work, their training uses global
information, and the loss function on which the network is trained is
task-dependent. Our approach allows reusing the same
representations obtained using local data across a variety of tasks.
      </p>
      <p>
        Our evaluation tasks bear some similarity to the evaluation of
knowledge base completion through embeddings of text and
entities [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]. However, our entities are not curated knowledge base
entities; they are phrases and people known to a particular user.
      </p>
      <p>
        In query expansion, locally trained embeddings can outperform
global or pre-trained embeddings [
        <xref ref-type="bibr" rid="ref13 ref32">13, 32</xref>
        ] by incorporating more
relevant local context. We exploit a similar insight in training only
on the user’s own data and directly incorporating her context. Amer
et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] present an approach that trains a per-user representation,
though their per-user embeddings perform worse than global or
pre-trained embeddings. In this paper, we demonstrate methods to
train per-user representations that not only outperform global
representations but are also generated in a task-agnostic manner.
      </p>
    </sec>
    <sec id="sec-4">
      <title>ENTITY REPRESENTATIONS</title>
      <p>
        We developed representations for three entity types: key phrases,
contacts, and emails. Key phrases (typically noun phrases) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] can
appear anywhere in the body or subject of an email. Restricting
to extracted key phrases limits the total number of entities for
which representations must be learned. By “contacts,” we refer
to individual email addresses that appear in the From, To, or CC
fields of an email. For slow-changing entities like key phrases and
contacts, we can periodically regenerate a stored representation. We
represent fast-changing entities such as emails with light-weight
compositions of pre-trained entity embeddings (contact and key
phrase embeddings) to minimize computational expense.
      </p>
      <p>
        Since the median inbox size will be small (in both our data sets it
is around seven thousand emails), there is not enough data per user
to train useful deep learned representations. Indeed, our early
attempts to train user-conditioned word2vec [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] embeddings yielded
poor results and are not reported here. Therefore we only
considered approaches that were likely to perform well at low data density.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Key Phrase and Contact Representations</title>
      <p>We compute unsupervised entity representations for contacts and
key phrases by associating them with concatenated documents.
These concatenated documents are assembled from the user’s
original documents; the emails in our experiments have From, To, CC,
Body, and Subject fields. Given a particular entity e, its
concatenated document de is the concatenation of every email m in a user’s
inbox such that e appears in any of those fields. This concatenation
is done on each field f independently: every new concatenated
document de will have a corresponding field de,f for each field f
in the original document. We use ⊕ to denote concatenation.
de,f = ⊕_{m ∈ Me} mf , where Me = {m : ∃f such that e ∈ mf } (1)
Stop words on the Scikit-learn stopword list are removed, as are
terms that appear in more than 30% or less than 0.25% of training
emails. We then generate a sparse numerical entity-by-term matrix,
using TF-IDF for most methods or just a TF matrix in the case of LDA.
Initially one matrix is computed for each field f using the relevant
portion of the concatenated document collection Df = {de,f }e .
Each matrix is scaled according to a weighting factor wf to balance
its contributions, and finally these matrices are concatenated to
form a single matrix T .</p>
      <p>T = ⊕_f √wf · term-matrix(Df ), where Σ_f wf = 1 (2)
The weights of the different email fields are treated as
hyperparameters and tuned empirically to perform well on the evaluation
tasks. We found that weights of 0.4, 0.3, 0.2, 0.05, and 0.05 for the
Body, Subject, From, To, and CC fields worked well. The rows of T
are sparse representations of entities – a simple and safe baseline.
We explored LDA, LSA, and NMF as a means of encouraging softer
matching through dimensionality reduction.</p>
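      <p>As a concrete sketch of Equations 1 and 2 (toy emails and illustrative helper names, not the authors’ pipeline; scikit-learn assumed), the following assembles per-entity concatenated documents field by field and builds the weighted, field-concatenated TF-IDF matrix T using the tuned weights reported above:
```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy emails; the field weights are the tuned values reported above.
FIELDS = ["Body", "Subject", "From", "To", "CC"]
WEIGHTS = {"Body": 0.4, "Subject": 0.3, "From": 0.2, "To": 0.05, "CC": 0.05}

emails = [
    {"Body": "quarterly budget review", "Subject": "budget",
     "From": "alice", "To": "bob", "CC": "carol"},
    {"Body": "budget numbers look good", "Subject": "re budget",
     "From": "bob", "To": "alice", "CC": "carol"},
]
entities = ["alice", "bob", "budget"]

def concatenated_doc(entity, mails):
    """Eq. 1: for each field f, concatenate field f of every email
    in which the entity appears (in any field)."""
    relevant = [m for m in mails if any(entity in m[f] for f in FIELDS)]
    return {f: " ".join(m[f] for m in relevant) for f in FIELDS}

docs = {e: concatenated_doc(e, emails) for e in entities}

# Eq. 2: one TF-IDF matrix per field, scaled by sqrt(w_f), then concatenated.
blocks = []
for f in FIELDS:
    corpus = [docs[e][f] for e in entities]
    tf = TfidfVectorizer()  # the paper also removes stop words and filters by df
    blocks.append(np.sqrt(WEIGHTS[f]) * tf.fit_transform(corpus))
T = hstack(blocks)  # each row of T is a sparse entity representation
```
The rows of T are then passed to the dimensionality-reduction methods described next.</p>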
      <p>3.1.1 TF-IDF. Our baseline representation technique is sparse
unigram TF-IDF vectors produced from the concatenated
documents.</p>
      <p>
        3.1.2 Latent Dirichlet Allocation (LDA). Latent topic models
using LDA [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] over the term frequency matrix of the concatenated
documents (not the TF-IDF matrix) learn a mixture of topics for
each document. These learned vectors can act as entity
representations. We can vary the number of latent topics to determine the
dimensionality of the resulting embeddings.
      </p>
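      <p>A minimal sketch of this step with scikit-learn (toy documents standing in for concatenated entity documents): LDA is fit on raw term counts rather than TF-IDF, and the per-document topic mixtures serve as embeddings.
```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy concatenated documents; in our setting each row is one entity's document.
docs = ["budget review meeting budget", "ski trip weekend trip",
        "budget meeting notes", "weekend ski plans"]

counts = CountVectorizer().fit_transform(docs)  # term frequencies, not TF-IDF
lda = LatentDirichletAllocation(n_components=2, random_state=0)
embeddings = lda.fit_transform(counts)  # rows are topic mixtures per entity
```
Varying n_components controls the embedding dimensionality.</p>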
      <p>
        3.1.3 Latent Semantic Analysis (LSA). A classic method for
reducing sparsity, LSA [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] builds a low-rank approximation of a
TF-IDF matrix T using the singular value decomposition: T = UΣVᵀ.
      </p>
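      <p>For illustration, the truncated SVD variant of this decomposition on a small TF-IDF matrix (scikit-learn; toy documents, not our corpus):
```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["budget review meeting", "ski trip weekend", "budget meeting notes",
        "weekend ski plans", "quarterly budget report"]
T = TfidfVectorizer().fit_transform(docs)

# Low-rank approximation T ≈ U Σ V^T; the rows of UΣ are dense embeddings.
svd = TruncatedSVD(n_components=2, random_state=0)
dense = svd.fit_transform(T)  # equals U·Σ for the top two singular values
```
</p>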
      <p>3.1.4 Non-negative Matrix Factorization (NMF). The SVD
reconstruction has a few problems: the values of the matrix may be
positive or negative, and there is no explicit regularization term.
Together, these issues may lead to strange or divergent weights,
especially when the data is difficult to model with lower rank.
Non-negative matrix factorization (NMF) addresses these issues by
constraining the low-rank matrices to be positive and adding a
regularization term [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. Specifically, given an input matrix T ∈ R^(m×n),
we try to find matrices W ∈ R^(m×d) and H ∈ R^(d×n) to minimize
|T − W H| + λ(|W| + |H|) (3)
where λ is a regularization weight and |·| is the Frobenius norm.
The W matrix serves as a representation for the entities. We
efficiently compute NMF through the Hierarchical Alternating Least Squares
algorithm [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].</p>
      <p>[Figure 1: Reducing the dimensionality of the concatenated document matrix. The original corpus (emails as rows over words, contacts, and key phrases) is converted to a TF-IDF matrix, which is factorized into low-rank matrices W and H.]</p>
    </sec>
    <sec id="sec-6">
      <title>Email Representations</title>
      <p>The vocabulary of key phrases and contacts in one’s mailbox is
likely to grow slowly, and their meanings and relationships will also
evolve gradually. By comparison, many new emails arrive every
day, so the “vocabulary” of email entities is constantly increasing.
Thus, while it is possible to train representations for emails in the
same way that we do for key phrases and contacts, updating email
representations on an ongoing basis would imply vast storage and
computation requirements. So we handle emails differently,
computing representations on demand through compositions of other
entity representations. In this paper, we explored four different
email composition models: Centroid, Pointwise Max,
Pseudoinverse, and the combination of Centroid and Pseudoinverse.</p>
      <p>[Figure 2: Example email with sender, recipients, key phrases, and replied-to annotations. Each task is constructed by obscuring a relevant entity, then reconstructing it given the remaining context.]</p>
      <p>3.2.1 Centroid. One simple email representation is the average
of the representations of all key phrases and contacts in an email.</p>
      <p>3.2.2 Pointwise Max. Another commonly used pooling
operation is max: we retain the largest value along each dimension. This
approach increases the sensitivity to strongly-weighted features in
the underlying key phrase and contact representations.</p>
      <p>3.2.3 Pseudoinverse. The H matrix from Equation 3 can serve
as a map from the low-rank concept space into the word/entity
space. Although H is not a square matrix and hence not invertible,
the Moore-Penrose pseudoinverse of H, namely H+, can act as a
map from email content into the entity representation space. We
multiply the TF-IDF vector associated with a given email by H+ to
project into the entity representation space. Unlike the previous
two models, this has the benefit of including information from
non-key-phrase unigrams in the email.</p>
      <p>3.2.4 Centroid + Pseudoinverse. Centroid and pseudoinverse
representations are summed to combine the benefits of each.</p>
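      <p>The four composition operators can be sketched on toy vectors (NumPy; the entity vectors and H factor are random stand-ins for trained representations):
```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 4, 6
entity_vecs = rng.random((3, d))   # representations of the email's key
                                   # phrases and contacts (rows of W)
H = rng.random((d, vocab))         # concept-to-term factor from Equation 3
email_tfidf = rng.random(vocab)    # TF-IDF vector of the raw email text

centroid = entity_vecs.mean(axis=0)        # 3.2.1 Centroid
pmax = entity_vecs.max(axis=0)             # 3.2.2 Pointwise Max
pseudo = email_tfidf @ np.linalg.pinv(H)   # 3.2.3 Pseudoinverse (H+)
combined = centroid + pseudo               # 3.2.4 Centroid + Pseudoinverse
```
All four compositions land in the same d-dimensional entity space, so an email can be compared to any key phrase or contact directly.</p>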
    </sec>
    <sec id="sec-7">
      <title>4 EVALUATION METHODOLOGY</title>
    </sec>
    <sec id="sec-8">
      <title>4.1 Evaluation Tasks</title>
      <p>We evaluate entity representations according to their performance
on four email mining tasks: sender prediction, recipient prediction,
related key phrase prediction, and reply prediction. The first three
tasks are content prediction tasks, whereas in reply prediction we
use the email content to predict a user action.</p>
      <p>Content prediction tasks are formulated as association tasks. We
remove a target entity from an email and randomly select nineteen
distractor entities from the user’s inbox not already present in the
email. We use the cosine similarity between the email
representation and the twenty candidate target entity representations to
predict the true target. Reply prediction is treated as a binary
classification problem, using email representations as input features.
Entity representation methods that yield more accurate predictions
are considered superior.</p>
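      <p>The association protocol can be sketched as follows (synthetic vectors; the target is constructed to correlate with the email representation):
```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d = 50
email_vec = rng.standard_normal(d)                 # composed email vector
target = email_vec + 0.1 * rng.standard_normal(d)  # held-out true entity
distractors = [rng.standard_normal(d) for _ in range(19)]

candidates = [target] + distractors     # twenty candidates in total
scores = [cosine(email_vec, c) for c in candidates]
predicted = int(np.argmax(scores))      # index 0 means a correct prediction
```
</p>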
      <p>These tasks readily suggest real life applications. Recipient
recommendation is already a standard feature in many email clients.
Similarly, an email client may predict whether an email from an
unfamiliar address comes from a known sender and prompt the
user to add the new address to that sender’s contact information.
Predicting latent associations between emails and key phrases
enables automatic topic tagging and foldering. Finally, an email client
may use reply prediction to identify important emails to which an
inbox owner has not yet responded and remind the user to reply.</p>
      <p>4.1.1 Task-Specific Model Architectures. We aim to construct
task-agnostic user-conditioned representations: they should be
useful across a variety of tasks without having to be tuned to each
one separately. While this makes the representations reusable and
reduces computational expense, separate models trained on each
specific task often perform better. To evaluate this tradeoff, we
compare the unsupervised similarity-based method described above
to supervised task-specific baseline models trained on each of the
association tasks. We also evaluate how well the user-conditioned
representations perform as feature inputs to task-specific models,
since their utility as feature representations is a key consideration.</p>
      <p>To train a task-specific model for sender, recipient, or key phrase
prediction, we reformulate these association tasks as classification
problems. In each case, we train the classifier to predict the target
entity using its email representation. As above, we remove a target
entity from an email and select nineteen distractors. Instead of
cosine similarity, we use the trained classifier to score the twenty
candidate entities and predict the one with the highest score.</p>
      <p>
        We experimented with a variety of modeling techniques for both
task-specific baseline models and task-specific models trained on
entity representations. The best results consistently came from
simple two-layer feed-forward neural classifiers using ReLU
activations, a sigmoid output layer, batch normalization, dropout [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ],
and trained using cross-entropy loss and Adam [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. However, each
scenario achieved best results using slightly different task
formulations and architectures.
      </p>
      <p>
        The baseline models were formulated as multiclass classifiers,
as depicted in Figure 3a. Emails are represented as binary
vectors with each element representing the presence or absence of a
unigram or contact. These vectors index into a 300 dimensional
embedding layer initialized with pre-trained GloVe vectors [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ];
out-of-vocabulary items received random initializers. The embedding
layer was also trained, allowing the model to learn representations
for out-of-vocabulary terms. We experimented with two variants:
one in which contacts were included as features (“Pre-trained +
Contacts”) and one in which they were not (“Pre-trained”).
      </p>
      <p>For models trained on entity representations, shown in Figure 3b,
we found the best results by treating the candidate target entity
representations and the email representations as separate inputs.
These 500 dimensional representations are passed through two
dense layers of width 500 and 300 respectively and a sigmoid
output layer, which returns a score representing the likelihood that
the input entity is indeed present in the input email. In Tables 4
and 5, “TF-IDF + NMF Centroid” and “TF-IDF + NMF Centroid +
Pseudoinverse” model variants both share this architecture.</p>
      <p>We also considered a multitask model jointly trained on all four
evaluation tasks. This model, shown in Figure 3c, is identical in
its architecture and training to the task-specific baseline model in
Figure 3a except that instead of one output layer it has |N | output
layers corresponding to the |N | tasks. Relative loss weights were
used to balance the training impact from each task since the tasks
had varying numbers of training examples.</p>
    </sec>
    <sec id="sec-9">
      <title>Evaluation Metrics</title>
      <p>We measure our performance on the association tasks through
accuracy (percentage of successful predictions) and average recall.
A successful prediction is one where the target entity is scored
highest among all candidates. Since there is one target and nineteen
distractors, random guessing achieves an accuracy of 0.05.</p>
      <p>Accuracy can allow a small number of frequently occurring
entities to have a disproportionate effect. For instance, in sender
prediction the majority of emails may be from a small set of senders:
performance on these senders will skew the results. Thus, we also
report the average recall, an efficient measure for skewed
distributions. To obtain the average recall, we calculate the recall for
each possible target: the percentage of times it was successfully
predicted. We then report the average recall over all targets
without weighting by the frequency of the target. Together, accuracy and
average recall provide a reliable measure of the association. If one
method boosts accuracy by only learning about frequent targets,
the average recall will be impacted negatively. Similarly, a reduced
recall of the frequent targets will impact the accuracy.</p>
      <p>[Table 1: Emails, phrases, contacts, and reply rate per user for the Avocado (55 users) and Enterprise (53 users) corpora.]</p>
      <p>For reply prediction, we report the area under the precision-recall
curve (PR-AUC), which is useful even when classes are imbalanced.</p>
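      <p>A toy computation contrasting the two measures (hypothetical predictions, not our data):
```python
from collections import defaultdict

# (true target, predicted target) pairs; one frequent sender dominates.
pairs = [("alice", "alice"), ("alice", "alice"), ("alice", "alice"),
         ("alice", "alice"), ("bob", "alice"), ("carol", "carol")]

accuracy = sum(t == p for t, p in pairs) / len(pairs)  # 5/6

hits, totals = defaultdict(int), defaultdict(int)
for t, p in pairs:
    totals[t] += 1
    hits[t] += (t == p)
recall = {t: hits[t] / totals[t] for t in totals}  # alice 1.0, bob 0.0, carol 1.0
avg_recall = sum(recall.values()) / len(recall)    # (1 + 0 + 1) / 3
```
Accuracy is high because the frequent sender is predicted well, but missing the infrequent target entirely drags the macro-averaged recall down.</p>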
    </sec>
    <sec id="sec-10">
      <title>Evaluation Corpora</title>
      <p>We evaluate our techniques on two separate repositories of emails,
Avocado emails and live user emails from a large enterprise. The
properties for each corpus are listed in Table 1. For the first
repository, we use mailboxes from the Avocado Research Email
Collection1. For the second dataset, we use live user email data from a
real-world enterprise with thousands of users (called enterprise
users from here on for brevity). These emails are encrypted and
off-limits to human inspection. We randomly select a set of users who
are related to each other by sampling from the same department.
This increases the possibility of overlap between users and allows
some shared context. This property will be helpful when we want
to compare a global model versus user-conditioned representations.</p>
      <p>For both datasets, we filter out users with fewer than 3,500 or
greater than 20,000 emails. Users with more than 20,000 emails were
outliers and, in the enterprise dataset, were likely to have many
machine generated emails, which can make the evaluation tasks
easier. We set the minimum number of emails to 3,500 somewhat
arbitrarily because in our enterprise scenario it is almost always
possible to obtain this many for a given user by extending the date
range. We plan to investigate the performance of user-conditioned
representations produced from smaller inboxes in future work.</p>
    </sec>
    <sec id="sec-11">
      <title>5 EXPERIMENTS</title>
      <p>We show that user-conditioned entity representations outperform
strong global model baselines. NMF applied to our version of
TF-IDF matrices proves most effective among the methods surveyed for
representing key phrases and contacts. The combination of centroid
and pseudoinverse methods detailed in Section 3.2.4 works best for
composing email representations. While on some tasks supervised
task-specific baseline models achieved higher accuracy than entity
representation similarity-based methods, the latter were
competitive and had significantly better recall. Task-specific models trained
on entity representations also outperformed task-specific models
trained on baseline features, demonstrating the entity
representations’ value as feature inputs. Our results here also show that entity
representations are competitive with multitask learning despite the
fact that they are trained without knowledge of the downstream
tasks. We discuss these results in the following subsections.</p>
      <p>1 https://catalog.ldc.upenn.edu/LDC2015T03</p>
    </sec>
    <sec id="sec-12">
      <title>5.1 User-Conditioned Representations</title>
      <p>Slow Changing Entities: Key Phrases and Contacts. We compare
unsupervised methods for producing key phrase and contact
representations in Table 2. For LDA, LSA, and NMF, we perform
hyperparameter tuning on a single enterprise user and report results for
all techniques with their best settings. Since the evaluation tasks
require representations for email as well as key phrases and contacts,
we use the Centroid email representation in each case to ensure a
fair comparison. Predictions are based on cosine similarity.2 NMF
with regularization outperformed all other methods. Regularization
leads to more effective representations for NMF; comparing
unregularized NMF to LSA suggests that non-negativity is also a helpful
bias. Some of the most substantial gains are in recall, especially
when compared to sparse TF-IDF baselines.</p>
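      <p>As an illustration of the factorization step, the following is a minimal sketch of L2-regularized NMF via multiplicative updates on a toy TF-IDF matrix. The solver, the regularizer placement, and all names are our assumptions for illustration; the paper does not specify its optimizer in this section:</p>

```python
import numpy as np

def nmf(X, k, l2=0.1, iters=200, seed=0):
    """Factor a non-negative matrix X (entities x emails) as X ~ W @ H
    with L2 regularization, using multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + 1e-3
    H = rng.random((k, n)) + 1e-3
    for _ in range(iters):
        # Standard multiplicative updates with an L2 penalty term.
        H *= (W.T @ X) / (W.T @ W @ H + l2 * H + 1e-9)
        W *= (X @ H.T) / (W @ H @ H.T + l2 * W + 1e-9)
    return W, H  # rows of W are the entity representations

# Toy TF-IDF-like matrix: 4 key phrases x 5 emails, with two topics.
X = np.array([[1.0, 0.8, 0.0, 0.0, 0.1],
              [0.9, 1.0, 0.0, 0.1, 0.0],
              [0.0, 0.0, 1.0, 0.9, 0.8],
              [0.1, 0.0, 0.8, 1.0, 0.9]])
W, _ = nmf(X, k=2)
cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
# Phrases that co-occur across emails get similar representations.
print(cos(W[0], W[1]) > cos(W[0], W[2]))
```

      <p>Non-negativity gives the factors a parts-based character, which is consistent with the comparison against LSA above.</p>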
      <p>Composition for Fast Changing Entities: Email. Different
compositional operations for representing email are explored in Table 3.
Because NMF performed best across all tasks, we restrict our
attention to these representations. The centroid method outperforms
others on accuracy, though the pseudoinverse approach is the best
for recall, presumably because it can incorporate information from
unigrams in the represented email and not just the key phrases
and contacts. A linear combination of centroid and pseudoinverse
representations provides the best results for accuracy and almost
matches pseudoinverse for recall.</p>
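      <p>The centroid and pseudoinverse compositions, and their linear combination, can be sketched as follows; the mixing weight and all helper names are hypothetical, intended only to illustrate the two operations:</p>

```python
import numpy as np

def compose_email(W, entity_idx, x, alpha=0.5):
    """Compose an email representation in the shared entity space.

    W          : entity-by-dimension factor matrix from NMF (m x k)
    entity_idx : indices of the key phrases/contacts in the email
    x          : the email's TF-IDF column over all m rows
    alpha      : mixing weight (a free parameter in this sketch)
    """
    centroid = W[entity_idx].mean(axis=0)   # mean of present entity vectors
    pseudo = np.linalg.pinv(W) @ x          # least-squares solve of x ~ W h
    return alpha * centroid + (1 - alpha) * pseudo

W = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
x = np.array([1.0, 0.8, 0.0])               # email touching rows 0 and 1
rep = compose_email(W, [0, 1], x)
print(rep.shape)                            # (2,)
```

      <p>The pseudoinverse term projects the full TF-IDF vector through the factorization, which is why it can pick up unigram signal beyond the listed entities.</p>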
    </sec>
    <sec id="sec-13">
      <title>5.2 Task-Specific Models</title>
      <p>Task-Agnostic vs. Task-Specific. Unsupervised, task-agnostic
approaches are versatile and reusable, but they may underperform
relative to supervised models tuned to specific tasks. As described
in Section 4.1.1, we explore this trade-off by comparing the
performance of entity representation similarity-based methods against
task-specific baseline models trained on the evaluation tasks. For
Avocado, the accuracy is indeed better with the task-specific
Pre-trained and Pre-trained + Contacts models than with the best
representation methods (TF-IDF + NMF Centroid and TF-IDF NMF
Centroid + Pseudoinverse),3 as shown in Table 4. However, the
TF-IDF + NMF Centroid + Pseudoinverse representations achieved
significantly better recall for all three content prediction tasks and
better accuracy in key phrase prediction, again indicating their
ability to avoid over-optimizing for frequently occurring entities.
This model produces even better results on the enterprise data
set, where its accuracy is competitive with both of the Pre-trained
models and its improvement in recall is even more dramatic. The
higher number of contacts in the enterprise set enables better joint
modeling with the content, allowing the entity representations to
perform better in this setting. We can see that unsupervised entity
representations are competitive with supervised baselines.</p>
      <p>Entity Representations as Input Features. As our results suggest,
user-conditioned entity representations are useful as input features
to supervised models. To assess their value as feature
representations, we compare task-specific models trained on entity
representations with task-specific baselines, as described in Section 4.1.1.
On Avocado, the entity representation-based task-specific models,
TF-IDF + NMF Centroid and TF-IDF NMF Centroid +
Pseudoinverse, outperform (or in a few cases match) the baselines on every
task and metric. We see similar results on enterprise data, except a
marginally lower reply prediction PR-AUC with entity-based
task-specific models. Comparing the Avocado and enterprise results,
we can see that the performance on all tasks is much better on
enterprise users. Our hypothesis is that the larger contact
vocabulary in enterprise (1,431 contacts per user on average) compared
to Avocado (average 210 contacts per user) makes sender and
recipient tasks easier: the distractors are sampled from a larger pool
of contacts, and therefore less likely to be related and easier to
screen out. In the case of reply prediction, we believe the higher
PR-AUC stems from enterprise users receiving a higher volume
of machine-generated emails, which have more predictable reply
behavior.</p>
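      <p>Feeding entity representations into a supervised model can be sketched with a simple classifier; the logistic-regression stand-in, the toy features, and all names below are our illustration, not the paper's task-specific architecture:</p>

```python
import numpy as np

def train_logreg(F, y, lr=0.5, epochs=500):
    """Logistic regression on feature matrix F (n x d) and labels y in
    {0, 1}; a stand-in for a task-specific model whose inputs are entity
    representations (e.g. a contact vector concatenated with an email
    vector)."""
    w = np.zeros(F.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(F @ w + b)))   # predicted probabilities
        g = p - y                            # gradient of the log loss
        w -= lr * F.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

# Toy features: a 2-d "email" vector concatenated with a 2-d "contact" vector.
F = np.array([[0.9, 0.1, 0.8, 0.2],
              [0.8, 0.2, 0.9, 0.1],
              [0.1, 0.9, 0.2, 0.8],
              [0.2, 0.8, 0.1, 0.9]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w, b = train_logreg(F, y)
p = 1 / (1 + np.exp(-(F @ w + b)))
print((p.round() == y).all())
```
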
    </sec>
    <sec id="sec-14">
      <title>5.3 User-Conditioned vs. Global Models</title>
      <p>
        Each set of user-conditioned representations is trained on far
less data than most representation learning techniques, but
personalization is a powerful source of context. While our primary
reason for focusing on user-conditioned entity representations is
to avoid privacy leaks, we want to know how they compare against
non-privacy-aware “global” representations trained on data from
every user in an organization. In Table 5 we see that user-conditioned
representations are significantly better on all tasks across all
metrics compared to the global versions of those representations. This
indicates that, for these models, the local context of a user is more
important than training on a larger data set. We see a similar trend
with the Pre-trained + Contacts and Global Pre-trained + Contacts
models, though the global variant outperforms the user-conditioned
one in sender prediction on Avocado. On reply prediction, global
models trained using representations perform similarly to Yang et
al. [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ] without any task-specific feature engineering.
      </p>
      <p>2 Reply prediction is difficult to evaluate in an unsupervised setting; hence, it is not
reported here.</p>
      <p>
        3 Our results for sender and recipient prediction through an unsupervised task-agnostic
representation are in the same range as those reported by Graus et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] (0.66).
      </p>
    </sec>
    <sec id="sec-15">
      <title>5.4 Unsupervised vs. Multi-Task Approaches</title>
      <p>
        Our primary focus has been unsupervised entity representation
computation. An alternative approach is to induce representations
in a multitask learning setting [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Multitask models often achieve
better performance than separate models trained on the same tasks
and, indeed, as seen in Table 4, the multitask model described in
Section 4.1.1 outperforms task-specific models trained on the same
Pre-trained + Contacts feature representation.
      </p>
      <p>On Avocado, the best multitask model achieves significantly
better accuracy in sender and recipient prediction than Pre-trained and
Pre-trained + Contacts methods; entity-based task-specific methods
are still competitive on recall. We observe the same trend with
enterprise, where multi-task models outperform task-specific Pre-trained
+ Contacts, though entity-based task-specific models outperform
multi-task on all tasks and metrics except reply prediction PR-AUC.</p>
      <p>Thus the unsupervised methods presented here are competitive
with multitask learning on recall despite the fact that they are
trained without knowledge of the downstream tasks, and the
task-specific entity-based models are competitive with the multi-task
method on accuracy and better on recall.</p>
    </sec>
    <sec id="sec-16">
      <title>5.5 The Effect of Data Size and Dimension</title>
      <p>To explore the impact of data density, Figure 4 plots sender
prediction accuracy using TF-IDF + NMF Centroid representations
against the number of emails in a user’s mailbox. Accuracy does
not vary substantially, though average recall improves: additional
data benefits entity representation on average. We observed similar
trends for other tasks and other models.</p>
      <p>We plot the effect of varying dimension sizes for all tasks using
the TF-IDF + NMF Centroid approach in Figure 5 for Avocado users.
Representations of dimension 400 and 500 consistently achieve best
results for both accuracy and recall.</p>
    </sec>
    <sec id="sec-17">
      <title>5.6 Practical Implications</title>
      <p>
Our current implementation has several optimizations intended for
a production environment. We update the TF-IDF
matrix in a streaming manner upon receipt of each email. A periodic
task, run every few days to every week, computes a fresh NMF
representation, using approximately one minute of computation time
per user with an optimized implementation based on Sparse BLAS
operations in Intel MKL [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. The process runs continuously for
thousands of users and scales to hundreds of thousands of users.
      </p>
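      <p>The streaming TF-IDF bookkeeping can be sketched as below; a minimal illustration with hypothetical class and method names, not the production implementation:</p>

```python
import math
from collections import Counter

class StreamingTfidf:
    """Maintain per-email term counts and document frequencies
    incrementally, one email at a time; TF-IDF values are derived
    on demand rather than recomputed from scratch."""
    def __init__(self):
        self.doc_freq = Counter()   # number of emails containing each term
        self.docs = []              # per-email term counts

    def add_email(self, terms):
        counts = Counter(terms)
        self.doc_freq.update(counts.keys())
        self.docs.append(counts)

    def tfidf(self, doc_id, term):
        n = len(self.docs)
        tf = self.docs[doc_id][term]
        df = self.doc_freq[term]
        idf = math.log(n / df) if df else 0.0
        return tf * idf

s = StreamingTfidf()
s.add_email(["budget", "review", "budget"])
s.add_email(["review", "meeting"])
print(round(s.tfidf(0, "budget"), 3))   # 2 * ln(2/1) = 1.386
```

      <p>The periodic NMF refresh then consumes this matrix, so the expensive factorization is amortized across many cheap streaming updates.</p>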
    </sec>
    <sec id="sec-18">
      <title>6 CONCLUSIONS</title>
      <p>We have demonstrated approaches for learning task-agnostic
user-conditioned embeddings that outperform strong baselines and
provide value in a range of downstream tasks. User-conditioned
approaches are privacy preserving and show substantial benefits
over global models, despite their lower data density. These
promising results suggest a range of future directions to explore. One
clear next step is to extend our approach to include documents,
meetings, and other enterprise entities. Beyond that, embedding
relationships between entities could help in predicting more
complex connections between them. Next, our explorations in multitask
modeling suggest that generalization across tasks also has value.
Evaluating the impact of multitask representations on new tasks
through leave-one-out experiments may help quantify this.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Azizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Learning heterogeneous knowledge base embeddings for explainable recommendation</article-title>
          .
          <source>Algorithms</source>
          ,
          <volume>11</volume>
          (
          <issue>9</issue>
          ):
          <fpage>137</fpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Learning a hierarchical embedding model for personalized product search</article-title>
          .
          <source>In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>645</fpage>
          -
          <lpage>654</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N. O.</given-names>
            <surname>Amer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Géry</surname>
          </string-name>
          .
          <article-title>Toward word embedding for personalized information retrieval</article-title>
          .
          <source>CoRR, abs/1606.06991</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Balasubramanyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Carvalho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          .
          <article-title>Cutonce-recipient recommendation and leak detection in action</article-title>
          .
          <source>In AAAI, Workshop on Enhanced Messaging</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>TopicMF: Simultaneously exploiting ratings and reviews for recommendation</article-title>
          .
          <source>In Twenty-Eighth AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Bennett</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Carbonell.</surname>
          </string-name>
          <article-title>Combining probability-based rankers for actionitem detection</article-title>
          .
          <source>In Human Language Technologies: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference</source>
          , pages
          <fpage>324</fpage>
          -
          <lpage>331</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>3</volume>
          :
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          , Mar.
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bratko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Filipič</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Lynam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Zupan</surname>
          </string-name>
          .
          <article-title>Spam filtering using statistical data compression models</article-title>
          .
          <source>JMLR</source>
          ,
          <volume>7</volume>
          :
          <fpage>2673</fpage>
          -
          <lpage>2698</lpage>
          , Dec.
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Caruana</surname>
          </string-name>
          .
          <article-title>Multitask learning</article-title>
          .
          <source>Mach. Learn.</source>
          ,
          <volume>28</volume>
          (
          <issue>1</issue>
          ):
          <fpage>41</fpage>
          -
          <lpage>75</lpage>
          ,
          <year>July 1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M. X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tsay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sohn</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Gmail smart compose: Real-time assisted writing</article-title>
          .
          <source>In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, KDD '19</source>
          , pages
          <fpage>2287</fpage>
          -
          <lpage>2295</lpage>
          , New York, NY, USA,
          <year>2019</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Deerwester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Furnas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Harshman</surname>
          </string-name>
          .
          <article-title>Indexing by latent semantic analysis</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>41</volume>
          (
          <issue>6</issue>
          ):
          <fpage>391</fpage>
          -
          <lpage>407</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          . Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1810.04805</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>F.</given-names>
            <surname>Diaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          .
          <article-title>Query expansion with locally-trained word embeddings</article-title>
          .
          <source>CoRR, abs/1605.07891</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dredze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brooks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carroll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Magarick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Blitzer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          .
          <article-title>Intelligent email: Reply and attachment prediction</article-title>
          .
          <source>In Proceedings of the 13th International ACM Conference on Intelligent user interfaces</source>
          , pages
          <fpage>321</fpage>
          -
          <lpage>324</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Heckerman</surname>
          </string-name>
          .
          <article-title>Spam and the ongoing battle for the inbox</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>50</volume>
          (
          <issue>2</issue>
          ):
          <fpage>24</fpage>
          -
          <lpage>33</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Graus</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. Van Dijk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tsagkias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Weerkamp</surname>
          </string-name>
          , and
          <string-name>
            <surname>M. De Rijke</surname>
          </string-name>
          .
          <article-title>Recipient recommendation in enterprises using communication graphs and email content</article-title>
          .
          <source>In Proceedings of the 37th international ACM SIGIR conference on Research &amp; development in information retrieval</source>
          , pages
          <fpage>1079</fpage>
          -
          <lpage>1082</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Hasan</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          .
          <article-title>Automatic keyphrase extraction: A survey of the state of the art</article-title>
          .
          <source>In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>1262</fpage>
          -
          <lpage>1273</lpage>
          , Baltimore, Maryland,
          <year>June 2014</year>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hu</surname>
          </string-name>
          .
          <article-title>Personalized neural embeddings for collaborative filtering with text</article-title>
          .
          <source>arXiv preprint arXiv:1903.07860</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Intel</surname>
          </string-name>
          .
          <source>Intel Math Kernel Library. Reference Manual. Intel Corporation</source>
          , Santa Clara, USA,
          <year>2009</year>
          . ISBN 630813-054US.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Park</surname>
          </string-name>
          .
          <article-title>Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework</article-title>
          .
          <source>Journal of Global Optimization</source>
          ,
          <volume>58</volume>
          (
          <issue>2</issue>
          ):
          <fpage>285</fpage>
          -
          <lpage>319</lpage>
          ,
          <year>Feb 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>B.</given-names>
            <surname>Klimt</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>The Enron corpus: A new dataset for email classification research</article-title>
          .
          <source>In European Conference on Machine Learning</source>
          . Springer,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kottur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Carvalho</surname>
          </string-name>
          .
          <article-title>Exploring personalized neural conversational models</article-title>
          .
          <source>In IJCAI</source>
          , pages
          <fpage>3728</fpage>
          -
          <lpage>3734</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Seung</surname>
          </string-name>
          .
          <article-title>Learning the parts of objects by non-negative matrix factorization</article-title>
          .
          <source>Nature</source>
          ,
          <volume>401</volume>
          :
          <fpage>788</fpage>
          -
          <lpage>791</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Levit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stolcke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Subba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Parthasarathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Anastasakos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Dumoulin</surname>
          </string-name>
          .
          <article-title>Personalization of word-phrase-entity language models</article-title>
          .
          <source>In Proc. Interspeech</source>
          , pages
          <fpage>448</fpage>
          -
          <lpage>452</lpage>
          . ISCA - International Speech Communication Association,
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          .
          <article-title>Dynamic embeddings for user profiling in Twitter</article-title>
          .
          <source>In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          , pages
          <fpage>1764</fpage>
          -
          <lpage>1773</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>W.</given-names>
            <surname>Nawaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-U.</given-names>
            <surname>Khan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.-K.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Personalized email community detection using collaborative similarity measure</article-title>
          .
          <source>arXiv preprint arXiv:1306.1300</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Takasu</surname>
          </string-name>
          .
          <article-title>NPE: neural personalized embedding for collaborative filtering</article-title>
          .
          <source>arXiv preprint arXiv:1805.06563</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          , Doha, Qatar, Oct.
          <year>2014</year>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rattinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Le Goff</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Gütl</surname>
          </string-name>
          .
          <article-title>Local word embeddings for query expansion based on co-authorship and citations</article-title>
          .
          <source>In BIR@ECIR</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rudolph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mandt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Blei</surname>
          </string-name>
          .
          <article-title>Exponential family embeddings</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>478</fpage>
          -
          <lpage>486</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>N.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <article-title>Dropout: A simple way to prevent neural networks from overfitting</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>15</volume>
          :
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.-S.</given-names>
            <surname>Luk</surname>
          </string-name>
          .
          <article-title>Email mining: tasks, common techniques, and tools</article-title>
          .
          <source>Knowledge and Information Systems</source>
          ,
          <volume>41</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>31</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pantel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Gamon</surname>
          </string-name>
          .
          <article-title>Representing text for joint embedding of text and knowledge bases</article-title>
          .
          <source>In EMNLP</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>C.</given-names>
            <surname>Van Gysel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Venanzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rosemarin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kukla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Grudzien</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Cancedda</surname>
          </string-name>
          .
          <article-title>Reply with: Proactive recommendation of email attachments</article-title>
          .
          <source>In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM</source>
          , pages
          <fpage>327</fpage>
          -
          <lpage>336</lpage>
          , New York, NY, USA,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>J. B. P.</given-names>
            <surname>Vuurens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. P.</given-names>
            <surname>de Vries</surname>
          </string-name>
          .
          <article-title>Exploring deep space: Learning personalized ranking in a semantic space</article-title>
          .
          <source>In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS 2016</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>L. Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chopra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          .
          <article-title>StarSpace: Embed all the things!</article-title>
          .
          <source>In AAAI</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          , and
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Awadallah</surname>
          </string-name>
          .
          <article-title>Characterizing and predicting enterprise email reply behavior</article-title>
          .
          <source>In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>235</fpage>
          -
          <lpage>244</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.-C.</given-names>
            <surname>Moon</surname>
          </string-name>
          .
          <article-title>Mining social networks for personalized email prioritization</article-title>
          .
          <source>In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>967</fpage>
          -
          <lpage>976</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>