The Email Analysis Framework: Aiding the Analysis of Personal Natural Language Texts

Faisal Alquaddoomi, UCLA Comp. Science Dept., 4732 Boelter Hall, Los Angeles, CA, USA, faisal@cs.ucla.edu
Cameron Ketcham, Cornell NYC Tech, 111 8th Avenue #302, New York, NY 10011, cketcham@cornell.edu
Deborah Estrin, Cornell NYC Tech, 111 8th Avenue #302, New York, NY 10011, destrin@cs.cornell.edu

ABSTRACT
Free-form email text streams are a rich, yet seldom-tapped, source of information about an individual's internal state. The difficulty in using this source of information is due partially to issues with obtaining and parsing these streams, and partially to the sensitivity of the personal data they may contain. This work presents a framework that allows a user to authorize the acquisition and processing of emails from their Gmail account in order to model the user's use of language. The framework exposes a RESTful HTTPS API through which third-party apps can produce personal analytics for the user from their language model, over which the user maintains fine-grained control by selectively granting access via OAuth2. Candidate applications that consume the language models are discussed, including how they may derive personal analytics from the provided data.

Categories and Subject Descriptors
I.2.7 [Natural Language Processing]: Text analysis

General Terms
Design

1. INTRODUCTION
As we interact with the world, we produce a profusion of data across different modalities. Of particular interest is the data we produce in communicating with other human beings, which, if collected and analyzed, could provide insight into our relationships with others as well as into our own internal state. This data often takes the form of free text, which by its nature is qualitative and thus challenging to analyze with quantitative methods. It is also frequently strewn across various services. Some of these services expose the data for public consumption, as in the case of social networking sites like Twitter, Facebook, or posts on personal blogs. Other services are more private, such as email and text messaging, and special care must be taken to gain access to the data as well as to preserve its privacy.

To summarize, the primary concerns are to securely collect, integrate, and analyze this often sensitive qualitative data. This paper proposes the implementation of a framework, the "Email Analysis Framework" (EAF), that consumes a user's sent email and produces a set of quantitative models and statistics informed by the field of natural language processing. While the scope of the project is currently to collect and process email, the intent is to expand the framework to collect and integrate other sources of free text, for instance from social networking sites. It is hoped that the EAF will be used as a proxy for these qualitative data sources, providing a foundation upon which user-facing tools can be built to derive insights about this data for the individual in a privacy-preserving way.

The EAF is currently under active development, but an alpha version of the tool is available (https://eaf.smalldata.io), as is the source code (https://github.com/falquaddoomi/social_text_processor/). This paper principally describes the structure and design choices involved in acquiring, analyzing, and disbursing sensitive data. Applications are discussed in section 4; these currently consist of a completed sample EAF consumer that produces trivial visualizations, as well as two more significant applications that are in development.
2. APPROACH AND RELATED WORK
As described in [13], the overarching intent of quantifying the self is to collect, integrate, and analyze data streams that may be indicative of an individual's physical, emotional, and psychological state. The purpose of this analysis is to promote awareness of how these measurable quantities both affect and can be affected by the individual's state, and to provide support for decisions that change that state. As mentioned previously, free text is both relatively easy to collect and clearly carries much information about how we feel about others and ourselves; indeed, it has been demonstrated that even our choices of words reflect our psychological state [11]. While this data may be present, it is in an opaque form that must be parsed into usable quantitative data.

The analysis of free text has been extensively addressed in the field of natural language processing (NLP). NLP concerns itself with the broad task of comprehending (that is, unambiguously parsing) and extracting structured information from human language, which is accomplished through two main approaches: rule-based (a.k.a. grammatical) and statistical methods. The EAF primarily makes use of statistical methods, specifically n-gram language modeling, to build a sequence of generative models of an individual's use of language over time.

n-gram models are sufficiently descriptive of an individual's use of language that they can be used to discriminate one author from another purely by comparing descriptive statistics computed over them, such as the entropy or the perplexity of the distributions [10, 14]. Descriptive statistics such as these are of special appeal to privacy because they provide an essential determination about the author without compromising the original content from which the statistic was derived.
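To make these statistics concrete, the following minimal sketch builds an add-one-smoothed bigram model from one token stream and scores another stream against it via cross-entropy (perplexity is simply 2 raised to that value). This is illustrative only; all names and data are hypothetical, and the EAF's actual implementation (built on nltk) may differ.

```python
# Sketch: compare two token "snapshots" via the cross-entropy of a
# smoothed bigram model. Illustrative; not the EAF's actual code.
import math
from collections import Counter

def bigram_model(tokens):
    """Return bigram and unigram counts for a token sequence."""
    return Counter(zip(tokens, tokens[1:])), Counter(tokens)

def cross_entropy(model_tokens, test_tokens, vocab_size):
    """Bits per token of test_tokens under an add-one-smoothed
    bigram model estimated from model_tokens."""
    bigrams, unigrams = bigram_model(model_tokens)
    pairs = list(zip(test_tokens, test_tokens[1:]))
    total = 0.0
    for w1, w2 in pairs:
        # add-one (Laplace) smoothing keeps unseen bigrams finite
        p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)
        total -= math.log2(p)
    return total / len(pairs)

# Hypothetical weekly snapshots of one user's sent-mail tokens
week1 = "thanks for the update i will send the draft tomorrow".split()
week2 = "i will send the revised draft thanks again".split()
vocab_size = len(set(week1) | set(week2))

h = cross_entropy(week1, week2, vocab_size)
print(f"cross-entropy: {h:.2f} bits/token; perplexity: {2 ** h:.1f}")
```

Note that only the aggregate score crosses any trust boundary: the statistic characterizes the author's language without exposing the underlying text.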
A user's email is a unique corpus in that each document (i.e., each email) is tagged with a host of metadata, including the time it was sent. Thus, computing language models over brackets of emails close in time can provide "snapshots" of the evolution of a user's use of language over time. These snapshots can be compared against each other to determine if there are shifts in the style of the user's written communications, which could perhaps correlate with life events. There may be regularities in the changes of these models, or similarities to the models of other people with whom the individual interacts. The snapshots can be filtered by recipient or by communication mode to determine whether the audience or the medium influences the way an individual writes, or whether there are detectable groupings. Many more examples could be proposed for these temporal language models, especially when other sources of time-based data (location, activity, calendar events, etc.) are introduced. One of the EAF's main goals is to provide infrastructure to build and maintain these models, as well as to allow them, and the descriptive statistics derived from them, to be released at the user's discretion for comparison to other data sources.

There are other frameworks which provide similar analytical capabilities, notably the General Architecture for Text Engineering (GATE) [2]. There are also numerous libraries and toolkits [3, 6] that include the same features that the EAF provides; in fact, the EAF makes use of the popular nltk library [1] to perform many of its functions. The EAF differs from these projects in its context: it is a deployable system focused on centralizing the secure acquisition and processing of emails for many users. It provides user-facing administrative interfaces to control it, and app-facing APIs to make use of its results. The EAF's intent is to allow users to make sense of their own data, and it uses a fine-grained, opt-in permission system fully controlled by the user to help protect against malicious or unintended use of the user's email data.

In the context of email analysis, the MIT Media Lab's Immersion project [7] shares the EAF's goal of using one's email for personal insight and self-reflection. Unlike the EAF, the Immersion project restricts itself to analysis of the user's social group by reading the "From" and "To" fields of the email header; no examination of the body text is performed. Further, the output of the Immersion project is an infographic rather than raw data that can be reused by other components, whereas the EAF's purpose is to facilitate analysis by other tools.

3. ARCHITECTURE
The EAF's first task is to transform a user's sent email messages into a series of tokens, where each token is tagged with the time at which it was sent. This series of time-tagged tokens constitutes a "stream", from which the n-gram models mentioned previously are built. The stream is quantized into intervals; the ordering of tokens within these intervals is not preserved from their originating messages (beyond their order within the n-grams), with the express intention of making it difficult to reconstruct the original text. After quantization, the stream is made available at the user's discretion to third-party applications ("consumers"), with the ability for the user to configure per-consumer filters that control what information each consumer can access. A few candidate consumers are discussed in section 4, "Applications". In order to mitigate the danger of storing sensitive user credentials, the EAF makes extensive use of the OAuth2 [4] standard, both as an OAuth2 consumer (of Gmail, currently) and as an OAuth2 provider. The use of OAuth2 also allows the user the freedom to revoke access to the EAF should they wish to discontinue its use, or to revoke access to third-party apps that had been authorized to consume the EAF's API. After the initial synchronization, future emails that the user sends are automatically acquired by the system by periodically polling the provider.
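The quantization step might look like the following minimal sketch, which buckets a stream of (timestamp, token) pairs into fixed weekly intervals. Structure names and the interval length are assumptions for illustration; the EAF's storage schema differs in detail. Only per-interval counts survive, so the original word order cannot be recovered.

```python
# Sketch: quantize time-tagged tokens into intervals, keeping only
# per-interval counts so the originating text is not reconstructible.
from collections import Counter, defaultdict
from datetime import datetime, timedelta

INTERVAL = timedelta(days=7)          # assumed interval length
EPOCH = datetime(2014, 1, 1)          # arbitrary stream origin

def quantize(stream):
    """stream: iterable of (sent_at: datetime, token: str) pairs."""
    buckets = defaultdict(Counter)
    for sent_at, token in stream:
        index = (sent_at - EPOCH) // INTERVAL   # integer interval index
        buckets[index][token] += 1
    return buckets

stream = [
    (datetime(2014, 1, 2), "thanks"),
    (datetime(2014, 1, 3), "draft"),
    (datetime(2014, 1, 9), "thanks"),
]
for index, counts in sorted(quantize(stream).items()):
    print(EPOCH + index * INTERVAL, dict(counts))
```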
3.1 Structure
The EAF consists of three main components, as depicted in figure 1: a web interface through which the user authorizes access to their Gmail account and performs administrative tasks, a task processor which acquires the user's email and produces a token stream from it, and a second web interface which faces consumers of the token stream. Both web interfaces are implemented in Django 1.7, a framework for rapid development of web applications in Python. Authorization to third-party services is facilitated by Django-Allauth, a project that allows Django's built-in authentication system to interoperate with a host of OAuth2 providers, including Gmail. The task processor makes use of Celery, a distributed task queue processor that is often used in concert with Django. Both components communicate via a shared database, specifically PostgreSQL, which was chosen for its performance under heavy, highly concurrent loads.

Figure 1: Structure of the Email Analysis Framework

The framework exposes a RESTful HTTPS interface to allow third-party applications to consume the token stream. The implementation of this interface was aided by the Django-REST-framework, and the specifications of the interface follow the openmHealth DSU specification v1.0 [9]. The user-facing web interface itself makes use of the RESTful interface for querying the token stream. In order to allow registered third-party sites to gain access to the user's email data for analysis and visualization, the EAF acts as an OAuth2 provider; third-party sites must involve the user in their request for a temporary access token, which they can subsequently use to make requests on the user's behalf.

3.2 User Interaction
Prior to using the system, the user first creates an EAF site account which acts as an aggregation point for the multiple email accounts they might want to use. At the moment this account creation is performed automatically when the user authorizes their first email account; the account they authorize (or any other linked account) then implicitly logs them in to their site account, although this behavior is subject to change in the future.

In their interaction with the system, the user proceeds through three stages:

1. Authorization, in which the user is prompted to release temporary credentials used to access their email account via OAuth2.

2. Acquisition, during which the user monitors the progress of the system as it downloads their emails and performs filtering/transformations before inserting them into the database as a stream of tokens.

3. Release, in which the user selects which consumers can access their token stream and what filtering/transformations will be applied for that consumer.

3.2.1 Authorization
The authorization stage is initiated when the user visits the web interface. Using a standard OAuth2 handshake, the user is redirected to Google's Gmail authorization page, where they log in (or use a previously logged-in session) and then accept the permissions which the framework requests, specifically access to the user's email. If the user provides their consent, they are returned to the EAF where they can proceed to acquisition. If the user does not provide consent or some other error occurs, they are returned to the framework with an error message and are prompted to try again. Multiple email accounts can be associated with a single EAF site account, in which case selecting an account from the list of registered accounts begins the next stage, acquisition.

Figure 2: Gmail Authorization
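The handshake itself is delegated to Django-Allauth, so on the EAF side it reduces largely to configuration. A settings fragment along the following lines would request Gmail access together with a refresh token, so that the background task processor can keep polling without the user present; the scope and parameter values shown here are assumptions for illustration, not the EAF's actual settings.

```python
# settings.py (sketch): illustrative Django-Allauth configuration for
# obtaining Gmail access via Google's OAuth2 endpoint.
INSTALLED_APPS = [
    # ... Django and project apps ...
    "allauth",
    "allauth.account",
    "allauth.socialaccount",
    "allauth.socialaccount.providers.google",
]

SOCIALACCOUNT_PROVIDERS = {
    "google": {
        # Assumed scope: full mail access is required to read the
        # Sent Mail folder over IMAP, in addition to basic identity.
        "SCOPE": ["https://mail.google.com/", "email"],
        # access_type=offline yields a refresh token, letting the
        # task processor poll for new mail in the background.
        "AUTH_PARAMS": {"access_type": "offline"},
    }
}
```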
3.2.2 Acquisition

Initial Acquisition. Acquisition starts immediately after authorization and is handled by the background task processor. The user is shown a view of the task's progress which is periodically updated. The process can be quite lengthy, especially when there is a large backlog of messages to process, so the user is permitted to close the view and return to the site at their convenience to check in on the task's progress. Upon completion, the framework sends a notification email which includes the total duration of the acquisition task. At this point, the user can view the results of the acquisition process in a small built-in visualization dashboard that shows a few summarizing statistics about their token stream plotted over time. Incremental acquisition tasks that occur after the initial acquisition do not trigger a notification.

Since the framework is intended to model the user's use of language, and not the language of other individuals with whom the user is conversing, it is necessary to strip quotations and reply text from the emails prior to processing. Isolating only the user's text in sent mail is accomplished through an adapted version of the email_reply_parser library developed by GitHub (https://github.com/github/email_reply_parser).

Ongoing Acquisition. In the background task processor, the acquisition task consists of using the previously-obtained OAuth2 credentials to authenticate to Google's IMAP server. The task then proceeds to download the user's sent email (that is, the contents of "[Gmail]/Sent Mail") in chronological order, skipping messages which have been recorded as processed in a previous iteration of the task. Each email is passed through a series of filters, called the "pre-filter chain", which ultimately results in a sequence of tokens that are associated with the email account, the user's EAF site account, and the time at which the email was sent. By default, the first filter in the chain performs tokenization: each email is split on newlines and punctuation into tokens, which are converted to lowercase to reduce the number of possible tokens due to capitalization differences, and stripped of numbers and quotation marks. The second filter is the "ignored words" filter, which allows the user to selectively prohibit certain words from ever entering the database. At the moment, the ignored words must be manually entered, which makes filtering passwords and other sensitive information problematic, given that the ignored list itself is then sensitive. This will be addressed in the subsection on filter types (3.3).

Figure 3: Mail Acquisition

After the filter chain runs, the tokens are written to the database. Rather than store repeated tokens individually, each token is stored with a count of the number of times it occurred within its message. If the same token occurs in different messages, it is stored separately for each message. This choice was made as a compromise between allowing a flexible choice of the interval into which tokens are combined when the stream is consumed and consuming less space in the database; if the system were designed with a fixed interval rather than a flexible one, the tokens would simply be combined into a histogram for each interval.
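A condensed sketch of the default behavior described above: strip quoted reply text, tokenize, lowercase, remove digits and quotation marks, apply the ignore list, and emit per-message token counts. The EmailReplyParser.parse_reply call matches the interface of that library's Python port; everything else (names, the row layout, the example ignore list) is illustrative rather than the EAF's actual schema.

```python
# Sketch of the default pre-filter chain and per-message count storage.
import re
from collections import Counter
from email_reply_parser import EmailReplyParser  # Python port of GitHub's library

IGNORED_WORDS = {"hunter2"}  # user-maintained ignore list (example entry)

def tokenize(body):
    """Keep only the user's own text, then split into lowercase tokens
    stripped of digits and quotation marks."""
    reply_only = EmailReplyParser.parse_reply(body)
    rough = re.split(r"[\s.,;:!?()]+", reply_only.lower())
    cleaned = (re.sub(r"[\d'\"]+", "", t) for t in rough)
    return [t for t in cleaned if t and t not in IGNORED_WORDS]

def token_rows(message_id, sent_at, body):
    """One row per distinct token per message: (message, time, token, count)."""
    return [(message_id, sent_at, token, count)
            for token, count in Counter(tokenize(body)).items()]

body = "Thanks! I'll send 2 drafts.\n\nOn Mon, Bob wrote:\n> old quoted text"
print(token_rows(42, "2014-01-02T10:00:00Z", body))
```

Because counts are kept per message rather than pre-aggregated, consumers can later combine them into histograms over any interval they choose.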
3.2.3 Release
Once the user has found an EAF-compatible application, they can authorize that application to access their token stream via OAuth2. In this stage the EAF acts as an OAuth2 provider, providing a page to which the third-party application can redirect the user to be prompted for authorization of their identity via Gmail (used also as credentials to access their EAF data) and for permission to access their token stream. In the case where the user has multiple token streams, they will be prompted to choose the streams to which they are granting access. On this page, the user selects a filter chain for each stream that will be used specifically with this consumer, or may simply opt not to filter the stream at all. The process is detailed in figure 4.

Figure 4: Release to Consumer

After this point, the consumer can request updates from the token stream at any time. The EAF audits all accesses, displays the last time of access, and allows the user to revoke the consumer's access at any time or to change the filter chain associated with that consumer.

3.3 Filtering
As previously mentioned, both the acquisition and release stages employ a sequence of filters that allow the input data to be selectively screened for sensitive terms and otherwise transformed to better suit the requirements of the consumers. The acquisition stage's filter chain is referred to as the "pre-filter chain" and the release stage's as the "post-filter chain". There is only a single pre-filter chain, but there can be one post-filter chain for each registered consumer.

The pre-filter chain always has a special "tokenize" filter as its first filter, which produces the initial sequence of tokens for filtering and transformation, and which may only be used in the pre-filter chain. A second special filter that may only be used in the pre-filtering step is the "ignore word sequence" filter, which drops a configured sequence of tokens from the stream; it was initially created to ignore signature blocks. This filter can only function in the pre-filtering step because the exact sequence of the tokens is lost upon insertion into the database.

Aside from the special "tokenize" filter, there are a few other filters which can only be used in the pre-filtering step, namely:

• Parts-of-Speech Tagger, which replaces each token with its detected part of speech (noun, verb, etc.)

• Fork, which produces a stream identical to the current one, but with its own sub-filter chain. The tokens that are produced from a fork are tagged with a unique ID corresponding to that fork.

The "fork" filter is especially useful in conjunction with the parts-of-speech tagger, as both the original text and the parts-of-speech stream can be either individually released or released together, which allows for analysis of the user's grammar. Note that the parts-of-speech stream does preserve the order of tokens in the original stream, but not the text of the tokens themselves.

The filter framework is modular, with the potential to easily add new filters in the future. At the moment, a few parameterizable filters are implemented to change the case of tokens, strip specific characters, and remove words that are either on a user-specified list or not contained within aspell's "en_US" dictionary (detecting and ignoring named entities is a work in progress):

• Change Case, which transforms the case of the tokens;

• Strip Characters, which can be used to remove numbers and other special characters;

• Ignore Listed Words, which removes tokens on an "ignore" list from the token stream; and

• Ignore Non-Dictionary Words, which removes tokens not found in a common English dictionary.

By utilizing the "ignore words" filters, the user is allowed fine-grained control of both the contents of the EAF's database and the views of the token streams presented to different consumers.
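As a sketch of how such a modular chain might be composed, each filter below maps a token list to a token list, and a chain is simply applied in order. Class and parameter names are illustrative, not the EAF's actual interfaces.

```python
# Sketch: a composable filter chain in the spirit of the EAF's
# parameterizable filters. Names are hypothetical.
class ChangeCase:
    def __init__(self, lower=True):
        self.lower = lower
    def __call__(self, tokens):
        return [t.lower() if self.lower else t.upper() for t in tokens]

class StripCharacters:
    def __init__(self, chars="0123456789\"'"):
        self.table = str.maketrans("", "", chars)
    def __call__(self, tokens):
        stripped = (t.translate(self.table) for t in tokens)
        return [t for t in stripped if t]   # drop tokens emptied by stripping

class IgnoreListedWords:
    def __init__(self, ignored):
        self.ignored = set(ignored)
    def __call__(self, tokens):
        return [t for t in tokens if t not in self.ignored]

def run_chain(filters, tokens):
    for f in filters:
        tokens = f(tokens)
    return tokens

# A per-consumer post-filter chain configured by the user
post_filter_chain = [ChangeCase(), StripCharacters(), IgnoreListedWords({"acme"})]
print(run_chain(post_filter_chain, ["Meeting", "at", "ACME", "3pm"]))
# -> ['meeting', 'at', 'pm']
```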
4. APPLICATIONS
As mentioned, third-party applications can gain temporary access to a user's data for the purpose of visualizing or otherwise processing it. Granting this access is currently at the user's discretion; the user should make use of the per-consumer post-filter controls to limit the release of sensitive information to potentially untrustworthy parties. Consumers sign requests to the EAF's RESTful JSON API with an access token, obtained through the process described in the "Release" section (3.2.3) above.

4.1 Example: Mail Visualization Front-End
In order to demonstrate the functionality of the framework, a visualization front-end was developed that consumes the framework's API and produces a few example visualizations. The front-end also serves as a reference implementation of EAF client authentication; it first requests permission from the user before gaining access to their token stream. The visualization front-end currently offers the following modules:

• Word Cloud - a "word cloud" infographic that can be viewed on a per-week basis (figure 5).

• Rhythm - a table with the days of the week as the columns and the hours of the day as the rows, colored according to the number of emails sent within each interval, with darker colors corresponding to more emails sent (a heatmap, essentially; figure 6).

• Alters - a bar chart of the number of people contacted per week; when a week is selected, it displays a bar chart of emails sent to each person.

Figure 5: "Word Cloud" Visualization

Figure 6: "Rhythm" Visualization

These visualization modules are intended to be a jumping-off point for more useful visualizations to be constructed, which would ideally incorporate data from other sources to strengthen inferences about the user's overall state.

4.2 Pulse and Partner
In addition to the sample application discussed above, our group is currently developing two applications that make use of statistics computed against the user's email. The first is "Pulse", which makes use of location traces from the user's smartphone, as well as the frequency and variety of individuals with whom one communicates, to compute a score that indicates how, rather than what, the individual is doing. This score is visualized as a waveform over a short window of time (e.g., a week), which can be shared with family members and friends. The second is "Partner", which is intended to measure the degree to which linguistic style matching occurs among individuals who interact with each other face to face, a fairly well-documented phenomenon [8], [5]. Partner makes use of the location traces of two or more individuals, as well as computed statistics over their emails, to produce two scores, a "proximity" score and a "language-style matching" score, which will be visualized as individual timeseries. A third timeseries will display their computed correlation over time.
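For the language-style-matching score, one common formulation from the style-matching literature [5, 8] compares the rates at which two authors use function-word categories and averages a per-category similarity. The sketch below follows that formulation with truncated example word lists; Partner's actual metric and categories may differ.

```python
# Sketch of a language-style-matching (LSM) score: per function-word
# category, compare two authors' usage rates, then average.
# Word lists here are truncated examples, not Partner's actual lists.
CATEGORIES = {
    "pronouns":     {"i", "you", "we", "he", "she", "they", "it"},
    "articles":     {"a", "an", "the"},
    "conjunctions": {"and", "but", "or", "so"},
}

def rates(tokens):
    """Fraction of tokens falling in each function-word category."""
    n = max(len(tokens), 1)
    return {cat: sum(t in words for t in tokens) / n
            for cat, words in CATEGORIES.items()}

def lsm(tokens_a, tokens_b):
    """Average per-category similarity 1 - |pa-pb|/(pa+pb);
    1.0 means identical function-word usage rates."""
    ra, rb = rates(tokens_a), rates(tokens_b)
    scores = [1 - abs(ra[c] - rb[c]) / (ra[c] + rb[c] + 1e-9)
              for c in CATEGORIES]
    return sum(scores) / len(scores)

a = "i think we should try it and see".split()
b = "but i hope you and i can see the results".split()
print(f"LSM score: {lsm(a, b):.2f}")
```

Computed over each party's filtered token stream per interval, such a score yields exactly the kind of timeseries Partner proposes to correlate with proximity.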
5. CONCLUSION, FUTURE WORK
The Email Analysis Framework, a system for extracting structured, easily-consumed data from time-tagged natural-language text, was proposed in this work. At the moment it is limited to acquiring text from Gmail and to computing and exposing language models to other tools via a RESTful HTTPS API, but it is hoped that it will be extended to other sources of personal natural-language text, such as Facebook and Twitter streams. A few candidate visualizations were described both to demonstrate how the data could be used and to stimulate investigation into more novel applications.

In terms of future work, there are extensions planned to all the stages of the framework. As mentioned, the scope of providers is intended to be expanded to other text providers, which will allow analysis to be performed on how different media affect the language model. Additional streams can be extracted in the processing phase, such as identifying named entities and topics, all of which can be analyzed over time, audience, etc. Industry-standard information extraction techniques such as AutoSlog [12] could be applied to discover meeting arrangements, events that occur to named entities or topics mentioned in the emails, and so on. Sentiment analysis could be computed and exposed as another temporal stream, to attempt to model the user's disposition as a function of time. Additional third-party applications are planned, such as a tool for determining points of inflection in the descriptive statistics computed on the language model, and a tool to easily correlate other time-based data against the statistics streams.

6. REFERENCES
[1] S. Bird. NLTK: The Natural Language Toolkit. In Proceedings of the COLING/ACL on Interactive Presentation Sessions, pages 69-72. Association for Computational Linguistics, 2006.
[2] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: An architecture for development of robust HLT applications. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 168-175. Association for Computational Linguistics, 2002.
[3] D. Ferrucci and A. Lally. UIMA: An architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 10(3-4):327-348, 2004.
[4] D. Hammer-Lahav and D. Hardt. The OAuth2.0 authorization protocol. Technical report, IETF Internet Draft, 2011.
[5] M. E. Ireland, R. B. Slatcher, P. W. Eastwick, L. E. Scissors, E. J. Finkel, and J. W. Pennebaker. Language style matching predicts relationship initiation and stability. Psychological Science, 22(1):39-44, 2011.
[6] A. Mallet. Java-based package for statistical NLP toolkit. Available at (accessed 26.01.10), 2010.
[7] MIT Media Lab's Immersion project. https://immersion.media.mit.edu/. Accessed: 2014-02-04.
[8] K. G. Niederhoffer and J. W. Pennebaker. Linguistic style matching in social interaction. Journal of Language and Social Psychology, 21(4):337-360, 2002.
[9] Ohmage DSU 1.0 specification. https://github.com/openmhealth/developer/wiki/DSU-1.0-Specification. Accessed: 2014-02-04.
[10] F. Peng, D. Schuurmans, V. Keselj, and S. Wang. Automated authorship attribution with character level language models. In 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2003), pages 267-274, 2003.
[11] J. W. Pennebaker, M. R. Mehl, and K. G. Niederhoffer. Psychological aspects of natural language use: Our words, our selves. Annual Review of Psychology, 54(1):547-577, 2003.
[12] E. Riloff and W. Phillips. An introduction to the Sundance and AutoSlog systems. Technical Report UUCS-04-015, School of Computing, University of Utah, 2004.
[13] M. Swan. The quantified self: Fundamental disruption in big data science and biological discovery. Big Data, 1(2):85-99, 2013.
[14] Y. Zhao, J. Zobel, and P. Vines. Using relative entropy for authorship attribution. In Information Retrieval Technology, pages 92-105. Springer, 2006.