The Email Analysis Framework: Aiding the Analysis of Personal Natural Language Texts

Faisal Alquaddoomi, UCLA Comp. Science Dept., 4732 Boelter Hall, Los Angeles, CA, USA, faisal@cs.ucla.edu
Cameron Ketcham, Cornell NYC Tech, 111 8th Avenue #302, New York, NY 10011, cketcham@cornell.edu
Deborah Estrin, Cornell NYC Tech, 111 8th Avenue #302, New York, NY 10011, destrin@cs.cornell.edu

ABSTRACT
Free-form email text streams are a rich, yet seldom-tapped, source of information about an individual's internal state. The difficulty in using this source of information is due partially to issues with obtaining and parsing these streams, and partially to the sensitivity of the personal data they may contain. This work presents a framework that allows a user to authorize the acquisition and processing of emails from their Gmail account in order to model the user's use of language. The framework exposes a RESTful HTTPS API through which third-party apps can produce personal analytics for the user from their language model, over which the user maintains fine-grained control by selectively granting access via OAuth2. Candidate applications that consume the language models are discussed, including how they may derive personal analytics from the provided data.

Categories and Subject Descriptors
I.2.7 [Natural Language Processing]: Text analysis

General Terms
Design

1. INTRODUCTION
As we interact with the world, we produce a profusion of data across different modalities. Of particular interest is the data we produce in communicating with other human beings, which, if collected and analyzed, could provide insight into our relationships with others as well as into our own internal state. This data often takes the form of free text, which by its nature is qualitative and thus challenging to analyze with quantitative methods. It is also frequently strewn across various services. Some of these services expose the data for public consumption, as in the case of social networking sites like Twitter, Facebook, or posts on personal blogs. Other services are more private, such as email and text messaging, and special care must be taken to gain access to the data as well as to preserve its privacy.

To summarize, the primary concerns are to securely collect, integrate, and analyze this often sensitive qualitative data. This paper proposes the implementation of a framework, the "Email Analysis Framework" (EAF), that consumes a user's sent email and produces a set of quantitative models and statistics informed by the field of natural language processing. While the scope of the project is currently to collect and process email, the intent is to expand the framework to collect and integrate other sources of free text, for instance from social networking sites. It is hoped that the EAF will be used as a proxy for these qualitative data sources, providing a foundation upon which user-facing tools can be built to derive insights about this data for the individual in a privacy-preserving way.

The EAF is currently under active development, but an alpha version of the tool is available (https://eaf.smalldata.io), as is the source code (https://github.com/falquaddoomi/social_text_processor/). This paper principally describes the structure and design choices involved in acquiring, analyzing, and disbursing sensitive data. Applications are discussed in section 4; these currently consist of a completed sample EAF consumer that produces trivial visualizations, as well as two more significant applications that are in development.
2. APPROACH AND RELATED WORK
As described in [13], the overarching intent of quantifying the self is to collect, integrate, and analyze data streams that may be indicative of an individual's physical, emotional, and psychological state. The purpose of this analysis is to promote awareness of how these measurable quantities both affect and can be affected by the individual's state, and to provide support for decisions that change that state. As mentioned previously, free text is both relatively easy to collect and clearly carries much information about how we feel about others and ourselves; indeed, it has been demonstrated that even our choices of words reflect our psychological state [11]. While this data may be present, it is in an opaque form that must be parsed into usable quantitative data.

The analysis of free text has been extensively addressed in the field of natural language processing (NLP). NLP concerns itself with the broad task of comprehending (that is, unambiguously parsing) and extracting structured information from human language, which is accomplished through two main approaches: rule-based (a.k.a. grammatical) and statistical methods. The EAF primarily makes use of statistical methods, specifically n-gram language modeling, to build a sequence of generative models of an individual's use of language over time.

n-gram models are sufficiently descriptive of an individual's use of language that they can be used to discriminate one author from another purely by comparing descriptive statistics computed over them, such as the entropy or the perplexity of the distributions [10, 14]. Descriptive statistics such as these are of special appeal to privacy because they provide an essential determination about the author without compromising the original content from which the statistic was derived.
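To make these statistics concrete, the following minimal sketch builds an add-one-smoothed bigram model from one token stream and scores another stream against it via cross-entropy (perplexity is simply 2 raised to that value). This is illustrative only; all names and data are hypothetical, and the EAF's actual implementation (built on nltk) may differ.

```python
# Sketch: compare two token "snapshots" via the cross-entropy of a
# smoothed bigram model. Illustrative; not the EAF's actual code.
import math
from collections import Counter

def bigram_model(tokens):
    """Return bigram and unigram counts for a token sequence."""
    return Counter(zip(tokens, tokens[1:])), Counter(tokens)

def cross_entropy(model_tokens, test_tokens, vocab_size):
    """Bits per token of test_tokens under an add-one-smoothed
    bigram model estimated from model_tokens."""
    bigrams, unigrams = bigram_model(model_tokens)
    pairs = list(zip(test_tokens, test_tokens[1:]))
    total = 0.0
    for w1, w2 in pairs:
        # add-one (Laplace) smoothing keeps unseen bigrams finite
        p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)
        total -= math.log2(p)
    return total / len(pairs)

# Hypothetical weekly snapshots of one user's sent-mail tokens
week1 = "thanks for the update i will send the draft tomorrow".split()
week2 = "i will send the revised draft thanks again".split()
vocab_size = len(set(week1) | set(week2))

h = cross_entropy(week1, week2, vocab_size)
print(f"cross-entropy: {h:.2f} bits/token; perplexity: {2 ** h:.1f}")
```

Note that only the aggregate score crosses any trust boundary: the statistic characterizes the author's language without exposing the underlying text.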
A user's email is a unique corpus in that each document (i.e., each email) is tagged with a host of metadata, including the time it was sent. Thus, computing language models over brackets of emails close in time can provide "snapshots" of the evolution of a user's use of language over time. These snapshots can be compared against each other to determine if there are shifts in the style of the user's written communications, which could perhaps correlate with life events. There may be regularities in the changes of these models, or similarities to the models of other people with whom the individual interacts. The snapshots can be filtered by recipient or by communication mode to determine whether the audience or the medium influences the way an individual writes, or whether there are detectable groupings. Many more examples could be proposed for these temporal language models, especially when other sources of time-based data (location, activity, calendar events, etc.) are introduced. One of the EAF's main goals is to provide infrastructure to build and maintain these models, as well as to allow them, and the descriptive statistics derived from them, to be released at the user's discretion for comparison to other data sources.

There are other frameworks which provide similar analytical capabilities, notably the General Architecture for Text Engineering (GATE) [2]. There are also numerous libraries and toolkits [3, 6] that include the same features that the EAF provides; in fact, the EAF makes use of the popular nltk library [1] to perform many of its functions. The EAF differs from these projects in its context: it is a deployable system focused on centralizing the secure acquisition and processing of emails for many users. It provides user-facing administrative interfaces to control it, and app-facing APIs to make use of its results. The EAF's intent is to allow users to make sense of their own data, and it uses a fine-grained, opt-in permission system fully controlled by the user to help protect against malicious or unintended use of the user's email data.

In the context of email analysis, the MIT Media Lab's Immersion project [7] shares the EAF's goal of using one's email for personal insight and self-reflection. Unlike the EAF, the Immersion project restricts itself to analysis of the user's social group by reading the "From" and "To" fields of the email header; no examination of the body text is performed. Further, the output of the Immersion project is an infographic rather than raw data that can be reused by other components, whereas the EAF's purpose is to facilitate analysis by other tools.

3. ARCHITECTURE
The EAF's first task is to transform a user's sent email messages into a series of tokens, where each token is tagged with the time at which it was sent. This series of time-tagged tokens constitutes a "stream", from which the n-gram models mentioned previously are built. The stream is quantized into intervals; the ordering of tokens within these intervals is not preserved from their originating messages (beyond their order within the n-grams), with the express intention of making it difficult to reconstruct the original text. After quantization, the stream is made available at the user's discretion to third-party applications ("consumers"), with the ability for the user to configure per-consumer filters that control what information each consumer can access. A few candidate consumers are discussed in section 4, "Applications". In order to mitigate the danger of storing sensitive user credentials, the EAF makes extensive use of the OAuth2 [4] standard, both as an OAuth2 consumer (of Gmail, currently) and as an OAuth2 provider. The use of OAuth2 also allows the user the freedom to revoke access to the EAF should they wish to discontinue its use, or to revoke access to third-party apps that had been authorized to consume the EAF's API. After the initial synchronization, future emails that the user sends are automatically acquired by the system by periodically polling the provider.
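The quantization step might look like the following minimal sketch, which buckets a stream of (timestamp, token) pairs into fixed weekly intervals. Structure names and the interval length are assumptions for illustration; the EAF's storage schema differs in detail. Only per-interval counts survive, so the original word order cannot be recovered.

```python
# Sketch: quantize time-tagged tokens into intervals, keeping only
# per-interval counts so the originating text is not reconstructible.
from collections import Counter, defaultdict
from datetime import datetime, timedelta

INTERVAL = timedelta(days=7)          # assumed interval length
EPOCH = datetime(2014, 1, 1)          # arbitrary stream origin

def quantize(stream):
    """stream: iterable of (sent_at: datetime, token: str) pairs."""
    buckets = defaultdict(Counter)
    for sent_at, token in stream:
        index = (sent_at - EPOCH) // INTERVAL   # integer interval index
        buckets[index][token] += 1
    return buckets

stream = [
    (datetime(2014, 1, 2), "thanks"),
    (datetime(2014, 1, 3), "draft"),
    (datetime(2014, 1, 9), "thanks"),
]
for index, counts in sorted(quantize(stream).items()):
    print(EPOCH + index * INTERVAL, dict(counts))
```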
3.1 Structure
The EAF consists of three main components, as depicted in figure 1: a web interface through which the user authorizes access to their Gmail account and performs administrative tasks, a task processor which acquires the user's email and produces a token stream from it, and a second web interface which faces consumers of the token stream. Both web interfaces are implemented in Django 1.7, a framework for rapid development of web applications in Python. Authorization to third-party services is facilitated by Django-Allauth, a project that allows Django's built-in authentication system to interoperate with a host of OAuth2 providers, including Gmail. The task processor makes use of Celery, a distributed task queue processor that is often used in concert with Django. Both components communicate via a shared database, specifically PostgreSQL, which was chosen for its performance under heavy, highly concurrent loads.

Figure 1: Structure of the Email Analysis Framework

The framework exposes a RESTful HTTPS interface to allow third-party applications to consume the token stream. The implementation of this interface was aided by the Django-REST-framework, and the specifications of the interface follow the openmHealth DSU specification v1.0 [9]. The user-facing web interface itself makes use of the RESTful interface for querying the token stream. In order to allow registered third-party sites to gain access to the user's email data for analysis and visualization, the EAF acts as an OAuth2 provider; third-party sites must involve the user in their request for a temporary access token, which they can subsequently use to make requests on the user's behalf.

3.2 User Interaction
Prior to using the system, the user first creates an EAF site account which acts as an aggregation point for the multiple email accounts they might want to use. At the moment this account creation is performed automatically when the user authorizes their first email account; the account they authorize (or any other linked account) then implicitly logs them in to their site account, although this behavior is subject to change in the future.

In their interaction with the system, the user proceeds through three stages:

1. Authorization, in which the user is prompted to release temporary credentials used to access their email account via OAuth2.

2. Acquisition, during which the user monitors the progress of the system as it downloads their emails and performs filtering/transformations before inserting them into the database as a stream of tokens.

3. Release, in which the user selects which consumers can access their token stream and what filtering/transformations will be applied for that consumer.

3.2.1 Authorization
The authorization stage is initiated when the user visits the web interface. Using a standard OAuth2 handshake, the user is redirected to Google's Gmail authorization page, where they log in (or use a previously logged-in session) and then accept the permissions which the framework requests, specifically access to the user's email. If the user provides their consent, they are returned to the EAF where they can proceed to acquisition. If the user does not provide consent or some other error occurs, they are returned to the framework with an error message and are prompted to try again. Multiple email accounts can be associated with a single EAF site account, in which case selecting an account from the list of registered accounts begins the next stage, acquisition.

Figure 2: Gmail Authorization
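The handshake itself is delegated to Django-Allauth, so on the EAF side it reduces largely to configuration. A settings fragment along the following lines would request Gmail access together with a refresh token, so that the background task processor can keep polling without the user present; the scope and parameter values shown here are assumptions for illustration, not the EAF's actual settings.

```python
# settings.py (sketch): illustrative Django-Allauth configuration for
# obtaining Gmail access via Google's OAuth2 endpoint.
INSTALLED_APPS = [
    # ... Django and project apps ...
    "allauth",
    "allauth.account",
    "allauth.socialaccount",
    "allauth.socialaccount.providers.google",
]

SOCIALACCOUNT_PROVIDERS = {
    "google": {
        # Assumed scope: full mail access is required to read the
        # Sent Mail folder over IMAP, in addition to basic identity.
        "SCOPE": ["https://mail.google.com/", "email"],
        # access_type=offline yields a refresh token, letting the
        # task processor poll for new mail in the background.
        "AUTH_PARAMS": {"access_type": "offline"},
    }
}
```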
3.2.2 Acquisition

Initial Acquisition. Acquisition starts immediately after authorization and is handled by the background task processor. The user is shown a view of the task's progress which is periodically updated. The process can be quite lengthy, especially when there is a large backlog of messages to process, so the user is permitted to close the view and return to the site at their convenience to check in on the task's progress. Upon completion, the framework sends a notification email which includes the total duration of the acquisition task. At this point, the user can view the results of the acquisition process in a small built-in visualization dashboard that shows a few summarizing statistics about their token stream plotted over time. Incremental acquisition tasks that occur after the initial acquisition do not trigger a notification.

Since the framework is intended to model the user's use of language, and not the language of other individuals with whom the user is conversing, it is necessary to strip quotations and reply text from the emails prior to processing. Isolating only the user's text in sent mail is accomplished through an adapted version of the email_reply_parser library developed by GitHub (https://github.com/github/email_reply_parser).

Ongoing Acquisition. In the background task processor, the acquisition task consists of using the previously-obtained OAuth2 credentials to authenticate to Google's IMAP server. The task then proceeds to download the user's sent email (that is, the contents of "[Gmail]/Sent Mail") in chronological order, skipping messages which have been recorded as processed in a previous iteration of the task. Each email is passed through a series of filters, called the "pre-filter chain", which ultimately results in a sequence of tokens that are associated with the email account, the user's EAF site account, and the time at which the email was sent. By default, the first filter in the chain performs tokenization: each email is split on newlines and punctuation into tokens, which are converted to lowercase to reduce the number of possible tokens due to capitalization differences, and stripped of numbers and quotation marks. The second filter is the "ignored words" filter, which allows the user to selectively prohibit certain words from ever entering the database. At the moment, the ignored words must be manually entered, which makes filtering passwords and other sensitive information problematic, given that the ignored list itself is then sensitive. This will be addressed in the subsection on filter types (3.3).

Figure 3: Mail Acquisition

After the filter chain runs, the tokens are written to the database. Rather than store repeated tokens individually, each token is stored with a count of the number of times it occurred within its message. If the same token occurs in different messages, it is stored separately for each message. This choice was made as a compromise between allowing a flexible choice of the interval into which tokens are combined when the stream is consumed and consuming less space in the database; if the system were designed with a fixed interval rather than a flexible one, the tokens would simply be combined into a histogram for each interval.
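A condensed sketch of the default behavior described above: strip quoted reply text, tokenize, lowercase, remove digits and quotation marks, apply the ignore list, and emit per-message token counts. The EmailReplyParser.parse_reply call matches the interface of that library's Python port; everything else (names, the row layout, the example ignore list) is illustrative rather than the EAF's actual schema.

```python
# Sketch of the default pre-filter chain and per-message count storage.
import re
from collections import Counter
from email_reply_parser import EmailReplyParser  # Python port of GitHub's library

IGNORED_WORDS = {"hunter2"}  # user-maintained ignore list (example entry)

def tokenize(body):
    """Keep only the user's own text, then split into lowercase tokens
    stripped of digits and quotation marks."""
    reply_only = EmailReplyParser.parse_reply(body)
    rough = re.split(r"[\s.,;:!?()]+", reply_only.lower())
    cleaned = (re.sub(r"[\d'\"]+", "", t) for t in rough)
    return [t for t in cleaned if t and t not in IGNORED_WORDS]

def token_rows(message_id, sent_at, body):
    """One row per distinct token per message: (message, time, token, count)."""
    return [(message_id, sent_at, token, count)
            for token, count in Counter(tokenize(body)).items()]

body = "Thanks! I'll send 2 drafts.\n\nOn Mon, Bob wrote:\n> old quoted text"
print(token_rows(42, "2014-01-02T10:00:00Z", body))
```

Because counts are kept per message rather than pre-aggregated, consumers can later combine them into histograms over any interval they choose.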
3.2.3 Release
Once the user has found an EAF-compatible application, they can authorize that application to access their token stream via OAuth2. In this stage the EAF acts as an OAuth2 provider, providing a page to which the third-party application can redirect the user to be prompted for authorization of their identity via Gmail (used also as credentials to access their EAF data) and for permission to access their token stream. In the case where the user has multiple token streams, they will be prompted to choose the streams to which they are granting access. On this page, the user selects a filter chain for each stream that will be used specifically with this consumer, or may simply opt not to filter the stream at all. The process is detailed in figure 4.

Figure 4: Release to Consumer

After this point, the consumer can request updates from the token stream at any time. The EAF audits all accesses, displays the last time of access, and allows the user to revoke the consumer's access at any time or to change the filter chain associated with that consumer.

3.3 Filtering
As previously mentioned, both the acquisition and release stages employ a sequence of filters that allow the input data to be selectively screened for sensitive terms and otherwise transformed to better suit the requirements of the consumers. The acquisition stage's filter chain is referred to as the "pre-filter chain" and the release stage's as the "post-filter chain". There is only a single pre-filter chain, but there can be one post-filter chain for each registered consumer.

The pre-filter chain always has a special "tokenize" filter as its first filter, which produces the initial sequence of tokens for filtering and transformation, and which may only be used in the pre-filter chain. A second special filter that may only be used in the pre-filtering step is the "ignore word sequence" filter, which drops a configured sequence of tokens from the stream; it was initially created to ignore signature blocks. This filter can only function in the pre-filtering step because the exact sequence of the tokens is lost upon insertion into the database.

Aside from the special "tokenize" filter, there are a few other filters which can only be used in the pre-filtering step, namely:

• Parts-of-Speech Tagger, which replaces each token with its detected part of speech (noun, verb, etc.)

• Fork, which produces a stream identical to the current one, but with its own sub-filter chain. The tokens that are produced from a fork are tagged with a unique ID corresponding to that fork.

The "fork" filter is especially useful in conjunction with the parts-of-speech tagger, as both the original text and the parts-of-speech stream can be either individually released or released together, which allows for analysis of the user's grammar. Note that the parts-of-speech stream does preserve the order of tokens in the original stream, but not the text of the tokens themselves.

The filter framework is modular, with the potential to easily add new filters in the future. At the moment, a few parameterizable filters are implemented to change the case of tokens, strip specific characters, and remove words that are either on a user-specified list or not contained within aspell's "en_US" dictionary (detecting and ignoring named entities is a work in progress):

• Change Case, which transforms the case of the tokens;

• Strip Characters, which can be used to remove numbers and other special characters;

• Ignore Listed Words, which removes tokens on an "ignore" list from the token stream; and

• Ignore Non-Dictionary Words, which removes tokens not found in a common English dictionary.

By utilizing the "ignore words" filters, the user is allowed fine-grained control of both the contents of the EAF's database and the views of the token streams presented to different consumers.
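As a sketch of how such a modular chain might be composed, each filter below maps a token list to a token list, and a chain is simply applied in order. Class and parameter names are illustrative, not the EAF's actual interfaces.

```python
# Sketch: a composable filter chain in the spirit of the EAF's
# parameterizable filters. Names are hypothetical.
class ChangeCase:
    def __init__(self, lower=True):
        self.lower = lower
    def __call__(self, tokens):
        return [t.lower() if self.lower else t.upper() for t in tokens]

class StripCharacters:
    def __init__(self, chars="0123456789\"'"):
        self.table = str.maketrans("", "", chars)
    def __call__(self, tokens):
        stripped = (t.translate(self.table) for t in tokens)
        return [t for t in stripped if t]   # drop tokens emptied by stripping

class IgnoreListedWords:
    def __init__(self, ignored):
        self.ignored = set(ignored)
    def __call__(self, tokens):
        return [t for t in tokens if t not in self.ignored]

def run_chain(filters, tokens):
    for f in filters:
        tokens = f(tokens)
    return tokens

# A per-consumer post-filter chain configured by the user
post_filter_chain = [ChangeCase(), StripCharacters(), IgnoreListedWords({"acme"})]
print(run_chain(post_filter_chain, ["Meeting", "at", "ACME", "3pm"]))
# -> ['meeting', 'at', 'pm']
```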
4. APPLICATIONS
As mentioned, third-party applications can gain temporary access to a user's data for the purpose of visualizing or otherwise processing it. Granting this access is currently at the user's discretion; the user should make use of the per-consumer post-filter controls to limit the release of sensitive information to potentially untrustworthy parties. Consumers sign requests to the EAF's RESTful JSON API with an access token, obtained through the process described in the "Release" section (3.2.3) above.

4.1 Example: Mail Visualization Front-End
In order to demonstrate the functionality of the framework, a visualization front-end was developed that consumes the framework's API and produces a few example visualizations. The front-end also serves as a reference implementation of EAF client authentication; it first requests permission from the user before gaining access to their token stream. The visualization front-end currently offers the following modules:

• Word Cloud - a "word cloud" infographic that can be viewed on a per-week basis (figure 5).

• Rhythm - a table with the days of the week as the columns and the hours of the day as the rows, colored according to the number of emails sent within each interval, with darker colors corresponding to more emails sent (a heatmap, essentially; figure 6).

• Alters - a bar chart of the number of people contacted per week; when a week is selected, it displays a bar chart of emails sent to each person.

Figure 5: "Word Cloud" Visualization

Figure 6: "Rhythm" Visualization

These visualization modules are intended to be a jumping-off point for more useful visualizations to be constructed, which would ideally incorporate data from other sources to strengthen inferences about the user's overall state.

4.2 Pulse and Partner
In addition to the sample application discussed above, our group is currently developing two applications that make use of statistics computed against the user's email. The first is "Pulse", which makes use of location traces from the user's smartphone, as well as the frequency and variety of individuals with whom one communicates, to compute a score that indicates how, rather than what, the individual is doing. This score is visualized as a waveform over a short window of time (e.g., a week), which can be shared with family members and friends. The second is "Partner", which is intended to measure the degree to which linguistic style matching occurs among individuals who interact with each other face to face, a fairly well-documented phenomenon [8], [5]. Partner makes use of the location traces of two or more individuals, as well as computed statistics over their emails, to produce two scores, a "proximity" score and a "language-style matching" score, which will be visualized as individual timeseries. A third timeseries will display their computed correlation over time.
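For the language-style-matching score, one common formulation from the style-matching literature [5, 8] compares the rates at which two authors use function-word categories and averages a per-category similarity. The sketch below follows that formulation with truncated example word lists; Partner's actual metric and categories may differ.

```python
# Sketch of a language-style-matching (LSM) score: per function-word
# category, compare two authors' usage rates, then average.
# Word lists here are truncated examples, not Partner's actual lists.
CATEGORIES = {
    "pronouns":     {"i", "you", "we", "he", "she", "they", "it"},
    "articles":     {"a", "an", "the"},
    "conjunctions": {"and", "but", "or", "so"},
}

def rates(tokens):
    """Fraction of tokens falling in each function-word category."""
    n = max(len(tokens), 1)
    return {cat: sum(t in words for t in tokens) / n
            for cat, words in CATEGORIES.items()}

def lsm(tokens_a, tokens_b):
    """Average per-category similarity 1 - |pa-pb|/(pa+pb);
    1.0 means identical function-word usage rates."""
    ra, rb = rates(tokens_a), rates(tokens_b)
    scores = [1 - abs(ra[c] - rb[c]) / (ra[c] + rb[c] + 1e-9)
              for c in CATEGORIES]
    return sum(scores) / len(scores)

a = "i think we should try it and see".split()
b = "but i hope you and i can see the results".split()
print(f"LSM score: {lsm(a, b):.2f}")
```

Computed over each party's filtered token stream per interval, such a score yields exactly the kind of timeseries Partner proposes to correlate with proximity.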
5. CONCLUSION, FUTURE WORK
The Email Analysis Framework, a system for extracting structured, easily-consumed data from time-tagged natural-language text, was proposed in this work. At the moment it is limited to acquiring text from Gmail and to computing and exposing language models to other tools via a RESTful HTTPS API, but it is hoped that it will be extended to other sources of personal natural-language text, such as Facebook and Twitter streams. A few candidate visualizations were described both to demonstrate how the data could be used and to stimulate investigation into more novel applications.

In terms of future work, there are extensions planned to all the stages of the framework. As mentioned, the scope of providers is intended to be expanded to other text providers, which will allow analysis to be performed on how different media affect the language model. Additional streams can be extracted in the processing phase, such as identifying named entities and topics, all of which can be analyzed over time, audience, etc. Industry-standard information extraction techniques such as AutoSlog [12] could be applied to discover meeting arrangements, events that occur to named entities or topics mentioned in the emails, and so on. Sentiment analysis could be computed and exposed as another temporal stream, to attempt to model the user's disposition as a function of time. Additional third-party applications are planned, such as a tool for determining points of inflection in the descriptive statistics computed on the language model, and a tool to easily correlate other time-based data against the statistics streams.

6. REFERENCES
[1] S. Bird. NLTK: The Natural Language Toolkit. In Proceedings of the COLING/ACL on Interactive Presentation Sessions, pages 69-72. Association for Computational Linguistics, 2006.
[2] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: An architecture for development of robust HLT applications. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 168-175. Association for Computational Linguistics, 2002.
[3] D. Ferrucci and A. Lally. UIMA: An architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 10(3-4):327-348, 2004.
[4] D. Hammer-Lahav and D. Hardt. The OAuth2.0 authorization protocol. Technical report, IETF Internet Draft, 2011.
[5] M. E. Ireland, R. B. Slatcher, P. W. Eastwick, L. E. Scissors, E. J. Finkel, and J. W. Pennebaker. Language style matching predicts relationship initiation and stability. Psychological Science, 22(1):39-44, 2011.
[6] A. Mallet. Java-based package for statistical NLP toolkit. Available at (accessed 26.01.10), 2010.
[7] MIT Media Lab's Immersion project. https://immersion.media.mit.edu/. Accessed: 2014-02-04.
[8] K. G. Niederhoffer and J. W. Pennebaker. Linguistic style matching in social interaction. Journal of Language and Social Psychology, 21(4):337-360, 2002.
[9] Ohmage DSU 1.0 specification. https://github.com/openmhealth/developer/wiki/DSU-1.0-Specification. Accessed: 2014-02-04.
[10] F. Peng, D. Schuurmans, V. Keselj, and S. Wang. Automated authorship attribution with character level language models. In 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2003), pages 267-274, 2003.
[11] J. W. Pennebaker, M. R. Mehl, and K. G. Niederhoffer. Psychological aspects of natural language use: Our words, our selves. Annual Review of Psychology, 54(1):547-577, 2003.
[12] E. Riloff and W. Phillips. An introduction to the Sundance and AutoSlog systems. Technical Report UUCS-04-015, School of Computing, University of Utah, 2004.
[13] M. Swan. The quantified self: Fundamental disruption in big data science and biological discovery. Big Data, 1(2):85-99, 2013.
[14] Y. Zhao, J. Zobel, and P. Vines. Using relative entropy for authorship attribution. In Information Retrieval Technology, pages 92-105. Springer, 2006.