<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Email Analysis Framework: Aiding the Analysis of Personal Natural Language Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Faisal Alquaddoomi</string-name>
          <email>faisal@cs.ucla.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cameron Ketcham</string-name>
          <email>cketcham@cornell.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deborah Estrin</string-name>
          <email>destrin@cs.cornell.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cornell NYC Tech</institution>
          ,
          <addr-line>111 8th Avenue #302, New York, NY 10011</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>UCLA Comp. Science Dept.</institution>
          ,
          <addr-line>4732 Boelter Hall, Los Angeles, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Free-form email text streams are a rich, yet seldom-tapped, source of information about an individual's internal state. The difficulty in using this source of information is due partially to issues with obtaining and parsing these streams, and the sensitivity of the personal data they may contain. This work presents a framework for allowing a user to authorize the acquisition and processing of emails from their Gmail account in order to model the user's use of language. The framework exposes a RESTful HTTPS API for third-party apps to produce personal analytics for the user from their language model, over which the user maintains fine-grained control by selectively granting access via OAuth2. Candidate applications that consume the language models are discussed, including how they may derive personal analytics from the provided data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>As we interact with the world, we produce a profusion of
data across different modalities. Of particular interest is
the data we produce in communicating with other human
beings, which, if collected and analyzed, could provide insight
into our relationships with others as well as our own internal
state. This data often takes the form of free text, which by its
nature is qualitative and thus challenging to analyze with
quantitative methods. It is also frequently strewn across
various services. Some of these services expose the data for
public consumption, as in the case of social networking sites
like Twitter and Facebook, or posts on personal blogs. Other
services are more private, such as email and text messaging,
and special care must be taken to gain access to the data as
well as to preserve its privacy.</p>
      <p>To summarize, the primary concerns are to securely collect,
integrate, and analyze this often sensitive qualitative data.
This paper proposes the implementation of a framework, the
"Email Analysis Framework" (EAF), that consumes a user's
sent email and produces a set of quantitative models and
statistics informed by the field of natural language
processing. While the scope of the project is currently to collect
and process email, the intent is to expand the framework to
collect and integrate other sources of free text, for instance
from social networking sites. It is hoped that the EAF will
be used as a proxy for these qualitative data sources,
providing a foundation upon which user-facing tools can be built
to derive insights about this data for the individual in a
privacy-preserving way.</p>
      <p>The EAF is currently under active development, but an
alpha version of the tool is available<sup>1</sup>, as is the source
code<sup>2</sup>. This paper principally describes the structure and
design choices in acquiring, analyzing, and disbursing
sensitive data. Applications, discussed in Section 4, currently
consist of a completed sample EAF consumer that produces
trivial visualizations as well as two more significant
applications that are currently in development.</p>
    </sec>
    <sec id="sec-2">
      <title>2. APPROACH AND RELATED WORK</title>
      <p>
        As described in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the overarching intent of quantifying
the self is to collect, integrate, and analyze data streams
that may be indicative of an individual's physical, emotional,
and psychological state. The purpose of this analysis is to
promote awareness of how these measurable quantities both
affect and can be affected by the individual's state, and to
provide support for decisions that change that state. As
mentioned previously, free text is both relatively easy to
collect and clearly carries much information about how we
feel about others and ourselves; indeed, it has been
demonstrated that even our choices of words reflect our
psychological state [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. While this data may be present, it is in an
opaque form that must be parsed into usable quantitative
data.
      </p>
      <p>
        The analysis of free text has been extensively addressed in
the field of natural language processing (NLP). NLP concerns
itself with the broad task of comprehending (that is,
unambiguously parsing) and extracting structured
information from human language, which is accomplished through
two main approaches: rule-based (also known as grammatical) and
statistical methods. The EAF primarily makes use of
statistical methods, specifically n-gram language modeling,
to build a sequence of generative models of an individual's
use of language over time.
<fn><label>1</label><p>https://eaf.smalldata.io</p></fn>
<fn><label>2</label><p>https://github.com/falquaddoomi/social_text_processor/</p></fn>
n-gram models are sufficiently descriptive of an individual's
use of language that they can be used to discriminate one
author from another purely by comparing descriptive statistics
computed over them, such as the entropy or the perplexity
of the distributions [
        <xref ref-type="bibr" rid="ref10 ref14">10, 14</xref>
        ]. Descriptive statistics, such as
the entropy of a language model mentioned previously, are
especially appealing from a privacy standpoint because they
provide an essential characterization of the author without compromising
the original content from which the statistic was derived.
A user's email is a unique corpus in that each document (i.e.
email) is tagged with a host of metadata, including the time
it was sent. Thus, computing language models over brackets
of emails close in time can provide "snapshots" of the
evolution of a user's use of language over time. These snapshots
can be compared against each other to determine if there
are shifts in the style of the user's written communications
which could perhaps correlate to life events. There may be
regularities in the changes of these models, or similarities to
the models of other people with whom the individual interacts.
The snapshots can be filtered by recipient or by
communication mode to determine if the audience or medium
determines the way an individual writes, or if there are detectable
groupings. Many more examples could be proposed for these
temporal language models, especially when other sources of
time-based data (location, activity, calendar events, etc.)
are introduced. One of the EAF's main goals is to provide
infrastructure to build and maintain these models, as well
as to allow them, and the descriptive statistics derived from
them, to be released at the user's discretion for comparison
to other data sources.
      </p>
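      <p>The snapshot idea can be illustrated with a minimal sketch, which is not the EAF's implementation. It groups tokens by a caller-supplied time bracket, counts n-grams per bracket, and reports the Shannon entropy and perplexity of each bracket's empirical n-gram distribution. The function names and bracket labels are assumptions made for illustration, and the plain relative-frequency estimate stands in for whatever smoothing a production language model would apply.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams (as tuples) of a token sequence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def entropy(counts):
    """Shannon entropy, in bits, of the distribution implied by the counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def perplexity(counts):
    """Perplexity of the same empirical distribution (2 ** entropy)."""
    return 2 ** entropy(counts)


def snapshot_entropies(messages, n=2):
    """Map each time bracket to the entropy of its n-gram distribution.

    `messages` is an iterable of (bracket, tokens) pairs, where `bracket`
    is any hashable label for a time interval (for example an ISO week).
    n-grams are counted per message, so they never span two messages.
    """
    per_bracket = {}
    for bracket, tokens in messages:
        per_bracket.setdefault(bracket, Counter()).update(ngrams(tokens, n))
    return {bracket: entropy(c) for bracket, c in per_bracket.items()}
```
]]></preformat>
      <p>Only the per-bracket statistics leave these functions, which mirrors the privacy argument made above: two brackets can be compared by their entropies without either bracket's text being released.</p>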
      <p>
        There are other frameworks which provide similar analytical
capabilities, notably the General Architecture for Text
Engineering (GATE) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. There are also numerous libraries and
toolkits [
        <xref ref-type="bibr" rid="ref3 ref6">3, 6</xref>
        ] that include the same features that the EAF
provides; in fact, the EAF makes use of the popular nltk
library [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to perform many of its functions. The EAF differs
from these projects in its context: it is a deployable system
focused on centralizing the secure acquisition and processing
of emails for many users. It provides user-facing
administrative interfaces to control it, and app-facing APIs to make
use of its results. The EAF's intent is to allow users to make
sense of their own data, and it uses a fine-grained opt-in
permission system fully controlled by the user to help protect
against malicious or unintended use of the user's email data.
In the context of email analysis, the MIT Media Lab's
Immersion project [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] shares the EAF's goal of using one's email
for the purpose of personal insight and self-reflection. Unlike
the EAF, the Immersion project restricts itself to analysis of
the user's social group through reading the "From" and "To"
fields of email headers; no examination of the body text is
performed. Further, the output of the Immersion project
is an infographic and not raw data that can be reused by
other components, whereas the EAF's purpose is to
facilitate analysis by other tools.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. ARCHITECTURE</title>
      <p>
        The EAF's first task is to transform a user's sent email
messages into a series of tokens, where each token is tagged with
the time at which its message was sent. This series of time-tagged
tokens constitutes a "stream", from which the n-gram models
mentioned previously are built. The stream is quantized
into intervals; the ordering of tokens within these intervals
is not preserved from their originating messages (outside
of their order in the n-grams), with the express intention
of making it difficult to reconstruct the original text.
After quantization, the stream is made available at the
user's discretion to third-party applications ("consumers"),
with the ability for the user to configure per-consumer
filters that control what information that consumer can access.
A few candidate consumers are discussed in the
"Applications" section (Section 4). In order to mitigate the danger of storing
sensitive user credentials, the EAF makes extensive use of
the OAuth2 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] standard, both as an OAuth2 consumer (of
Gmail, currently) and as an OAuth2 provider. The use of
OAuth2 also allows the user the freedom of revoking access
to the EAF should they wish to discontinue its use, or to
revoke access to third-party apps that had been authorized to
consume the EAF's API. After the initial synchronization,
future emails that the user sends are automatically acquired
by the system by periodically polling the provider.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Structure</title>
      <p>The EAF consists of three main components, as depicted in
Figure 1: a web interface through which the user authorizes
access to their Gmail account and performs administrative
tasks, a task processor which acquires the user's email and
produces a token stream from it, and a second web interface
which faces consumers of the token stream. Both web
interfaces are implemented in Django 1.7, a framework for rapid
development of web applications in Python. Authorization
to third-party services is facilitated by Django-Allauth, a
project that allows Django's built-in authentication system
to interoperate with a host of OAuth2 providers,
including Gmail. The task processor makes use of Celery, a
distributed task queue processor that is often used in concert
with Django. The web interfaces and the task processor communicate via a shared
database, specifically PostgreSQL, which was chosen for its
performance under heavy, highly concurrent loads.</p>
      <p>
        The framework exposes a RESTful HTTPS interface to
allow third-party applications to consume the token stream.
The implementation of this interface was aided by
Django REST framework, and the specification of the interface
follows the Open mHealth DSU specification v1.0 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The
user-facing web interface makes use of the RESTful interface
itself for querying the token stream. In order to allow
registered third-party sites to gain access to the user's email data
for analysis and visualization, the EAF acts as an OAuth2
provider; third-party sites must involve the user in their
request for a temporary access token, which they can
subsequently use to make requests on the user's behalf.
Prior to using the system, the user first creates an EAF site
account which acts as an aggregation point for the multiple
email accounts they might want to use. At the moment this
account creation is performed automatically when the user
authorizes their first email account; the account they
authorize (or any other linked account) then implicitly logs them
in to their site account, although this behavior is subject to
change in the future.
      </p>
      <p>In their interaction with the system, the user proceeds through
three stages:
1. Authorization, in which the user is prompted to
release temporary credentials used to access their email
account via OAuth2.
2. Acquisition, during which the user monitors the progress
of the system as it downloads their emails and
performs filtering/transformations before inserting them
into the database as a stream of tokens.
3. Release, in which the user selects which consumers
can access their token stream and what filtering/transformations
will be applied for that consumer.</p>
      <sec id="sec-4-1">
        <title>3.2.1 Authorization</title>
        <p>The authorization stage is initiated when the user visits
the web interface. Using a standard OAuth2 handshake,
the user is redirected to Google's Gmail authorization page,
where they log in (or use a previously logged-in session) and
then accept the permissions which the framework requests,
specifically access to the user's email. If the user provides
their consent, they are returned to the EAF where they can
proceed to acquisition. If the user does not provide consent
or some other error occurs, they are returned to the
framework with an error message and are prompted to try again.
Multiple email accounts can be associated with a single EAF
site account, in which case selecting an account from the list
of registered accounts begins the next stage, acquisition.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2.2 Acquisition</title>
        <p>Initial Acquisition. Acquisition starts immediately after
authorization and is handled by the background task
processor. The user is shown a view of the task's progress which
is periodically updated. The process can be quite lengthy,
especially in the case where there is a large backlog of
messages to process, so the user is permitted to close the view
and return to the site at their convenience to check in on
the task's progress. Upon completion, the framework sends
a notification email which includes the total duration of the
acquisition task. At this point, the user can view the results
of the acquisition process in a small built-in visualization
dashboard that shows a few summarizing statistics about
their token stream plotted over time. Incremental
acquisition tasks that occur after the initial acquisition do not
trigger a notification.</p>
        <p>Since the framework is intended to model the user's use of
language and not the language of other individuals with
whom the user is conversing, it is necessary to strip
quotations and reply text from the emails prior to processing.
Isolating only the user's text in sent mail is accomplished
through an adapted version of the email reply parser<sup>3</sup>
library, developed by GitHub.<fn><label>3</label><p>https://github.com/github/email_reply_parser</p></fn></p>
        <p>Ongoing Acquisition. In the background task processor,
the acquisition task consists of using the previously-obtained
OAuth2 credentials to authenticate to Google's IMAP server.
The task then proceeds to download the user's sent email
(that is, the contents of "GMail\[Sent Mail]") in
chronological order, skipping messages which have been recorded as
processed in a previous iteration of the task. Each email is
passed through a series of filters, called the "pre-filter chain",
which ultimately results in a sequence of tokens that are
associated with the email account, the user's EAF site account,
and the time at which the email was sent. By default, the
first filter in the chain performs tokenization: each email
is split on newlines and punctuation into tokens, which are
converted to lowercase to reduce the number of possible
tokens due to capitalization differences, and stripped of
numbers and quotation marks. The second filter is the "ignored
words" filter, which allows the user to selectively prohibit
certain words from ever entering the database. At the
moment, the ignored words must be manually entered, which
makes filtering passwords and other sensitive information
problematic, given that the ignored list itself is then
sensitive. This will be addressed in the subsection on filter types,
3.3.</p>
        <p>After the filter chain runs, the tokens are written to
the database. Rather than store repeated tokens
individually, each token is stored with a count of the number of times
it occurred within its message. If the same token occurs in
different messages, it is stored separately for each message. This
choice was made as a compromise between allowing for
flexible choice of the interval into which tokens are combined
when the stream is consumed and consuming less space in
the database; if the system were designed with a fixed
interval rather than a flexible one, the tokens would simply be
combined into a histogram for each interval.</p>
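        <p>The storage compromise described above can be sketched as follows. This is an illustrative approximation rather than the EAF's code: the function names, the tuple layout of a stored row, and the regular expressions are assumptions, and the tokenizer also splits on whitespace, which the text does not state explicitly. Rows are kept per message so that the interval can be chosen when the stream is read.</p>
        <preformat><![CDATA[
```python
import re
from collections import Counter
from datetime import datetime, timedelta


def tokenize(text):
    """Lowercase the text, drop digits and quotation marks, then split on
    any run of whitespace or punctuation."""
    cleaned = re.sub(r"[0-9\"']", "", text.lower())
    return [tok for tok in re.split(r"[\W_]+", cleaned) if tok]


def message_rows(message_id, sent_at, text):
    """One row per distinct token in a message:
    (message_id, sent_at, token, count within that message)."""
    counts = Counter(tokenize(text))
    return [(message_id, sent_at, tok, n) for tok, n in sorted(counts.items())]


def histogram_by_interval(rows, interval, origin):
    """Combine per-message rows into one token histogram per interval.

    `interval` is a timedelta and `origin` the datetime at which interval 0
    starts; the result maps interval index -> Counter of token counts.
    """
    buckets = {}
    for _message_id, sent_at, tok, n in rows:
        index = (sent_at - origin) // interval
        buckets.setdefault(index, Counter())[tok] += n
    return buckets
```
]]></preformat>
        <p>Because the rows are only combined at read time, the same stored data can be viewed with, for example, <monospace>timedelta(days=1)</monospace> or <monospace>timedelta(days=7)</monospace> without reprocessing any email.</p>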
      </sec>
      <sec id="sec-4-3">
        <title>3.2.3 Release</title>
        <p>Once the user has found an EAF-compatible application,
they can authorize that application to access their token
stream via OAuth2. In this stage the EAF acts as an OAuth2
provider, providing a page to which the third-party
application can redirect the user to be prompted for authorization
of their identity via Gmail (used also as credentials to access
their EAF data) and permission to access their token stream.
In the case where the user has multiple token streams, they
will be prompted to choose the streams to which they are
granting access. On this page, the user selects a filter chain
for each stream that will be used specifically with this
consumer, or simply opts not to filter the stream at all. The
process is detailed in Figure 4.</p>
        <p>After this point, the consumer can request updates from
the token stream at any time. The EAF audits all accesses,
displays the last time of access, and allows the user to revoke
the consumer's access at any time or change the filter chain
associated with that consumer.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3.3 Filtering</title>
      <p>As previously mentioned, both the acquisition and release
stages employ a sequence of filters that allow the input
data to be selectively screened for sensitive terms and
otherwise transformed to better suit the requirements of the
consumers. The acquisition stage's filter chain is referred to
as the "pre-filter chain" and the release stage's is the
"post-filter chain". There is only a single pre-filter chain, but there
can be as many as one post-filter chain for each registered
consumer.</p>
      <p>The pre-filter chain always has a special "tokenize" filter as
its first filter, which produces the initial sequence of tokens
for filtering and transformation, and may only be used in
the pre-filter chain. A second special filter that may only be
used in the pre-filtering step is the "ignore word sequence"
filter, which ignores the sequence of tokens configured in the
filter, and was initially created to ignore signature blocks.
This filter can only function in the pre-filtering step as the
exact sequence of the tokens is lost upon insertion into the
database.</p>
      <p>Aside from the special "tokenize" filter, there are a few other
filters which can only be used in the pre-filtering step, namely:</p>
      <sec id="sec-5-1">
        <title>Parts-of-Speech Tagger, which replaces each token with its detected part of speech (noun, verb, etc.)</title>
      </sec>
      <sec id="sec-5-2">
        <title>Fork, which produces an identical stream to the current one, but with its own sub-filter chain. The tokens that are produced from a fork are tagged with a unique ID corresponding to that fork.</title>
        <p>The "fork" filter is especially useful in conjunction with the
part-of-speech tagger, as both the original text and the
parts-of-speech stream can be either individually released or
released together, which allows for analysis of the user's
grammar. Note that the parts-of-speech stream does preserve the
order of tokens in the original stream, but not the text of
the tokens themselves.</p>
        <p>The filter framework is modular, with the potential to add
new filters easily in the future. At the moment, a few
parameterizable filters are implemented to change the case of
tokens, strip specific characters, and to remove words that
are either on a user-specified list or not contained within
aspell's "en_US" dictionary. Detecting and ignoring named
entities is a work in progress.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Change Case, which transforms the case of the tokens;</title>
      </sec>
      <sec id="sec-5-4">
        <title>Strip Characters, which can be used to remove numbers and other special characters;</title>
      </sec>
      <sec id="sec-5-5">
        <title>Ignore Listed Words, which removes tokens on an "ignore" list from the token stream; and</title>
        <p>Ignore Non-Dictionary Words, which removes
tokens not found in a common English dictionary.</p>
        <p>By utilizing the "ignore words" filters, the user is allowed
fine-grained control of both the contents of the EAF's database
and the views of the token streams presented to different
consumers.</p>
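        <p>A modular chain of parameterizable filters of the kind listed above might look like the following sketch. It assumes each filter is a plain function from a token list to a token list and that a chain is an ordered list of (filter, options) pairs; these names and the representation are hypothetical and are not taken from the EAF source.</p>
        <preformat><![CDATA[
```python
def change_case(tokens, upper=False):
    """Change Case: convert every token to upper or lower case."""
    return [t.upper() if upper else t.lower() for t in tokens]


def strip_characters(tokens, chars="0123456789"):
    """Strip Characters: delete the given characters from every token and
    drop tokens that become empty as a result."""
    table = str.maketrans("", "", chars)
    stripped = (t.translate(table) for t in tokens)
    return [t for t in stripped if t]


def ignore_listed_words(tokens, ignored=frozenset()):
    """Ignore Listed Words: remove tokens that appear on the ignore list."""
    return [t for t in tokens if t not in ignored]


def run_chain(tokens, chain):
    """Apply each (filter, options) pair of the chain, in order."""
    for filt, options in chain:
        tokens = filt(tokens, **options)
    return tokens


# A hypothetical post-filter chain configured for one consumer.
example_chain = [
    (change_case, {}),
    (strip_characters, {"chars": "0123456789"}),
    (ignore_listed_words, {"ignored": {"secret"}}),
]
```
]]></preformat>
        <p>Under this representation a different chain can be kept for each consumer, and adding a new filter only requires writing one more function with the same shape.</p>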
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4. APPLICATIONS</title>
      <p>As mentioned, third-party applications can gain temporary
access to a user's data for the purpose of visualizing or
otherwise processing it. Granting this access is currently at
the user's discretion; the user should make use of the
per-consumer post-filter controls to limit the release of
sensitive information to potentially untrustworthy parties.
Consumers sign requests to the EAF's RESTful JSON API with
an access token, obtained through the process described in
the "Release" section (3.2.3) above.</p>
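      <p>From the consumer's side, a request carrying the access token might be assembled as in the sketch below. The endpoint path and the query parameter names are placeholders invented for illustration, since the API's routes are not given here, and presenting the token as an OAuth2 bearer token in the Authorization header is an assumption about how requests are signed.</p>
      <preformat><![CDATA[
```python
from urllib.parse import urlencode
from urllib.request import Request


def build_stream_request(base_url, access_token, start, end):
    """Build (but do not send) a GET request for a slice of a token stream.

    `base_url`, `start` and `end` are hypothetical: the real route and
    parameter names are defined by the EAF's API, not by this sketch.
    """
    query = urlencode({"start": start, "end": end})
    return Request(
        f"{base_url}?{query}",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
    )
```
]]></preformat>
      <p>The request would then be sent over HTTPS with <monospace>urllib.request.urlopen</monospace>; once the user revokes the consumer's access, the same call would be expected to fail with an authorization error.</p>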
    </sec>
    <sec id="sec-7">
      <title>4.1 Example: Mail Visualization Front-End</title>
      <p>In order to demonstrate the functionality of the framework,
a visualization front-end was developed that consumes the
framework's API and produces a few example visualizations.
The front-end also serves as a reference implementation of
EAF client authentication; it first requests permission from
the user before gaining access to their token stream. The
visualization front-end currently offers the following modules:</p>
      <sec id="sec-7-1">
        <title>Word Cloud - a "word cloud" infographic that can be viewed on a per-week basis (Figure 5).</title>
      </sec>
      <sec id="sec-7-2">
        <title>Rhythm - a table, with the days of the week as the columns and the hours of the day as the rows, colored according to the number of emails sent within each interval, with darker colors corresponding to more emails sent (a heatmap, essentially; Figure 6).</title>
      </sec>
      <sec id="sec-7-3">
        <title>Alters - a bar chart of the number of people contacted per week; when a week is selected, it displays a bar chart of emails sent to each person.</title>
        <p>These visualization modules are intended to be a
jumping-off point for more useful visualizations to be constructed,
which would ideally incorporate data from other sources to
strengthen inferences about the user's overall state.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>4.2 Pulse and Partner</title>
      <p>
        In addition to the sample application discussed above, our
group is currently developing two applications that make use
of statistics computed against the user's email. The first is
"Pulse", which makes use of location traces from the user's
smartphone as well as the frequency and variety of
individuals with whom one communicates to compute a score that
indicates how, rather than what, the individual is doing.
This score is visualized as a waveform over a short window
of time (i.e. a week), which can be shared with family
members and friends. The second is "Partner", which is intended
to measure the degree to which linguistic style matching
occurs among individuals who interact with each other face to
face, a fairly well-documented phenomenon [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Partner
makes use of the location traces of two or more individuals
as well as computed statistics over their emails to produce
two scores, a "proximity" and a "language-style matching"
score, which will be visualized as individual timeseries. A
third timeseries will display their computed correlation over
time.
      </p>
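      <p>How Partner computes its language-style matching score is not specified here. As one plausible reading, the sketch below uses the per-category formula from the language style matching literature cited above: for each function-word category, the score is one minus the absolute difference of the two writers' usage rates divided by their sum plus a small constant, and the category scores are averaged. The function names and the dictionary layout of the usage rates are assumptions.</p>
      <preformat><![CDATA[
```python
def category_lsm(rate_a, rate_b):
    """Matching score for one word category, given each writer's usage rate
    (for example, the percentage of their words in that category)."""
    return 1.0 - abs(rate_a - rate_b) / (rate_a + rate_b + 0.0001)


def lsm_score(rates_a, rates_b):
    """Average the per-category scores over the categories both writers have.

    `rates_a` and `rates_b` map category name -> usage rate.
    """
    categories = sorted(rates_a.keys() & rates_b.keys())
    if not categories:
        raise ValueError("no shared categories to compare")
    scores = [category_lsm(rates_a[c], rates_b[c]) for c in categories]
    return sum(scores) / len(scores)
```
]]></preformat>
      <p>Computing this score once per time bracket for a pair of users would yield the kind of timeseries described above, which could then be correlated with a proximity timeseries derived from their location traces.</p>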
    </sec>
    <sec id="sec-9">
      <title>5. CONCLUSION, FUTURE WORK</title>
      <p>
        The Email Analysis Framework, a system for extracting
structured, easily-consumed data from time-tagged
natural-language text, was proposed in this work. At the moment it is
limited to acquiring text from Gmail, computing language models, and
exposing them to other tools via a RESTful HTTPS
API, but it is hoped to be extended to other sources of
personal natural-language text, such as Facebook and Twitter
streams. A few candidate visualizations were described to
both demonstrate how the data could be used and to
stimulate investigation into more novel applications.
In terms of future work, there are extensions planned to
all the stages of the framework. As mentioned, the scope of
providers is intended to be expanded to other text providers,
which will allow analysis to be performed on how different
media affect the language model. Additional streams can
be extracted in the processing phase, such as identifying
named entities and topics, all of which can be analyzed over
time, audience, etc. Industry-standard information
extraction techniques such as AutoSlog [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] could be applied to
discover meeting arrangements, events that occur to named
entities or topics mentioned in the emails, and so on.
Sentiment analysis could be computed and exposed as another
temporal stream, to attempt to model the user's disposition
as a function of time. Additional third-party applications
are planned, such as a tool for determining points of
inflection in the descriptive statistics computed on the language
model, and a tool to easily correlate other time-based data
against the statistics streams.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bird</surname>
          </string-name>
          .
          <article-title>NLTK: the natural language toolkit</article-title>
          .
          <source>In Proceedings of the COLING/ACL on Interactive presentation sessions</source>
          , pages
          <fpage>69</fpage>
          –
          <lpage>72</lpage>
          . Association for Computational Linguistics,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Cunningham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maynard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bontcheva</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Tablan</surname>
          </string-name>
          .
          <article-title>GATE: an architecture for development of robust HLT applications</article-title>
          .
          <source>In Proceedings of the 40th annual meeting on association for computational linguistics</source>
          , pages
          <fpage>168</fpage>
          –
          <lpage>175</lpage>
          . Association for Computational Linguistics,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ferrucci</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Lally</surname>
          </string-name>
          .
          <article-title>UIMA: an architectural approach to unstructured information processing in the corporate research environment</article-title>
          .
          <source>Natural Language Engineering</source>
          ,
          <volume>10</volume>
          (
          <issue>3-4</issue>
          ):
          <fpage>327</fpage>
          –
          <lpage>348</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hammer-Lahav</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Hardt</surname>
          </string-name>
          .
          <article-title>The OAuth 2.0 authorization protocol</article-title>
          .
          <source>Technical report, IETF Internet Draft</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Ireland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Slatcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. W.</given-names>
            <surname>Eastwick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Scissors</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Finkel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Pennebaker</surname>
          </string-name>
          .
          <article-title>Language style matching predicts relationship initiation and stability</article-title>
          .
          <source>Psychological Science</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ):
          <fpage>39</fpage>
          –
          <lpage>44</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mallet</surname>
          </string-name>
          .
          <article-title>Java-based package for statistical NLP toolkit</article-title>
          .
          <source>Available at (accessed 26.01.10)</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <article-title>MIT Media Lab's Immersion project</article-title>
          . https://immersion.media.mit.edu/. Accessed: 2014-02-04.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K. G.</given-names>
            <surname>Niederhoffer</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Pennebaker</surname>
          </string-name>
          .
          <article-title>Linguistic style matching in social interaction</article-title>
          .
          <source>Journal of Language and Social Psychology</source>
          ,
          <volume>21</volume>
          (
          <issue>4</issue>
          ):
          <fpage>337</fpage>
          –
          <lpage>360</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <article-title>Ohmage DSU 1.0 specification</article-title>
          . https://github.com/openmhealth/developer/wiki/DSU-1.0-Specification. Accessed: 2014-02-04.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Keselj</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Automated authorship attribution with character level language models</article-title>
          .
          In
          <source>10th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2003)</source>
          , pages
          <fpage>267</fpage>
          –
          <lpage>274</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Pennebaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Mehl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. G.</given-names>
            <surname>Niederhoffer</surname>
          </string-name>
          .
          <article-title>Psychological aspects of natural language use: Our words, our selves</article-title>
          .
          <source>Annual Review of Psychology</source>
          ,
          <volume>54</volume>
          (
          <issue>1</issue>
          ):
          <fpage>547</fpage>
          –
          <lpage>577</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Riloff</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Phillips</surname>
          </string-name>
          .
          <article-title>An introduction to the Sundance and AutoSlog systems</article-title>
          .
          <source>Technical Report UUCS-04-015</source>
          , School of Computing, University of Utah,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Swan</surname>
          </string-name>
          .
          <article-title>The quantified self: Fundamental disruption in big data science and biological discovery</article-title>
          .
          <source>Big Data</source>
          ,
          <volume>1</volume>
          (
          <issue>2</issue>
          ):
          <fpage>85</fpage>
          –
          <lpage>99</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Vines</surname>
          </string-name>
          .
          <article-title>Using relative entropy for authorship attribution</article-title>
          .
          In
          <source>Information Retrieval Technology</source>
          , pages
          <fpage>92</fpage>
          –
          <lpage>105</lpage>
          . Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>