<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Challenges for using social media for early detection of T2DM</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dane Bell</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Fried</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luwen Huangfu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mihai Surdeanu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephen Kobourov</string-name>
          <email>kobourov@email.arizona.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Arizona</institution>
          ,
          <addr-line>Tucson, AZ 85721</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Twitter and other social media data are used for a wide variety of applications, such as marketing and stock market prediction. Each application, and each domain of social media text, presents its own challenges and benefits. We discuss methods for detecting obesity, a risk factor for Type II Diabetes Mellitus (T2DM), from the language of food on Twitter using community-level data, the peculiarities of this data, and the development of individual-level data for this task. This project supports the implementation of a system that can detect individuals who are likely to be at high risk for preventable T2DM, a life-shortening disease that can lead to fatal complications and is common in the developed world. The system is part of an effort to nudge (Thaler and Sunstein, 2008) individuals at risk for T2DM to change their diet and exercise level in order to prevent or delay the disease's onset. The central hypothesis of this work is that (features of) individuals' tweets about food correlate with their real-world food consumption, which is in turn correlated with their likelihood of developing T2DM. We began by learning to detect obesity, a factor often implicated in the rising rate of T2DM diagnosis in the United States. Through work on community-level data, we found that this hypothesis was supported, but our machine learning model for detecting obesity rates at the state level did not transfer well to individuals. For this reason, we sought to engage with tweeters and other individuals on social media sites to help collect individual data through a novel, 20-questions-style quiz generated semiautomatically from a classifier trained on community-level Twitter data. We discuss our approach to these challenges as well as future directions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Community-level detection</title>
      <p>
        To begin detecting obesity, we started with
models over community-level data, namely cities and US states.
Using the Twitter API, we gathered approximately 3.5 million tweets
containing relevant hashtags such as #dinner and
#breakfast, of which 16% (562,547 tweets) could be assigned a
location within a US state
        <xref ref-type="bibr" rid="ref3">(Fried et al., 2014)</xref>
        .
      </p>
      <p>
        As is typical of Twitter data, our tweets required significant
preprocessing, most importantly the removal of Uniform
Resource Locators (URLs) and @mentions of user handles.
We experimented with different feature sets, including
limiting our features to hashtags, food words, or both. We also
used Latent Dirichlet Allocation (LDA)
        <xref ref-type="bibr" rid="ref2">(Blei et al., 2003)</xref>
        to
mitigate sparsity, with 200 topics added to our feature set. In
all cases, we used Support Vector Machines (SVM) with a
linear kernel
        <xref ref-type="bibr" rid="ref6">(Vapnik, 1998)</xref>
        .
      </p>
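      <p>The preprocessing and feature-selection steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the experiments: the regular expressions, the function names clean_tweet and extract_features, and the three-word food lexicon are all hypothetical.</p>
      <preformat>
```python
import re

# Illustrative patterns only; the exact rules used in the paper are not given.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"#?\w+")


def clean_tweet(text):
    """Remove URLs and @mentions, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())


def extract_features(text, food_words, use_hashtags=True, use_food=True):
    """Return the lowercase hashtag and/or food-word tokens of one tweet."""
    tokens = [t.lower() for t in TOKEN_RE.findall(clean_tweet(text))]
    features = []
    for tok in tokens:
        if tok.startswith("#"):
            if use_hashtags:
                features.append(tok)
        elif use_food and tok in food_words:
            features.append(tok)
    return features


FOOD_WORDS = {"pancakes", "bacon", "kale"}  # hypothetical lexicon
tweet = "Pancakes and bacon with @sam http://t.co/abc123 #breakfast"
print(clean_tweet(tweet))
print(extract_features(tweet, FOOD_WORDS))
```
      </preformat>
      <p>The two boolean flags mirror the three feature sets mentioned in the text (hashtags only, food words only, or both); LDA topics and the linear SVM would be applied to the resulting token counts and are not shown here.</p>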
      <p>A model was trained to predict whether a state was above
or below the national median overweight rate according
to a Kaiser Commission on Medicaid and the Uninsured
(KCMU) analysis<sup>1</sup>. In addition to predicting
community-level obesity with an accuracy of 80%, models trained on
this dataset could predict whether a state had a higher or lower than median
diabetes rate (69% accuracy). Similar models were able to
predict the less obviously related variables of location and
political party affiliation.</p>
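      <p>The binary target used here, above or below the median, can be illustrated with a short sketch. The rates below are invented for the example and are not the KCMU figures; the function names median_split and accuracy are likewise assumptions.</p>
      <preformat>
```python
from statistics import median

# Hypothetical overweight rates in percent; NOT the KCMU data.
rates = {"AZ": 62.0, "CO": 55.5, "MS": 69.0, "NY": 60.5, "TX": 65.0}


def median_split(values):
    """Label each key 1 if its value is above the median, else 0."""
    cutoff = median(values.values())
    return {key: int(value > cutoff) for key, value in values.items()}


def accuracy(predicted, gold):
    """Fraction of keys whose predicted label equals the gold label."""
    hits = sum(predicted[key] == gold[key] for key in gold)
    return hits / len(gold)


gold = median_split(rates)
# Hypothetical classifier output with one error (NY).
predicted = {"AZ": 0, "CO": 0, "MS": 1, "NY": 1, "TX": 1}
print(gold, accuracy(predicted, gold))
```
      </preformat>
      <p>Because a median split yields two classes of nearly equal size, a majority-class baseline on such labels sits close to 50%, which is the natural point of comparison for the accuracies reported above.</p>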
      <p>Different sets of features were optimal for each of these
factors (for example, sometimes all words and sometimes only
food words), although the addition of LDA
topics was beneficial in all cases. Table 1 shows the top 20
features of the SVM model for classification, displaying an
intuitively appropriate correspondence between diets (fried vs.
vegan) and rates of obesity and diabetes.</p>
    </sec>
    <sec id="sec-3">
      <title>Transfer to individuals</title>
      <p>
        Although the previous experiments showed that tweets
about food contained information about our variables of
interest, the performance of the community-trained models
on manually annotated individual Twitter accounts was at
chance level. This made it clear that a corpus of individually
annotated Twitter accounts was necessary for accurate
prediction, so we devised a 20-questions-style quiz site based
on our community-level data to serve two purposes:
evaluation on individuals and data collection for new models.
SVMs do not produce models that are easily converted into
natural-language questions, but tree-based classifiers such
as random forests do. Through further
experimentation, we discovered that a small number of relatively
shallow decision trees with discrete features could perform
comparably to our prior SVM model when predicting
state-level overweight rates
        <xref ref-type="bibr" rid="ref1">(Bell et al., 2016)</xref>
        . These models performed well
(78% accuracy, compared to 80% in our previous work and a baseline
accuracy of 51%) despite their simplicity and interpretability: the
best-performing model used a 7-tree decision forest with a maximum
depth of 3 and three-bin discrete features.
      </p>
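      <p>To make the model family concrete, the sketch below shows three-bin discretization and majority voting over shallow trees. It is an illustration only: the cut points, the three toy trees, and their leaf labels are invented, and the forest is smaller than the 7-tree model reported above.</p>
      <preformat>
```python
def three_bin(value, low_cut, high_cut):
    """Quantize a raw feature value into 0, 1, or 2 using two cut points."""
    if value > high_cut:
        return 2
    if value > low_cut:
        return 1
    return 0


def tree_predict(node, features):
    """Follow tests of the form features[name] > threshold down to a leaf."""
    while isinstance(node, dict):
        passed = features[node["feature"]] > node["threshold"]
        node = node["yes"] if passed else node["no"]
    return node  # boolean leaf: True means "overweight"


def forest_predict(trees, features):
    """Majority vote over the boolean predictions of the trees."""
    votes = sum(tree_predict(tree, features) for tree in trees)
    return 2 * votes > len(trees)


# Hypothetical trees; leaf labels are invented for illustration.
TREES = [
    {"feature": "brunch", "threshold": 1, "yes": False, "no": True},
    {"feature": "curry", "threshold": 1, "yes": False,
     "no": {"feature": "fruit", "threshold": 1, "yes": False, "no": True}},
    {"feature": "fruit", "threshold": 0, "yes": False, "no": True},
]

# Hypothetical raw frequencies and cut points.
raw = {"brunch": 0.5, "curry": 2.0, "fruit": 4.0}
features = {name: three_bin(value, 1.0, 3.0) for name, value in raw.items()}
print(features, forest_predict(TREES, features))
```
      </preformat>
      <p>Restricting every feature to the values 0, 1, and 2 keeps each test readable as a single threshold comparison, which is what makes trees of this kind easy to inspect.</p>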
      <p>
        These trees were converted semiautomatically into
natural-language questions, so that a feature based on the word
brunch became “How often do you eat brunch?” with three
multiple-choice Likert scale
        <xref ref-type="bibr" rid="ref4">(Likert, 1932)</xref>
        answers such as
“Practically never”. Figure 1 illustrates one tree of the
decision forest. The questions that were asked depended
crucially on which answers the quiz takers provided, as in the
tree’s binary choices based on three quantized values, 0, 1,
and 2. Notice that this conversion relies on our
underlying assumption that mentions of a food word are correlated
with the consumption of that food.
      </p>
      <p>Figure 1: One tree of the decision forest, with yes/no tests on curry &gt; 1, fruit &gt; 1, and brunch &gt; 1 leading to the leaves “overweight” and “not overweight”.</p>
      <p>Table 1: Highest-weighted features for the two classes of the overweight and diabetes classifiers.</p>
      <p><sup>1</sup> http://kff.org/other/state-indicator/adult-overweightobesity-rate/</p>
      <p>The 20-questions-style
quiz based on these questions allowed us to evaluate how
well the community-trained classifier applied to
individuals. Its accuracy was high (79%) but
nevertheless lower than the baseline (82%) in our highly biased
sample. We interpret this to indicate that, as in our initial
experiments, individuals are not highly representative of their
regions (or vice versa), meaning that individual-level training
information is necessary for good individual classification.</p>
      <p>Fortunately, by engaging with quiz takers, the site also
afforded an opportunity to collect individual-level training
data. Quiz takers optionally provided their (public) social
media accounts and their height and weight, from which
we can calculate body mass index (BMI), as well as other
demographic information: location, age, and gender. With
permission, this will allow data collection from these accounts
to train more direct, individual-based classifiers.</p>
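      <p>The quiz mechanics described in this section can be sketched as follows. This is a hedged illustration rather than the deployed site: the tree, the answer labels other than “Practically never”, the function names, and the BMI cutoff of 25 (the conventional adult overweight threshold, which the text does not state) are assumptions.</p>
      <preformat>
```python
# "Practically never" appears in the text; the other two labels are
# hypothetical stand-ins. List positions give the quantized values 0, 1, 2.
ANSWERS = ["Practically never", "Sometimes", "Often"]


def question_for(word):
    """Turn a word feature into a natural-language quiz question."""
    return f"How often do you eat {word}?"


def run_quiz(node, answer_fn):
    """Ask only the questions on the path selected by the answers so far."""
    asked = []
    while isinstance(node, dict):
        question = question_for(node["word"])
        choice = answer_fn(question, ANSWERS)  # expected to return 0, 1, or 2
        asked.append((question, ANSWERS[choice]))
        node = node["yes"] if choice > node["threshold"] else node["no"]
    return node, asked


def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms over height in metres squared."""
    return weight_kg / height_m ** 2


# Hypothetical tree; its shape and leaves do not reproduce Figure 1.
TREE = {
    "word": "curry", "threshold": 1, "yes": "not overweight",
    "no": {"word": "brunch", "threshold": 1,
           "yes": "not overweight", "no": "overweight"},
}

# Scripted answers stand in for a real quiz taker.
scripted = {question_for("curry"): 0, question_for("brunch"): 2}
prediction, asked = run_quiz(TREE, lambda question, options: scripted[question])

# Self-reported height and weight give an individual-level gold label.
gold = "overweight" if bmi(90.0, 1.75) >= 25.0 else "not overweight"
print(prediction, gold, len(asked))
```
      </preformat>
      <p>In this toy run the prediction and the self-reported label disagree, which is exactly the kind of individual-level evidence the quiz is meant to collect for training later models.</p>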
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>
        The greatest challenges for obesity detection are familiar
from other NLP work. There is the persistent problem of
non-human accounts (e.g., businesses, organizations, and
bots), which add noise to the training data. The signal
competing with that noise is also imperfect, notably in its
sparsity, since the average user has on the order of hundreds
of tweets, of which only a very small percentage concern
food. However, tweets that do not mention food may still
be useful for obesity detection, much in the way that food
tweets can significantly predict political affiliation
        <xref ref-type="bibr" rid="ref3">(Fried et al., 2014)</xref>
        through indirect cultural connections.
      </p>
      <p>
        Future work will include taking more information into
account in the models. With individual-level data, we can
capitalize on users’ locations, photos, user handles, bios, and
ages, all of which are informative, though optional, parts of
a Twitter profile. With these as well as features generated
from the tweets themselves, classifiers can be constructed
for intermediate factors such as gender, which will in turn
add valuable features for obesity classification. This
will improve our ability to develop a valuable public health
tool for detecting and preventing T2DM efficiently through
social media.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Bell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fried</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huangfu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Surdeanu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Kobourov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Towards using social media to identify individuals at risk for preventable chronic illness</article-title>
          .
          <source>In LREC</source>
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>3</volume>
          :
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Fried</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Surdeanu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobourov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hingle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Bell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Analyzing the language of food on social media</article-title>
          .
          <source>In 2014 IEEE International Conference on Big Data (Big Data)</source>
          , pages
          <fpage>778</fpage>
          -
          <lpage>783</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Likert</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>1932</year>
          ).
          <article-title>A technique for the measurement of attitudes</article-title>
          .
          <source>Archives of psychology</source>
          ,
          <volume>140</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>55</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Thaler</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sunstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Nudge: Improving Decisions about Health, Wealth, and Happiness</article-title>
          . Yale University Press.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V. N.</given-names>
          </string-name>
          (
          <year>1998</year>
          ).
          <article-title>Statistical learning theory</article-title>
          , volume
          <volume>1</volume>
          . Wiley New York.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>