Answering What If, Should I
                         and Other Expectation Exploration Queries
                        Using Causal Inference over Longitudinal Data
                                  Emre Kıcıman                                               Jorgen Thelin
                              Microsoft Research                                           Microsoft Research
                             emrek@microsoft.com                                         jthelin@microsoft.com
ABSTRACT                                                               their illnesses and coping strategies [8, 13]. People report and share
Many people use web search engines for expectation exploration:        this information for many reasons: keeping in touch with friends,
exploring what might happen if they take some action, or how           gaining social capital, diary-keeping, or even helping others. And
they should expect some situation to evolve. While search engines      with increasing use of personal sensors and devices, from exercise
have databases to provide structured answers to many questions,        trackers to health monitors, such data streams are becoming more
there is no database about the outcomes of actions or the evolution    regular, more detailed and more reliable [4, 26, 32]. These longitu-
of situations. The information we need to answer such questions,       dinal data streams, in aggregate, capture a rich set of relationships
however, is already being recorded. On social media, for example,      between the situations in which people find themselves, the actions
hundreds of millions of people are publicly reporting about the        they choose to take, and the outcomes they experience.
actions they take and the situations they are in, and an increasing        We describe Outcomes Engine, a system for analyzing such large-
range of events and activities experienced in their lives over time.   scale longitudinal data to characterize how situations evolve over
Here, we show how causal inference methods can be applied to such      time, and to capture the consequences of people’s actions. Given a
data to generate answers for expectation exploration queries. This     query representing some target action T , Outcomes Engine iden-
paper describes a system implementation for running ad-hoc online      tifies individuals who have reported doing T , and compares their
causal inference analyses. The analysis results can be used to gen-    subsequent experiences to peers who did not report doing T . This
erate pros/cons lists for decision support, timeline representations   comparison results in an expectation map detailing “what changes
to show how situations evolve, and be embedded in many other           to expect” over time due to T . A key aspect of Outcomes Engine
decision support and planning applications. We discuss potential       is its use of causal inference methods to compare the two sets of
methods for evaluating the fundamental quality of inference results    individuals so as to isolate the specific consequences of T from
and judge the short-term and long-term usefulness of information       subsequent experiences that are correlated with, but not due to T .
for users.                                                                 The expectation maps generated by Outcomes Engine are an
                                                                       important building block for a wide variety of data-driven search
1     INTRODUCTION                                                     and decision-support applications—from automatically generating
                                                                       decision aids, such as pros and cons lists, to helping individuals
Everyone, at some point in their lives, finds themselves in an un-     ground their experiences in how a situation is likely to evolve over
familiar situation, considering what they should do, and trying        time (cf. Figure 1). In addition, expectation maps may be useful for
to understand what to expect of the future. We see such expecta-       policy makers’ and scientists’ explorations across a variety of do-
tion exploration occurring in web searches, with people exploring      mains. In this paper, we discuss our approach and prototype system,
possible consequences of their choices and the outcomes of situa-      several application scenarios, as well as evaluation challenges and
tions. These explorations cover both consequential topics, such as     strategies.
life-changing education and career choices (e.g., “Should I join the
military?”) or major financial and personal decisions (e.g., “Should
I move to California?”); as well as more quotidian topics, such as
the consequences of purchase decisions, athletic training regimens     2 BACKGROUND AND RELATED WORK
and dating rituals.                                                    2.1 Expectation Exploration Tasks
    The answers to these questions are not readily available in a      Exploring expectations on the Internet plays an important role in
knowledge base or Wikipedia. But, the information necessary to         people’s planning, decision-making, and forecasting for both ev-
answer these questions is already being recorded on social media,      eryday and extraordinary scenarios. These explorations encompass
where hundreds of millions of individuals regularly and publicly       a broad variety of tasks, including explorations of hypothetical,
report their personal experiences, including the situations they       ongoing or past problems, or seeking informational support, emo-
are in, the actions they take, and the experiences they have after-    tional satisfaction, or preparation for a future event. Taxonomies
wards. For example, people talk about work or relations [12, 15],      of web search activities classify these as information gathering
health and dietary practices [1, 38], and even log information about   tasks, which encompass 35% to 80% of people’s web searches [17, 36].
                                                                       Expectation exploration may also be considered as a temporal web
DESIRES 2018, August 2018, Bertinoro, Italy                            query, where time is relative to individually experienced timelines,
© 2018 Copyright held by the author(s).                                rather than, for example, a calendar date or global event [5, 7].


[Figure 1 panels: (a) Timeline answer to “Do people buy new cars after a raise?”, contrasting luxury cars and small cars; (b) Pros / cons list for “Should I get a dog?” (pros: love the dog, enjoy walks; cons: early wake up, scratched furniture); (c) Conversational agent asked “Hey, I sprained my ankle badly. When will I play football again?” and answering “People start to mention playing football after 8 weeks.”]


  Figure 1: Interface mockups: Expectation exploration tasks may be satisfied with a variety of information presentations.


   Decision-making processes in particular depend critically on              2.2      Causal Inference
such information gathering—especially in unfamiliar situations—              In this paper, we propose to analyze individual-level longitudinal
where the web augments more conventional information sources                 datasets with causal inference methods to directly identify what
such as professional and friends’ advice, training, etc. In 2004, Rose       can be expected following some action or individual experience.
and Levinson measured advice-related searches as 2-5% of web                 We believe this can provide a semi-structured representation of
search tasks [34]. Bailey et al. find that decision-related tasks—           expectations that can be used in a wide variety of ways to aid
including comparing ( 9%) and planning ( 2%)—constitute a signif-            individuals’ planning, decision-making, and forecasting.
icant portion of overall web tasks. Lagan et al. find that even in              Because we are interested in using our analysis results to aid
pregnancy—a scenario with dedicated information infrastructures,             decision-making—essentially an intervention—our goal is funda-
related health professionals and care programs—over 80% of women             mentally one of causal inference. While we do not believe we can
used web search to help make decisions [23].                                 achieve the ideal identification of causal relationships, we can use
   Though there are online resources and crowdsourced methods                methods borrowed from the causal inference literature to reduce
for exploring some scenarios, extracting outcomes from aggregated            the bias of naïve correlational analyses. Here, we give a brief intro-
personal data streams has many distinct advantages [22, 25]. First,          duction to potential outcomes, one framework for causal reason-
results are grounded in the real experiences of users who have               ing [35].
taken an action, potentially leading to more reliable results than              In the potential outcomes framework, whether some experi-
simply reading advice from web pages. Second, a question may                 ence “causes” an outcome is computed by comparing two potential
be too rare for anyone to have written advice about it,                      outcomes: one outcome Yi (T = 1) after a person i has a target
but there is still plenty of social data to answer it via data mining.       experience T¹, and another outcome Yi (T = 0) when the same
For example, someone may ask whether to move to one city vs.                 person in an identical context does not have the experience. The
another. Web pages may exist to answer such a question for some              causal effect of T is then Yi (T = 1) − Yi (T = 0). Of course, it is
city pairs, but not for all. In contrast, we need only look at social        impossible to observe both Yi (T = 1) and Yi (T = 0) for the same
postings from people who have moved to one city vs. the other                individual i. Once we observe i having the experience or not, we
and compare their postings to see the relative benefits of each.             cannot observe the other, counterfactual outcome.
Third, an answer may be contextually dependent on the asker. The                Thus, the problem of causal inference is, in a sense, a problem of
methods presented in this paper can potentially be extended to               missing data, and causal inference techniques attempt to address
provide answers personalized to the asker.                                   this challenge by estimating the missing counterfactual outcome for
   Once an expectation map has been extracted for a scenario, it             an individual based on the outcomes of other, similar individuals. A
can be embedded in many distinct presentations and applications              common method for estimating missing counterfactual outcomes
to provide the asker with a high-level overview of the implications          is to find pairs (generalizing to groups) of individuals in the ob-
of a choice or evolution of a situation. For example, a timeline view        servational data whose covariates are statistically very similar to
may show how outcomes evolve over time (Figure 1-a). Another                 one another, but where one has received a treatment and the other
application, specifically for decision support, is an automatically          has not. Each individual’s matched partner then provides the basis
generated pros/cons list [20] (Figure 1-b). The resultant data could         for estimating a counterfactual outcome for that individual. We
also be used within a conversational agent (Figure 1-c).                     describe our specific method in Section 3.2.
   While our work may benefit individuals who wish to understand                Prior research demonstrates the feasibility of this approach in
their situations and the possible implications of their actions, there       high-dimensional settings (such as our proposed analysis of social
is also an opportunity to use this kind of analysis to better under-         media and sensor data). For example, Eckles and Bakshy reduced
stand behavioral phenomena of societal importance, third-party               bias in an observational study by 97% compared to a naive analysis,
interventions and other policy questions. As well, while we focus            as measured against a gold-standard randomized field experiment,
on analysis of timelines of individual people’s experiences, such            by conditioning on high-dimensional covariate data [11].
analyses may also be applied to event timelines of other kinds [2],
subject to sufficient data availability and assumptions.                     1 In medical and social sciences literature, the target experience is often called the
                                                                             treatment, and is compared to a control or placebo experience. Following this convention,
                                                                             we will use the terms treated group and control group in this paper.
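
As a compact restatement of the potential outcomes comparison in Section 2.2, the quantities involved can be written as below. The symbols \tau_i, N_1 and m(i) (the matched control individual for treated individual i) are notation introduced here for illustration only. The matching estimator shown is one common textbook form and is not necessarily the exact estimator used by Outcomes Engine.

```latex
\begin{align}
  \tau_i &= Y_i(T=1) - Y_i(T=0)
    && \text{(individual effect; never fully observed)} \\
  \mathrm{ATT} &= \mathbb{E}\left[\, Y_i(T=1) - Y_i(T=0) \mid T_i = 1 \,\right]
    && \text{(average effect on the treated)} \\
  \widehat{\mathrm{ATT}} &= \frac{1}{N_1} \sum_{i \,:\, T_i = 1}
      \left( Y_i(T=1) - Y_{m(i)}(T=0) \right)
    && \text{(observed outcome of match } m(i) \text{ fills in the counterfactual)}
\end{align}
```

For instance, if a treated individual reports an outcome value of 3 and their matched control reports 1, that pair contributes an estimated effect of 2 to the average.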


2.3    Social and Online Data Analyses                                   Input Query. Asking a question to explore expectations following
Longitudinal studies of online data, including social media data         an action or event requires identification of individuals who have
and search query logs, have proven effective in helping understand       performed a particular action or experienced a particular situation.
the behaviors of people in various situations. These studies have        The pattern for identifying messages about this experience, then,
been targeted to explore and understand how situations evolve over       is the fundamental input query we expect. Our prototype relies on
time, identify predictive factors involved in positive and negative      explicit textual mentions of actions and situations and, in our design,
outcomes, and help identify at-risk individuals. For example, using      we allow a boolean phrase query, with some wildcard support, for
search query logs, Paul et al. [31] characterize the information seek-   identifying targeted experiential phrases.
ing behavior during various phases of prostate cancer. Fourney et        Expectation Maps. Expectation maps represent the time-varying
al. [14] align search query logs with the natural clock of gestational   effects of an experience or treatment over a population of people.
physiology of pregnant women to characterize their changing in-          An expectation map for a treatment can be represented as a 2D
formation needs. Althoff et al. study 5 years of fitness tracking data   matrix, where each row is an outcome word or topic, each column
to better understand social influence on physical activity [3].          represents an epoch of time (e.g., hours or days since treatment).
    By mining social media, De Choudhury et al. [9] find behavioral      Each cell represents the effect of the treatment on a specific outcome
cues useful to predict the risk of depression before onset. Simi-        during a specific epoch. The effect itself includes measurements
larly, by leveraging these naturalistic data, prior work examined        of effect size and statistical significance, and can be extended to
how dietary habits vary across locations [1]; the links between          include details of heterogeneous effects.
diseases, drugs, and side-effects [27, 30]; links between actions and
outcomes [20]; shifts in suicidal ideation [10]; and how alcohol
usage in early college affects long-term outcomes [19]. Olteanu et       3.2    Causal Inference Method
al. demonstrate propensity score analysis of social media timelines      In our system, we use a stratified propensity score analysis to esti-
to understand outcomes across a broad set of domains [28].               mate missing counterfactual outcomes by identifying matching sub-
                                                                         populations of individuals with similar distributions of covariates,
3     CAUSAL INFERENCE-BASED MAPPING OF                                  but with differing treatment status. Given a set of social media mes-
      EXPECTATIONS                                                       sages, we apply a preprocessing step to generate a set of per-user
We present our approach to mapping expectations from social              timelines. Once a query is issued, we identify the users that have
datasets. First, we present our basic design data requirements and       mentioned the treatment experience and place them in a treated
assumptions, followed by our definition of a query and result repre-     group, and place all other users in a control group. We align user
sentation. Then, we present our method for extracting expectations       timelines based on when the individual mentioned experiencing the
by applying causal inference over social data sets. We use causal in-    treatment. We align the control users based on a random “placebo”
ference for this purpose to remove merely correlated outcomes and        time. To reduce the effects of temporal biases, we assign placebo
focus on outcomes directly caused by an action or treatment. This        times to match the distribution of treatment times.
is particularly important for applications that will be performing in-       Stratification is achieved by estimating each individual’s likeli-
terventions (including decision-support applications for individuals     hood of being in the treated group using a propensity score model.
and policy makers).                                                      This is a learned function that infers likelihood of being in the
                                                                         treated group as a function of a set of covariates (i.e., individual
3.1    Basic Design                                                      properties and past tweets that might influence both treated/control
                                                                         status and outcomes). Individuals with similar propensity scores
Data. The fundamental requirement our approach places on data is
                                                                         are grouped into strata. In aggregate, individuals within a stratum are
that they provide a longitudinal view of the actions and experiences
                                                                         likely to have similar covariates, allowing us to isolate and estimate
of individuals. Thus, at a minimum, input data observations must
                                                                         the effects of the treatment itself within each stratum. Note that the
include a user id and datetime in addition to observational content
                                                                         primary purpose of the propensity score model is to identify groups
(e.g., message text).
                                                                         of individuals with similar covariates—the accuracy of predicting
   We focus our prototype implementation on social media data for
                                                                         group status is secondary. To ensure the quality of counterfactual
several reasons. First, social media data provides high-dimensional
                                                                         estimates, the method drops strata that have either too few treated
and cross-domain coverage, allowing a broad variety of query topics
                                                                         or too few control users. Outcomes are aggregated across remain-
and increasing the likelihood of observing statistical confounders
                                                                         ing strata, weighted by the size of the treated population in each
that would otherwise bias an analysis. Second, the textual nature
                                                                         stratum, to estimate the average effect of treatment on the treated
of social media data is relatively interpretable. Third, social media
                                                                         population.
data is available at large-scale and captures individual activities
                                                                             The details of our analysis are as follows:
over long periods of time. Beyond social media, our framework may
                                                                         Covariate and outcome features: The content of social media
be applied to other kinds of data sources. For example, personal sensors and
                                                                         messages from before the treatment (or placebo) time, as well as
other services may be supported, though treatment identification
                                                                         other user properties (posting frequencies, message lengths, pro-
and result interpretation in our framework would require adap-
                                                                         file information, etc.) are extracted as covariates—potentially con-
tation. Search query histories are particularly promising, as past
                                                                         founding features that may influence both treatment status and
analyses have demonstrated the potential for longitudinal analysis
                                                                         outcomes. The content of social media messages after the treatment
of search histories [3, 14, 29, 33].


(or placebo) are extracted as the time-varying outcome measures             [Figure 2 diagram: for each token (w0, w1, ...), an array of
of the treatment.                                                           occurrence timestamps t0, t1, ..., ti, ..., tk-1, tk. The
   We represent social media message content in our covariate and           treatment/placebo time splits each array: w1 occurs i-1 times
outcome features as empirical, unsmoothed word likelihoods. We              in the covariate window and k-i times in the outcome window.]
limit our word distributions to the top 50k unigrams in our cor-
pus. We do not remove stopwords, stem or normalize the text, and                       Figure 2: Timeline data structure
use whitespace and punctuation to identify word-breaks. Option-
ally, given a word-to-topic mapping, we combine outcome word
likelihoods to generate the total topic likelihood.
Propensity score modeling: We implement our high-dimensional
propensity score analysis as a logistic regression with 10-fold cross-
validation. Our analysis divides users into 100 strata and removes
strata with too few Treated users, too few Control users, or both. In
practice, this removes the lowest-propensity strata and the highest-
propensity strata, leaving the middle strata in these analyses. The         nodes. Then, it applies a supervised algorithm to learn a model
outcome differences in these remaining strata are weighted accord-          of the propensity of users to be treated. This learned model is
ing to the Treated population distribution and combined to estimate         distributed across all the data nodes.
the average treatment effect on the Treated group.                          Timeline Server. The Timeline server stores, for each user, a com-
   While we borrow propensity score analysis from the causal in-            pressed representation of the timeline of token occurrences (the
ference literature, our application of this technique is not a causal       unigrams, bigrams, or phrases mentioned by users). Given a treat-
analysis, as two key assumptions may not hold: First, all confound-         ment (or placebo) time for a user, the timeline server can quickly
ing variables must be included in the observed covariates. Yet, while       return a summary representation of covariates, or a summary rep-
high-dimensional propensity score analyses, such as ours, are more          resentation of outcomes. Figure 2 shows a sketch of the simple
likely to capture those variables correlated with confounding vari-         timeline data structure. For each token that has been used by a user,
ables, it is difficult to argue that all relevant aspects of individuals’   we use a binary search to identify the array index of the treatment
lives are captured in their Twitter streams. Second, the stable unit        time, and compute the number of occurrences of the token from
treatment value assumption (SUTVA) must hold—that is, one per-              the index value. Simple extensions allow us to calculate the number
son’s outcome must be independent of whether another person had             of occurrences within arbitrary time windows.
the target experience. Additional domain knowledge is required to           Outcome Aggregator. The Outcome aggregator is responsible for
assert these assumptions.                                                   gathering the partially aggregated outcomes from data nodes, iden-
                                                                            tifying strata to drop due to lack of comparable subpopulations, and
                                                                            performing a weighted aggregation of outcomes across remaining
4    OUTCOMES ENGINE ARCHITECTURE                                           strata. In addition, the Outcome aggregator runs diagnostics on the
To execute online ad-hoc causal inference analyses over large-scale         analysis, such as covariate balance and other validity tests.
datasets, we must provide scalable implementations for treatment            Request flow. As shown in Figure 3, when a request arrives from
identification, covariate and outcome extraction, and propensity            an application to the query node, the query node first forwards
score modeling. We use a two-tiered approach to our cluster design:         the query to all data nodes (step 1), where the Treatment ID server
1) User data is distributed randomly across data nodes, with all            identifies the treated and control groups and individuals’ treatment
data from a single user assigned to a single node. Each data node           and placebo times (step 2). Then, each Timeline server featurizes
consists of a Treatment Identification server and a Timeline server.        the covariates for these users and returns these covariates and their
2) A centralized query node is responsible for distributing queries         treated/control labels to a Model Builder in the centralized query
across all data nodes, centralized building of the propensity score         node (step 3). If the treatment and control groups are very large,
model, and aggregating stratified outcomes.                                 they can be downsampled to improve end-to-end performance.
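
The Timeline server lookup described in this section (a binary search over each token's sorted array of occurrence timestamps) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `TokenTimeline`, its method names, the uncompressed in-memory lists, and the convention that an occurrence exactly at the treatment time counts as an outcome are all assumptions made here.

```python
from bisect import bisect_left, insort


class TokenTimeline:
    """Hypothetical per-user timeline: token -> sorted occurrence timestamps."""

    def __init__(self):
        self._times = {}  # token -> sorted list of timestamps

    def add(self, token, timestamp):
        # Keep each token's timestamp array sorted so it can be binary-searched.
        insort(self._times.setdefault(token, []), timestamp)

    def split_counts(self, token, treatment_time):
        """Return (covariate_count, outcome_count) around treatment_time.

        Assumption: occurrences strictly before treatment_time are covariates;
        occurrences at or after it are outcomes.
        """
        times = self._times.get(token, [])
        # Index of the first occurrence with t >= treatment_time.
        index = bisect_left(times, treatment_time)
        return index, len(times) - index

    def count_in_window(self, token, start, end):
        """Count occurrences with start <= t < end (half-open window)."""
        times = self._times.get(token, [])
        return bisect_left(times, end) - bisect_left(times, start)


# Usage: four mentions of "dog", with a treatment (or placebo) time of 4.
timeline = TokenTimeline()
for t in (5, 1, 9, 3):
    timeline.add("dog", t)
before, after = timeline.split_counts("dog", 4)
```

Because the arrays are sorted, each count costs two binary searches at most, which is what makes arbitrary time windows cheap to compute per token.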
Treatment ID Server. The Treatment ID server provides an index                 The Model Builder collects these covariate and label data from
over the full text of text messages. Given a query (the treatment           all the replication nodes, dynamically learns a propensity score
identification pattern), the treatment ID server uses the index to          model and sends the model to all of the Timeline servers (step 4).
return the user ID and treatment time for users who have posted a           Each Timeline server applies the propensity score model to assign
message matching the query. In addition, the Treatment ID server            users to strata, scan over outcomes experienced by each user and
returns a sample of the remainder of the population to be used as a         partially aggregate the outcomes. These outcomes are returned
control group. These user IDs are each returned with an assigned            to the Outcome Aggregator on the centralized query node (step
placebo time. The size of the control sample is given as a multiple         5). These outcomes from all data nodes are aggregated and then
of the treatment population size. The larger the control population,        returned to the app user (step 6).
the more likely that there will be similar users (i.e., better matches)
between the treated and control populations. The trade-off is that          5   APPLICATIONS AND EVALUATION
analyzing a larger control population will require more time.               Our work can be seen as part of the broader trend in search systems
Model Builder. The Model Builder collects the covariates and                of bridging the online and physical worlds [6]. Using social media
treatment/control status of users (or samples of users) from all Data       as a longitudinal sensor into people’s experiences, we build a digital


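[Figure 3 diagram: a query node (Model Builder, Outcome Aggregator) and data nodes (Treatment ID Server, Timeline Server). Numbered arrows: (1) query sent to the data nodes; (2) treatment times passed from the Treatment ID Server to the Timeline Server; (3) covariates and T/C status sent to the Model Builder; (4) model sent to the Timeline Server; (5) partially aggregated outcomes sent to the Outcome Aggregator; (6) results returned.]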


                                                            Figure 3: Outcomes Engine
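
The stratified aggregation performed by the Outcome Aggregator (drop strata that lack comparable subpopulations, then weight each remaining stratum's outcome difference by its treated population) might look like the sketch below for a single outcome in a single epoch. It is illustrative only: the function name `stratified_att`, the `min_per_group` threshold, and the use of equal-width propensity bins are assumptions made here; the paper specifies 100 strata but not how their boundaries are chosen or how many users count as "too few".

```python
from collections import defaultdict


def stratified_att(users, n_strata=100, min_per_group=1):
    """Estimate the average treatment effect on the treated (ATT).

    users: iterable of (propensity, treated, outcome) tuples, where
           propensity is in [0, 1], treated is a bool, outcome is a float.
    Returns the weighted estimate, or None if every stratum was dropped.
    """
    # Assumption: equal-width bins over the propensity score.
    strata = defaultdict(lambda: {True: [], False: []})
    for propensity, treated, outcome in users:
        index = min(int(propensity * n_strata), n_strata - 1)
        strata[index][treated].append(outcome)

    weighted_sum = 0.0
    total_treated = 0
    for groups in strata.values():
        treated, control = groups[True], groups[False]
        # Drop strata with too few treated or too few control users.
        if len(treated) < min_per_group or len(control) < min_per_group:
            continue
        difference = sum(treated) / len(treated) - sum(control) / len(control)
        # Weight each stratum by the size of its treated population.
        weighted_sum += len(treated) * difference
        total_treated += len(treated)

    if total_treated == 0:
        return None
    return weighted_sum / total_treated


# Usage with 10 strata: the stratum near 0.95 has no control users and is dropped.
example_users = [
    (0.15, True, 1.0), (0.12, False, 0.0), (0.18, False, 0.5),
    (0.55, True, 2.0), (0.55, True, 4.0), (0.51, False, 1.0),
    (0.95, True, 10.0),
]
estimate = stratified_att(example_users, n_strata=10)
```

Running the same computation once per outcome word (or topic) and per epoch would fill in the cells of an expectation map; significance testing and balance diagnostics are omitted from this sketch.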


representation of the consequences of actions and situations. A                     In general, we have found that displaying samples of the underly-
key component to ensuring the interpretability and usefulness of                    ing supporting evidence—i.e., messages written by individuals who
this information for improved exploration and decision-making is                    have had an experience and a particular consequence—provides
how and when applications present this information and enable                       significant help in interpreting results and understanding potential
interaction. In this section, we discuss some of the considerations                 underlying causal mechanisms for an outcome [28]. Beyond these
for applications and how they might be evaluated.                                   domain-agnostic presentations of textual data, domain-specific ap-
                                                                                    plications may utilize additional domain knowledge and context to
                                                                                    improve interpretability.
5.1    Applications
Applications for Individuals. First, we believe that individuals                    5.2    Evaluation Strategies
may benefit from the kind of outcomes we uncover. For instance,                     We propose three key criteria for evaluating the quality of expec-
prior work on online health communities indicates that new pa-                      tation maps and their use: the correctness of the expectations; the
tients seek experience-based information from others in similar situ-               interpretability of results; and the overall usefulness of the informa-
ations for advice, or to validate their feelings or life decisions [13, 16].        tion for searchers.
In such a scenario, our work can support users in exploring the type                Correctness. In prior work, we measured the surface validity of
of issues others in similar situations are likely to have experienced               results of our analysis across a broad variety of domains (including
as a consequence. Further, even when the outcomes of an action                      in health, business, and society topics), based on manual annotation
or situation are known, aggregated statistics about their likelihood                of outcomes by crowd-workers [28]. Here we briefly summarize
can prove informative for those seeking information about them.                     our evaluation method and key results. Each specific expectation—
Apart from helping individuals understand new situations, infor-                    a relationship between a single experience and a single outcome
mation about potential outcomes can also be used to support them                    of the experience—was shown to workers, with a question about
in achieving goals or making decisions.                                             whether a person who had the given experience would be more
   Figure 1 shows user interface sketches that present expectation                  likely to talk about the outcome in the future. To aid interpretability,
maps in different forms. The timeline representation, shown in Fig-                 we provided workers with two pairs of text examples of experience
ure 1-a, can help users understand how outcomes evolve following                    and outcome messages, and links to web search results for the
an action or experience. A list of pros and cons may be better suited               experience and consequence. With these annotations, we measured
in decision-support scenarios to ensure that the decision-maker                     the precision of results @N (ranked by effect size).
is aware of the most important consequences, good and bad, of a                        Figure 4 shows the precision variation at different cut-offs across
choice. Conversational assistants may use expectation maps to aid                   experiments. We notice a drop of 10-20% in precision from the
topical chit chat and banter, as well as provide more direct advice                 top 5 to the top 20 outcomes—with the median precision dropping
and information support.                                                            from close to 80% to about 50%, followed by a slower overall de-
Applications for Policy-makers & Scientists. While our work                         cay. Yet, even after the top 30, the discovered outcomes attain an
is motivated primarily by the desire to help individuals under-                     average perceived precision of over 50%. These results have two
stand their situations and the possible implications of their actions               main takeaways: overall, the discovered outcomes tend to attain
on an as-needed basis, there is also an opportunity to use this kind of             good precision scores across experiences, which correlate with their
analysis to better understand behavioral phenomena of societal                      effect size. Separately, we find that P@10 varies across domains—
importance, third-party interventions and other policy questions.                   ranging from over 55% to 100% on average per domain—and that
Further, large, quantitative analyses such as ours can complement                   the perceived precision varies strongly with the data volume it was
small-scale qualitative or survey-based studies of social phenomena                 computed on. This partially explains the variance of P@10 across
(e.g., see [8, 18]), and vice-versa. Insights about topics of interest              domains. However, other factors, such as errors in the semantic
may inform what questions are being asked, while insights on tem-                   interpretation of words and domain-specific biases in the likelihood
poral dynamics may be used to align survey answers with                             of users to mention certain outcomes might also play a role.
time-dependent episodes [14].                                                          Beyond evaluating the surface validity of results, another method
   Across all of these potential uses of expectation maps by individ-               for evaluating the correctness of expectation maps is prediction
uals and policy-makers, there are important questions about how                     over held-out data. If our predictions are reliable, our treatment
searchers interact with this information and how to best support                    effect estimates should match those seen in held-out data. Finally, as a
their tasks, their exploration and their understanding of this data.                truly end-to-end test of accuracy, we may consider asking searchers


                                                                          Acknowledgments
                                                                          This work builds on ideas developed over several years through
                                                                          collaborations and discussions with many people. We would like to
                                                                          thank many colleagues, including Paul Bennett, Scott Counts, Mun-
                                                                          mun De Choudhury, Susan Dumais, Adam Fourney, Myeongjae
                                                                          Jeon, Alexandra Olteanu, Matt Richardson, Michael Lowell Roberts,
                                                                          Du Su, James Thomas, Onur Varol, Ryen White, Xinhao Yuan, Li-
                                                                          dong Zhou, and Brian Zill.

                                                                          REFERENCES
                                                                           [1] Sofiane Abbar, Yelena Mejova, and Ingmar Weber. 2015. You tweet what you eat:
                                                                               Studying food consumption through twitter. In Proc. of ACM CHI. 3197–3206.
Figure 4: Variations in precision across top N outcomes. The               [2] Omar Alonso, Serge-Eric Tremblay, and Fernando Diaz. 2017. Automatic Genera-
boxplots summarize the precision@N across 39 distinct situ-                    tion of Event Timelines from Social Data. In Proceedings of the 2017 ACM on Web
                                                                               Science Conference. ACM, 207–211.
ations in 9 domains within health, business and society top-               [3] Tim Althoff, Pranav Jindal, and Jure Leskovec. 2017. Online actions with offline
ics. Red lines represent the median, while dots represent the mean.            impact: How online social networks influence online and offline user behavior.
                                                                               In Proc. of ACM WSDM. ACM, 537–546.
                                                                           [4] Kat Austen. 2015. What could derail the wearables revolution? Nature 525 (2015).
                                                                           [5] Ricardo Baeza-Yates. 2005. Searching the future. In SIGIR Workshop MF/IR.
to see how their experiences evolved, and how well that matches            [6] Wolfgang Büschel, Annett Mitschick, and Raimund Dachselt. 2018. Here and
                                                                               Now: Reality-Based Information Retrieval: Perspective Paper. In Proceedings of
our mined expectations.                                                        the 2018 Conference on Human Information Interaction & Retrieval. ACM, 171–180.
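As an illustration of the precision@N measure used in the Correctness evaluation, the following minimal sketch ranks the outcomes of each experience by effect size, computes the fraction of the top N that annotators judged valid, and summarizes the values across experiences by median and mean, as in Figure 4. The data layout, the field names (`effect_size`, `valid`), the function names, and the example values are all hypothetical and chosen only for illustration; this is not the code of the Outcomes Engine or of the evaluation in [28].

```python
from statistics import mean, median


def precision_at_n(outcomes, n):
    """Precision@N: share of the N largest-effect outcomes annotated as valid."""
    # Rank by estimated effect size, largest first (assumed ranking rule).
    ranked = sorted(outcomes, key=lambda o: o["effect_size"], reverse=True)
    top = ranked[:n]
    if not top:
        return 0.0
    # Divide by the number of outcomes actually available, which may be < n.
    return sum(1 for o in top if o["valid"]) / len(top)


def summarize(experiences, n):
    """Median and mean precision@N across all experiences."""
    values = [precision_at_n(outcomes, n) for outcomes in experiences.values()]
    return {"median": median(values), "mean": mean(values)}


# Hypothetical annotations for two made-up experiences (not real results).
experiences = {
    "experience_a": [
        {"outcome": "o1", "effect_size": 0.9, "valid": True},
        {"outcome": "o2", "effect_size": 0.7, "valid": True},
        {"outcome": "o3", "effect_size": 0.5, "valid": False},
        {"outcome": "o4", "effect_size": 0.2, "valid": True},
    ],
    "experience_b": [
        {"outcome": "o5", "effect_size": 0.8, "valid": True},
        {"outcome": "o6", "effect_size": 0.6, "valid": False},
        {"outcome": "o7", "effect_size": 0.4, "valid": False},
        {"outcome": "o8", "effect_size": 0.1, "valid": True},
    ],
}

for cutoff in (2, 4):
    print(cutoff, summarize(experiences, cutoff))
```

Dividing by the number of outcomes actually retrieved, rather than by N, is one possible convention when an experience has fewer than N discovered outcomes; the source does not say which convention was used.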
Interpretability. While results may be technically correct, searchers      [7] Ricardo Campos, Gaël Dias, Alípio M Jorge, and Adam Jatowt. 2015. Survey of
are more likely to be successful if the results they see are quickly           temporal information retrieval and related applications. ACM Computing Surveys
                                                                               (CSUR) 47, 2 (2015), 15.
and easily interpretable. Methods for improving interpretability can       [8] Wen-Ying Sylvia Chou, Yvonne Hunt, Anna Folkers, and Erik Augustson. 2011.
rely on exploration, supporting evidence and context, as mentioned             Cancer survivorship in the age of YouTube and social media: a narrative analysis.
                                                                               Journal of medical Internet research 13, 1 (2011).
above. While evaluating the interpretability of results presents many      [9] Munmun De Choudhury, Michael Gamon, Scott Counts, and Eric Horvitz. 2013.
challenges and is left largely for future work, we believe it will ben-        Predicting Depression via Social Media. In Proc. of AAAI ICWSM.
efit from earlier methods developed for quantitative and qualitative      [10] Munmun De Choudhury, Emre Kiciman, Mark Dredze, Glen Coppersmith, and
                                                                               Mrinal Kumar. 2016. Discovering shifts to suicidal ideation from mental health
evaluation of search quality [21, 24, 37].                                     content in social media. In Proc. of ACM CHI. 2098–2110.
Usefulness. To truly understand the benefits of this information          [11] D. Eckles and E. Bakshy. 2017. Bias and high-dimensional adjustment in observa-
for end users, however, we must perform end-to-end studies of                  tional studies of peer effects. ArXiv e-prints (June 2017). arXiv:stat.ME/1706.04692
                                                                          [12] Kate Ehrlich and N Sadat Shami. 2010. Microblogging Inside and Outside the
the usefulness of the results in improving people’s outcomes—e.g.,             Workplace. In AAAI Conf. on Weblogs and Social Media.
are searchers more confident in their choices and making better           [13] Jordan Eschler, Zakariya Dehlawi, and Wanda Pratt. 2015. Self-Characterized
                                                                               Illness Phase and Information Needs of Participants in an Online Cancer Forum.
decisions? For this purpose, we recommend long-running user                    In Proc. of AAAI Conf. on Web and Social Media.
studies and surveys that capture the situations people are exploring,     [14] Adam Fourney, Ryen W White, and Eric Horvitz. 2015. Exploring time-dependent
why they are exploring them (whether for immediate decision-                   concerns about pregnancy and childbirth from search logs. In Proc. of the ACM
                                                                               CHI. 737–746.
making, for long-term planning, or simply out of curiosity), and          [15] Venkata Rama Kiran Garimella, Ingmar Weber, and Sonya Dal Cin. 2014. From "I
later come back to the user and ask them about how this information            love you babe" to "leave me alone"-Romantic Relationship Breakups on Twitter.
affected their behavior, choices, and possibly even outcomes.                  In Conf. on Social Informatics. Springer, 199–215.
                                                                          [16] Jina Huh and Mark S Ackerman. 2012. Collaborative help in chronic disease
                                                                               management: supporting individualized problems. In Proceedings of the ACM
6   CONCLUSIONS                                                                2012 conference on Computer Supported Cooperative Work. ACM, 853–862.
                                                                          [17] Bernard J Jansen, Danielle L Booth, and Amanda Spink. 2008. Determining the
As computing devices continue to become more embedded in our                   informational, navigational, and transactional intent of Web queries. Information
everyday lives, they are mediating an increasing number of our                 Processing & Management 44, 3 (2008), 1251–1266.
                                                                          [18] Lloyd D Johnston, Patrick M O’Malley, Jerald G Bachman, and John E Schulenberg.
interactions with the world around us. From helping people search              2011. Monitoring the Future national survey results on drug use, 1975-2010.
for the best product to buy, to recommending a restaurant we are               Volume I: Secondary school students. (2011).
                                                                          [19] Emre Kıcıman, Scott Counts, and Melissa Gasser. 2018. Using Longitudinal
likely to enjoy, computing services enable users to evaluate op-               Social Media Analysis to Understand the Effects of Early College Alcohol Use. In
tions and take action with “one click”. While such services model              ICWSM-18. AAAI.
many facets of the options they present, they do not model the            [20] Emre Kıcıman and Matthew Richardson. 2015. Towards decision support and
                                                                               goal achievement: Identifying action-outcome relationships from social media.
higher-level implications and trade-offs inherent in deciding to               In Proc. ACM KDD. 547–556.
take one action instead of another. By aggregating the combined           [21] Shirlee-ann Knight and Janice Burn. 2005. Developing a framework for assessing
experiences of hundreds of millions of people, our search services             information quality on the World Wide Web. Informing Science 8 (2005).
                                                                          [22] Nicolas Kokkalis, Thomas Köhn, Johannes Huebner, Moontae Lee, Florian Schulze,
have an opportunity to provide significant assistance to individuals           and Scott R Klemmer. 2013. Taskgenies: Automatically providing action plans
in their expectation explorations and decision-making. Integrating             helps people complete tasks. ACM Transactions on Computer-Human Interaction
                                                                               (TOCHI) 20, 5 (2013), 27.
causal inference as a fundamental piece of this analysis allows us        [23] Briege M Lagan, Marlene Sinclair, and W George Kernohan. 2010. Internet use
to capture consequences of actions and situations, which enables our           in pregnancy informs women’s decision making: a web-based survey. Birth
search services to be better integrated into interventions, such as            37, 2 (2010), 106–115.
                                                                          [24] Dmitry Lagun and Eugene Agichtein. 2011. Viewser: Enabling large-scale remote
decision-support, planning, and advice scenarios, where correla-               user studies of web search examination and interaction. In Proceedings of the 34th
tional analyses may be too risky given consequential outcomes.                 international ACM SIGIR conference on Research and development in Information


     Retrieval. ACM, 365–374.
[25] Edith Law and Haoqi Zhang. 2011. Towards large-scale collaborative planning:
     Answering high-level search queries using human computation. In AAAI.
[26] Andrew Meola. 2016. Wearables and mobile health app usage has surged by 50%
     since 2014. http://www.businessinsider.com/fitbit-mobile-health-app-adoption-
     doubles-in-two-years-2016-3. (2016). [Online; Accessed 27-July-2016].
[27] Mark Myslín, Shu-Hong Zhu, Wendy Chapman, and Mike Conway. 2013. Using
     Twitter to examine smoking behavior and perceptions of emerging tobacco
     products. Journal of medical Internet research 15, 8 (2013).
[28] Alexandra Olteanu, Onur Varol, and Emre Kıcıman. 2017. Distilling the outcomes
     of personal experiences: A propensity-scored analysis of social media. In Proc. of
     CSCW 2017. ACM, 370–386.
[29] John Paparrizos, Ryen W White, and Eric Horvitz. 2016. Screening for pancreatic
     adenocarcinoma using signals from web search logs: Feasibility study and results.
     Journal of Oncology Practice 12, 8 (2016), 737–744.
[30] Michael J Paul and Mark Dredze. 2011. You are what you Tweet: Analyzing
     Twitter for public health. In Proc. of AAAI ICWSM. 265–272.
[31] Michael J Paul, Ryen W White, and Eric Horvitz. 2015. Diagnoses, decisions, and
     outcomes: Web search as decision support for cancer. In Proc. of WWW. ACM.
[32] Andrew Perrin. 2015. Social media usage: 2005-2015. (2015).
[33] Matthew Richardson. 2008. Learning about the world through long-term query
     logs. ACM Transactions on the Web 2, 4 (2008), 21.
[34] Daniel E Rose and Danny Levinson. 2004. Understanding user goals in web
     search. In Proceedings of the 13th Intl. conference on World Wide Web. ACM.
[35] Donald B Rubin. 2005. Causal inference using potential outcomes: Design, mod-
     eling, decisions. J. Amer. Statist. Assoc. 100, 469 (2005), 322–331.
[36] Abigail J Sellen, Rachel Murphy, and Kate L Shaw. 2002. How knowledge work-
     ers use the web. In Proceedings of the SIGCHI conference on Human factors in
     computing systems. ACM, 227–234.
[37] Diana Tabatabai and Bruce M Shore. 2005. How experts and novices search the
     Web. Library & information science research 27, 2 (2005), 222–248.
[38] Rannie Teodoro and Mor Naaman. 2013. Fitter with Twitter: Understanding
     Personal Health and Fitness Activity in Social Media. In AAAI Conf. on Weblogs
     and Social Media.