Answering What If, Should I and Other Expectation Exploration Queries Using Causal Inference over Longitudinal Data

Emre Kıcıman, Microsoft Research, emrek@microsoft.com
Jorgen Thelin, Microsoft Research, jthelin@microsoft.com

DESIRES 2018, August 2018, Bertinoro, Italy. © 2018 Copyright held by the author(s).

ABSTRACT
Many people use web search engines for expectation exploration: exploring what might happen if they take some action, or how they should expect some situation to evolve. While search engines have databases to provide structured answers to many questions, there is no database about the outcomes of actions or the evolution of situations. The information we need to answer such questions, however, is already being recorded. On social media, for example, hundreds of millions of people are publicly reporting the actions they take, the situations they are in, and an increasing range of events and activities experienced in their lives over time. Here, we show how causal inference methods can be applied to such data to generate answers for expectation exploration queries. This paper describes a system implementation for running ad-hoc online causal inference analyses. The analysis results can be used to generate pros/cons lists for decision support and timeline representations that show how situations evolve, and can be embedded in many other decision support and planning applications. We discuss potential methods for evaluating the fundamental quality of inference results and for judging the short-term and long-term usefulness of the information for users.

1 INTRODUCTION
Everyone, at some point in their lives, finds themselves in an unfamiliar situation, considering what they should do, and trying to understand what to expect of the future. We see such expectation exploration occurring in web searches, with people exploring possible consequences of their choices and the outcomes of situations. These explorations cover both consequential topics, such as life-changing education and career choices (e.g., “Should I join the military?”) or major financial and personal decisions (e.g., “Should I move to California?”), and more quotidian topics, such as the consequences of purchase decisions, athletic training regimens and dating rituals.

The answers to these questions are not readily available in a knowledge base or Wikipedia. But the information necessary to answer these questions is already being recorded on social media, where hundreds of millions of individuals regularly and publicly report their personal experiences, including the situations they are in, the actions they take, and the experiences they have afterwards. For example, people talk about work or relations [12, 15], health and dietary practices [1, 38], and even log information about their illnesses and coping strategies [8, 13]. People report and share this information for many reasons: keeping in touch with friends, gaining social capital, diary-keeping, or even helping others. And with increasing use of personal sensors and devices, from exercise trackers to health monitors, such data streams are becoming more regular, more detailed and more reliable [4, 26, 32]. These longitudinal data streams, in aggregate, capture a rich set of relationships between the situations in which people find themselves, the actions they choose to take, and the outcomes they experience.

We describe Outcomes Engine, a system for analyzing such large-scale longitudinal data to characterize how situations evolve over time, and to capture the consequences of people’s actions. Given a query representing some target action T, Outcomes Engine identifies individuals who have reported doing T, and compares their subsequent experiences to peers who did not report doing T. This comparison results in an expectation map detailing “what changes to expect” over time due to T. A key aspect of Outcomes Engine is its use of causal inference methods to compare the two sets of individuals so as to isolate the specific consequences of T from subsequent experiences that are correlated with, but not due to, T.

The expectation maps generated by Outcomes Engine are an important building block for a wide variety of data-driven search and decision-support applications, from automatically generating decision aids, such as pros and cons lists, to helping individuals ground their experiences in how a situation is likely to evolve over time (cf. Figure 1). In addition, expectation maps may be useful for policy makers’ and scientists’ explorations across a variety of domains. In this paper, we discuss our approach and prototype system, several application scenarios, as well as evaluation challenges and strategies.

2 BACKGROUND AND RELATED WORK

2.1 Expectation Exploration Tasks
Exploring expectations on the Internet plays an important role in people’s planning, decision-making, and forecasting for both everyday and extraordinary scenarios. These explorations encompass a broad variety of tasks, including explorations of hypothetical, ongoing or past problems, or seeking informational support, emotional satisfaction, or preparation for a future event. Taxonomies of web search activities classify these as information gathering tasks, which encompass 35% to 80% of people’s web searches [17, 36]. Expectation exploration may also be considered as a temporal web query, where time is relative to individually experienced timelines, rather than, for example, a calendar date or global event [5, 7].
[Figure 1 mockups: (a) Timeline answer to “Do people buy new cars after a raise?” (luxury cars vs. small cars); (b) Pros / cons list for “Should I get a dog?” (pros: love the dog, enjoy walks; cons: early wake up, scratched furniture); (c) Conversational agent answering “Hey, I sprained my ankle badly. When will I play football again?” with “People start to mention playing football after 8 weeks”.]
Figure 1: Interface mockups: Expectation exploration tasks may be satisfied with a variety of information presentations.

Decision-making processes in particular depend critically on such information gathering, especially in unfamiliar situations, where the web augments more conventional information sources such as professional and friends’ advice, training, etc. In 2004, Rose and Levinson measured advice-related searches as 2-5% of web search tasks [34]. Bailey et al. find that decision-related tasks, including comparing (9%) and planning (2%), constitute a significant portion of overall web tasks. Lagan et al. find that even in pregnancy, a scenario with dedicated information infrastructures, related health professionals and care programs, over 80% of women used web search to help make decisions [23].

Though there are online resources and crowdsourced methods for exploring some scenarios, extracting outcomes from aggregated personal data streams has many distinct advantages [22, 25]. First, results are grounded in the real experiences of users who have taken an action, potentially leading to more reliable results than simply reading advice from web pages. Second, a question may be too rare for anyone to have written advice about it, while there is still plenty of social data with which to answer it via data mining. For example, someone may ask whether to move to one city vs. another. Web pages may exist to answer such a question for some city pairs, but not for all. In contrast, we need only look at social postings from people who have moved to one city vs. the other and compare their postings to see the relative benefits of each. Third, an answer may be contextually dependent on the asker. The methods presented in this paper can potentially be extended to provide answers personalized to the asker.

Once an expectation map has been extracted for a scenario, it can be embedded in many distinct presentations and applications to provide the asker with a high-level overview of the implications of a choice or the evolution of a situation. For example, a timeline view may show how outcomes evolve over time (Figure 1-a). Another application, specifically for decision support, is an automatically generated pros/cons list [20] (Figure 1-b). The resultant data could also be used within a conversational agent (Figure 1-c).

While our work may benefit individuals who wish to understand their situations and the possible implications of their actions, there is also an opportunity to use this kind of analysis to better understand behavioral phenomena of societal importance, third-party interventions and other policy questions. As well, while we focus on analysis of timelines of individual people’s experiences, such analyses may also be applied to event timelines of other kinds [2], subject to sufficient data availability and assumptions.

2.2 Causal Inference
In this paper, we propose to analyze individual-level longitudinal datasets with causal inference methods to directly identify what can be expected following some action or individual experience. We believe this can provide a semi-structured representation of expectations that can be used in a wide variety of ways to aid individuals’ planning, decision-making, and forecasting.

Because we are interested in using our analysis results to aid decision-making (essentially an intervention), our goal is fundamentally one of causal inference. While we do not believe we can achieve the ideal identification of causal relationships, we can use methods borrowed from the causal inference literature to reduce the bias of naïve correlational analyses. Here, we give a brief introduction to potential outcomes, one framework for causal reasoning [35].

In the potential outcomes framework, whether some experience “causes” an outcome is computed by comparing two potential outcomes: one outcome $Y_i(T = 1)$ after a person $i$ has a target experience $T$¹, and another outcome $Y_i(T = 0)$ when the same person in an identical context does not have the experience. The causal effect of $T$ is then $Y_i(T = 1) - Y_i(T = 0)$. Of course, it is impossible to observe both $Y_i(T = 1)$ and $Y_i(T = 0)$ for the same individual $i$. Once we observe $i$ having the experience or not, we cannot observe the other, counterfactual outcome.

Thus, the problem of causal inference is, in a sense, a problem of missing data, and causal inference techniques attempt to address this challenge by estimating the missing counterfactual outcome for an individual based on the outcomes of other, similar individuals. A common method for estimating missing counterfactual outcomes is to find pairs (generalizing to groups) of individuals in the observational data whose covariates are statistically very similar to one another, but where one has received a treatment and the other has not. Each individual’s matched partner then provides the basis for estimating a counterfactual outcome for that individual. We describe our specific method in Section 3.2.

Prior research demonstrates the feasibility of this approach in high-dimensional settings (such as our proposed analysis of social media and sensor data). For example, Eckles and Bakshy reduced bias in an observational study by 97% compared to a naïve analysis, as measured against a gold-standard randomized field experiment, by conditioning on high-dimensional covariate data [11].

¹ In medical and social sciences literature, the target experience is often called the treatment, and is compared to a control or placebo experience.
Following this convention, we will use the terms treated group and control group in this paper.

2.3 Social and Online Data Analyses
Longitudinal studies of online data, including social media data and search query logs, have proven effective in helping understand the behaviors of people in various situations. These studies have been targeted to explore and understand how situations evolve over time, identify predictive factors involved in positive and negative outcomes, and help identify at-risk individuals. For example, using search query logs, Paul et al. [31] characterize information seeking behavior during various phases of prostate cancer. Fourney et al. [14] align search query logs with the natural clock of gestational physiology of pregnant women to characterize their changing information needs. Althoff et al. study 5 years of fitness tracking data to better understand social influence on physical activity [3].

By mining social media, De Choudhury et al. [9] find behavioral cues useful to predict the risk of depression before onset. Similarly, by leveraging these naturalistic data, prior work examined how dietary habits vary across locations [1]; the links between diseases, drugs, and side-effects [27, 30]; links between actions and outcomes [20]; shifts in suicidal ideation [10]; and how alcohol usage in early college affects long-term outcomes [19]. Olteanu et al. demonstrate propensity-scored analysis of social media timelines to understand outcomes across a broad set of domains [28].

3 CAUSAL INFERENCE-BASED MAPPING OF EXPECTATIONS
We present our approach to mapping expectations from social datasets. First, we present our basic design, data requirements and assumptions, followed by our definition of a query and result representation. Then, we present our method for extracting expectations by applying causal inference over social datasets. We use causal inference for this purpose to remove merely correlated outcomes and focus on outcomes directly caused by an action or treatment. This is particularly important for applications that will be performing interventions (including decision-support applications for individuals and policy makers).

3.1 Basic Design
Data. The fundamental requirement our approach places on the data is that it provides a longitudinal view of the actions and experiences of individuals. Thus, at a minimum, input data observations must include a user id and datetime in addition to observational content (e.g., message text).

We focus our prototype implementation on social media data for several reasons. First, social media data provides high-dimensional and cross-domain coverage, allowing a broad variety of query topics and increasing the likelihood of observing statistical confounders that would otherwise bias an analysis. Second, the textual nature of social media data is relatively interpretable. Third, social media data is available at large scale and captures individual activities over long periods of time. Beyond social media, our framework may be applied to other kinds of data sources. For example, personal sensors and other services may be supported, though treatment identification and result interpretation in our framework would require adaptation. Search query histories are particularly promising, as past analyses have demonstrated the potential for longitudinal analysis of search histories [3, 14, 29, 33].

Input Query. Asking a question to explore expectations following an action or event requires identification of individuals who have performed a particular action or experienced a particular situation. The pattern for identifying messages about this experience, then, is the fundamental input query we expect. Our prototype relies on explicit textual mentions of actions and situations and, in our design, we allow a boolean phrase query, with some wildcard support, for identifying targeted experiential phrases.

Expectation Maps. Expectation maps represent the time-varying effects of an experience or treatment over a population of people. An expectation map for a treatment can be represented as a 2D matrix, where each row is an outcome word or topic, and each column represents an epoch of time (e.g., hours or days since treatment). Each cell represents the effect of the treatment on a specific outcome during a specific epoch. The effect itself includes measurements of effect size and statistical significance, and can be extended to include details of heterogeneous effects.

3.2 Causal Inference Method
In our system, we use a stratified propensity score analysis to estimate missing counterfactual outcomes by identifying matching subpopulations of individuals with similar distributions of covariates, but with differing treatment status. Given a set of social media messages, we apply a preprocessing step to generate a set of per-user timelines. Once a query is issued, we identify the users that have mentioned the treatment experience and place them in a treated group, and place all other users in a control group. We align treated users’ timelines based on when the individual mentioned experiencing the treatment. We align the control users based on a random “placebo” time. To reduce the effects of temporal biases, we assign placebo times to match the distribution of treatment times.

Stratification is achieved by estimating each individual’s likelihood of being in the treated group using a propensity score model. This is a learned function that infers the likelihood of being in the treated group as a function of a set of covariates (i.e., individual properties and past tweets that might influence both treated/control status and outcomes). Individuals with similar propensity scores are grouped into strata. In aggregate, individuals within a stratum are likely to have similar covariates, allowing us to isolate and estimate the effects of the treatment itself within each stratum. Note that the primary purpose of the propensity score model is to identify groups of individuals with similar covariates; the accuracy of predicting group status is secondary. To ensure the quality of counterfactual estimates, the method drops strata that have either too few treated or too few control users. Outcomes are aggregated across the remaining strata, weighted by the size of the treated population in each stratum, to estimate the average effect of treatment on the treated population.

The details of our analysis are as follows:

Covariate and outcome features: The content of social media messages from before the treatment (or placebo) time, as well as other user properties (posting frequencies, message lengths, profile information, etc.), are extracted as covariates: potentially confounding features that may influence both treatment status and outcomes. The content of social media messages after the treatment (or placebo) time is extracted as the time-varying outcome measures of the treatment.

We represent social media message content in our covariate and outcome features as empirical, unsmoothed word likelihoods. We limit our word distributions to the top 50k unigrams in our corpus. We do not remove stopwords, stem or normalize the text, and use whitespace and punctuation to identify word-breaks. Optionally, given a word-to-topic mapping, we combine outcome word likelihoods to generate the total topic likelihood.
Propensity score modeling: We implement our high-dimensional propensity score analysis as a logistic regression with 10-fold cross-validation. Our analysis divides users into 100 strata and removes strata with too few treated users, too few control users, or both. In practice, this removes the lowest-propensity strata and the highest-propensity strata, leaving the middle strata in these analyses. The outcome differences in these remaining strata are weighted according to the treated population distribution and combined to estimate the average treatment effect on the treated group.

While we borrow propensity score analysis from the causal inference literature, our application of this technique is not a causal analysis, as two key assumptions may not hold. First, all confounding variables must be included in the observed covariates. Yet, while high-dimensional propensity score analyses, such as ours, are more likely to capture those variables correlated with confounding variables, it is difficult to argue that all relevant aspects of individuals’ lives are captured in their Twitter streams. Second, the stable unit treatment value assumption (SUTVA) must hold; that is, one person’s outcome must be independent of whether another person had the target experience. Additional domain knowledge is required to assert these assumptions.

4 OUTCOMES ENGINE ARCHITECTURE
To execute online ad-hoc causal inference analyses over large-scale datasets, we must provide scalable implementations for treatment identification, covariate and outcome extraction, and propensity score modeling. We use a two-tiered approach to our cluster design: 1) User data is distributed randomly across data nodes, with all data from a single user assigned to a single node. Each data node consists of a Treatment Identification server and a Timeline server. 2) A centralized query node is responsible for distributing queries across all data nodes, centralized building of the propensity score model, and aggregating stratified outcomes.

Treatment ID Server. The Treatment ID server provides an index over the full text of messages. Given a query (the treatment identification pattern), the Treatment ID server uses the index to return the user ID and treatment time for users who have posted a message matching the query. In addition, the Treatment ID server returns a sample of the remainder of the population to be used as a control group. These user IDs are each returned with an assigned placebo time. The size of the control sample is given as a multiple of the treatment population size. The larger the control population, the more likely that there will be similar users (i.e., better matches) between the treated and control populations. The trade-off is that analyzing a larger control population will require more time.

Model Builder. The Model Builder collects the covariates and treatment/control status of users (or samples of users) from all data nodes. Then, it applies a supervised algorithm to learn a model of the propensity of users to be treated. This learned model is distributed across all the data nodes.

Timeline Server. The Timeline server stores, for each user, a compressed representation of the timeline of token occurrences (the unigrams, bigrams, or phrases mentioned by users). Given a treatment (or placebo) time for a user, the Timeline server can quickly return a summary representation of covariates, or a summary representation of outcomes. Figure 2 shows a sketch of the simple timeline data structure. For each token that has been used by a user, we use a binary search to identify the array index of the treatment time, and compute the number of occurrences of the token from the index value. Simple extensions allow us to calculate the number of occurrences within arbitrary time windows.

[Figure 2 sketch: arrays of token occurrence timestamps, one array per token (w0, w1, ...), with entries t0, t1, ..., ti, ..., tk-1, tk and a marker at the treatment/placebo time. Labels: “w1 occurs i-1 times in covariate window”, “w1 occurs k-i times in outcome window”.]
Figure 2: Timeline data structure

Outcome Aggregator. The Outcome Aggregator is responsible for gathering the partially aggregated outcomes from data nodes, identifying strata to drop due to lack of comparable subpopulations, and performing a weighted aggregation of outcomes across the remaining strata. In addition, the Outcome Aggregator runs diagnostics on the analysis, such as covariate balance and other validity tests.

Request flow. As shown in Figure 3, when a request arrives from an application at the query node, the query node first forwards the query to all data nodes (step 1), where the Treatment ID server identifies the treated and control groups and individuals’ treatment and placebo times (step 2). Then, each Timeline server featurizes the covariates for these users and returns these covariates and their treated/control labels to a Model Builder in the centralized query node (step 3). If the treatment and control groups are very large, they can be downsampled to improve end-to-end performance. The Model Builder collects these covariate and label data from all the data nodes, dynamically learns a propensity score model and sends the model to all of the Timeline servers (step 4). Each Timeline server applies the propensity score model to assign users to strata, scans over the outcomes experienced by each user, and partially aggregates the outcomes. These outcomes are returned to the Outcome Aggregator on the centralized query node (step 5). The outcomes from all data nodes are aggregated and then returned to the app user (step 6).

[Figure 3 diagram: data nodes, each with a Treatment ID Server and a Timeline Server, and a query node with a Model Builder and an Outcome Aggregator. Numbered arrows: 1 query; 2 treatment times; 3 covariates, T/C status; 4 model; 5 partially aggregated outcomes; 6 results.]
Figure 3: Outcomes Engine

5 APPLICATIONS AND EVALUATION
Our work can be seen as part of the broader trend in search systems of bridging the online and physical worlds [6]. Using social media as a longitudinal sensor into people’s experiences, we build a digital representation of the consequences of actions and situations. A key component to ensuring the interpretability and usefulness of this information for improved exploration and decision-making is how and when applications present this information and enable interaction. In this section, we discuss some of the considerations for applications and how they might be evaluated.

5.1 Applications
Applications for Individuals. First, we believe that individuals may benefit from the kind of outcomes we uncover. For instance, prior work on online health communities indicates that new patients seek experience-based information from others in similar situations for advice, or to validate their feelings or life decisions [13, 16]. In such a scenario, our work can support users in exploring the types of issues others in similar situations are likely to have experienced as a consequence. Further, even when the outcomes of an action or situation are known, aggregated statistics about their likelihood can prove informative for those seeking information about them. Apart from helping individuals understand new situations, information about potential outcomes can also be used to support them in achieving goals or making decisions.

Figure 1 shows user interface sketches that present expectation maps in different forms. The timeline representation, shown in Figure 1-a, can help users understand how outcomes evolve following an action or experience. A list of pros and cons may be better suited to decision-support scenarios, to ensure that the decision-maker is aware of the most important consequences, good and bad, of a choice. Conversational assistants may use expectation maps to aid topical chit chat and banter, as well as to provide more direct advice and information support.

Applications for Policy-makers & Scientists. While our work is motivated primarily by the desire to help individuals understand their situations and the possible implications of their actions on a need basis, there is also an opportunity to use this kind of analysis to better understand behavioral phenomena of societal importance, third-party interventions and other policy questions. Further, large, quantitative analyses such as ours can complement small-scale qualitative or survey-based studies of social phenomena (e.g., see [8, 18]), and vice-versa. Insights about topics of interest may inform what questions are being asked, while insights on temporal dynamics may be used to align survey answers with time-dependent episodes [14].

Across all of these potential uses of expectation maps by individuals and policy-makers, there are important questions about how searchers interact with this information and how to best support their tasks, their exploration and their understanding of this data. In general, we have found that displaying samples of the underlying supporting evidence (i.e., messages written by individuals who have had an experience and a particular consequence) provides significant help in interpreting results and understanding potential underlying causal mechanisms for an outcome [28]. Beyond these domain-agnostic presentations of textual data, domain-specific applications may utilize additional domain knowledge and context to improve interpretability.

5.2 Evaluation Strategies
We propose three key criteria for evaluating the quality of expectation maps and their use: the correctness of the expectations; the interpretability of results; and the overall usefulness of the information for searchers.

Correctness. In prior work, we measured the surface validity of the results of our analysis across a broad variety of domains (including health, business, and society topics), based on manual annotation of outcomes by crowd-workers [28]. Here we briefly summarize our evaluation method and key results. Each specific expectation (a relationship between a single experience and a single outcome of the experience) was shown to workers, with a question about whether a person who had the given experience would be more likely to talk about the outcome in the future. To aid interpretability, we provided workers with two pairs of text examples of experience and outcome messages, and links to web search results for the experience and consequence. With these annotations, we measured the precision of results @N (ranked by effect size).

Figure 4 shows the precision variation at different cut-offs across experiments. We notice a drop of 10-20% in precision from the top 5 to the top 20 outcomes, with the median precision dropping from close to 80% to about 50%, followed by a slower overall decay. Yet, even after the top 30, the discovered outcomes attain an average perceived precision of over 50%. These results have two main takeaways. Overall, the discovered outcomes tend to attain good precision scores across experiences, which correlate with their effect size. Separately, we find that P@10 varies across domains, ranging from over 55% to 100% on average per domain, and that the perceived precision varies strongly with the data volume it was computed on. This partially explains the variance of P@10 across domains. However, other factors, such as errors in the semantic interpretation of words and domain-specific biases in the likelihood of users to mention certain outcomes, might also play a role.

Figure 4: Variations in precision across top N outcomes. The boxplots summarize the precision@N across 39 distinct situations in 9 domains within health, business and society topics. Red lines represent the median, while dots the mean.

Beyond evaluating the surface validity of results, another method for evaluating the correctness of expectation maps is prediction over held-out data. If our predictions are reliable, our treatment effect estimates should match those seen in held-out data. Finally, as a truly end-to-end test of accuracy, we may consider asking searchers how their experiences evolved, and checking how well that matches our mined expectations.

Interpretability. While results may be technically correct, searchers are more likely to be successful if the results they see are quickly and easily interpretable. Methods for improving interpretability can rely on exploration, supporting evidence and context, as mentioned above. While evaluating the interpretability of results presents many challenges and is left largely for future work, we believe it will benefit from earlier methods developed for quantitative and qualitative evaluation of search quality [21, 24, 37].

Usefulness. To truly understand the end-to-end benefits of this information for end users, however, we must perform end-to-end studies of the usefulness of the results in improving people’s outcomes: for example, are searchers more confident in their choices, and are they making better decisions? For this purpose, we recommend long-running user studies and surveys that capture the situations people are exploring and why they are exploring them (whether for immediate decision-making, for long-term planning, or simply out of curiosity), and that later come back to the user to ask how this information affected their behavior, choices, and possibly even outcomes.

6 CONCLUSIONS
As computing devices continue to become more embedded in our everyday lives, they are mediating an increasing number of our interactions with the world around us. From helping people search

Acknowledgments
This work builds on ideas developed over several years through collaborations and discussions with many people. We would like to thank many colleagues, including Paul Bennett, Scott Counts, Munmun De Choudhury, Susan Dumais, Adam Fourney, Myeongjae Jeon, Alexandra Olteanu, Matt Richardson, Michael Lowell Roberts, Du Su, James Thomas, Onur Varol, Ryen White, Xinhao Yuan, Lidong Zhou, and Brian Zill.

REFERENCES
[1] Sofiane Abbar, Yelena Mejova, and Ingmar Weber. 2015. You tweet what you eat: Studying food consumption through twitter. In Proc. of ACM CHI. 3197–3206.
[2] Omar Alonso, Serge-Eric Tremblay, and Fernando Diaz. 2017. Automatic Generation of Event Timelines from Social Data. In Proceedings of the 2017 ACM on Web Science Conference. ACM, 207–211.
[3] Tim Althoff, Pranav Jindal, and Jure Leskovec. 2017. Online actions with offline impact: How online social networks influence online and offline user behavior. In Proc. of ACM WSDM. ACM, 537–546.
[4] Kat Austen. 2015. What could derail the wearables revolution? Nature 525 (2015).
[5] Ricardo Baeza-Yates. 2005. Searching the future. In SIGIR Workshop MF/IR.
[6] Wolfgang Büschel, Annett Mitschick, and Raimund Dachselt. 2018. Here and Now: Reality-Based Information Retrieval: Perspective Paper. In Proceedings of the 2018 Conference on Human Information Interaction & Retrieval. ACM, 171–180.
[7] Ricardo Campos, Gaël Dias, Alípio M Jorge, and Adam Jatowt. 2015. Survey of temporal information retrieval and related applications. ACM Computing Surveys (CSUR) 47, 2 (2015), 15.
[8] Wen-Ying Sylvia Chou, Yvonne Hunt, Anna Folkers, and Erik Augustson. 2011. Cancer survivorship in the age of YouTube and social media: a narrative analysis. Journal of medical Internet research 13, 1 (2011).
[9] Munmun De Choudhury, Michael Gamon, Scott Counts, and Eric Horvitz. 2013. Predicting Depression via Social Media. In Proc. of AAAI ICWSM.
[10] Munmun De Choudhury, Emre Kiciman, Mark Dredze, Glen Coppersmith, and Mrinal Kumar. 2016. Discovering shifts to suicidal ideation from mental health content in social media. In Proc. of ACM CHI. 2098–2110.
[11] D. Eckles and E. Bakshy. 2017. Bias and high-dimensional adjustment in observational studies of peer effects. ArXiv e-prints (June 2017). arXiv:stat.ME/1706.04692
[12] Kate Ehrlich and N Sadat Shami. 2010. Microblogging Inside and Outside the Workplace. In AAAI Conf. on Weblogs and Social Media.
[13] Jordan Eschler, Zakariya Dehlawi, and Wanda Pratt. 2015. Self-Characterized Illness Phase and Information Needs of Participants in an Online Cancer Forum. In Proc. of AAAI Conf. on Web and Social Media.
[14] Adam Fourney, Ryen W White, and Eric Horvitz. 2015. Exploring time-dependent concerns about pregnancy and childbirth from search logs. In Proc. of the ACM CHI. 737–746.
[15] Venkata Rama Kiran Garimella, Ingmar Weber, and Sonya Dal Cin. 2014. From "I love you babe" to "leave me alone"-Romantic Relationship Breakups on Twitter. In Conf. on Social Informatics. Springer, 199–215.
[16] Jina Huh and Mark S Ackerman. 2012. Collaborative help in chronic disease management: supporting individualized problems. In Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work. ACM, 853–862.
[17] Bernard J Jansen, Danielle L Booth, and Amanda Spink. 2008. Determining the informational, navigational, and transactional intent of Web queries. Information Processing & Management 44, 3 (2008), 1251–1266.
[18] Lloyd D Johnston, Patrick M O’Malley, Jerald G Bachman, and John E Schulenberg. 2011.
Monitoring the Future national survey results on drug use, 1975-2010. for the best product to buy, to recommending a restaurant we are Volume I: Secondary school students. (2011). [19] Emre Kıcıman, Scott Counts, and Melissa Gasser. 2018. Using Longitudinal likely to enjoy, computing services enable users to evaluate op- Social Media Analysis to Understand the Effects of Early College Alcohol Use. In tions and take action with “one click”. While such services model ICWSM-18. AAAI. many facets of the options they present, they do not model the [20] Emre Kıcıman and Matthew Richardson. 2015. Towards decision support and goal achievement: Identifying action-outcome relationships from social media. higher-level implications and trade-offs inherent in deciding to In Proc. ACM KDD. 547–556. take one action instead of another. By aggregating the combined [21] Shirlee-ann Knight and Janice Burn. 2005. Developing a framework for assessing experiences of hundreds of millions of people, our search services information quality on the World Wide Web. Informing Science 8 (2005). [22] Nicolas Kokkalis, Thomas Köhn, Johannes Huebner, Moontae Lee, Florian Schulze, have an opportunity to provide significant assistance to individuals and Scott R Klemmer. 2013. Taskgenies: Automatically providing action plans in their expectation explorations and decision-making. Integrating helps people complete tasks. ACM Transactions on Computer-Human Interaction (TOCHI) 20, 5 (2013), 27. causal inference as a fundamental piece of this analysis allows us [23] Briege M Lagan, Marlene Sinclair, and W George Kernohan. 2010. Internet use to capture consequences of actions and situations that enables our in pregnancy informs womenâĂŹs decision making: a web-based survey. Birth search services to be better integrated into interventions, such as 37, 2 (2010), 106–115. [24] Dmitry Lagun and Eugene Agichtein. 2011. 
Viewser: Enabling large-scale remote decision-support, planning, and advice scenarios, where correla- user studies of web search examination and interaction. In Proceedings of the 34th tional analyses may be too risky given consequential outcomes. international ACM SIGIR conference on Research and development in Information Answering Expectation Exploration Queries using Causal Inference DESIRES 2018, August 2018, Bertinoro, Italy Retrieval. ACM, 365–374. [25] Edith Law and Haoqi Zhang. 2011. Towards large-scale collaborative planning: Answering high-level search queries using human computation.. In AAAI. [26] Andrew Meola. 2016. Wearables and mobile health app usage has surged by 50% since 2014. http://www.businessinsider.com/fitbit-mobile-health-app-adoption- doubles-in-two-years-2016-3. (2016). [Online; Accessed 27-July-2016]. [27] Mark Myslín, Shu-Hong Zhu, Wendy Chapman, and Mike Conway. 2013. Using Twitter to examine smoking behavior and perceptions of emerging tobacco products. Journal of medical Internet research 15, 8 (2013). [28] Alexandra Olteanu, Onur Varol, and Emre Kıcıman. 2017. Distilling the outcomes of personal experiences: A propensity-scored analysis of social media. In Proc. of CSCW 2017. ACM, 370–386. [29] John Paparrizos, Ryen W White, and Eric Horvitz. 2016. Screening for pancreatic adenocarcinoma using signals from web search logs: Feasibility study and results. Journal of Oncology Practice 12, 8 (2016), 737–744. [30] Michael J Paul and Mark Dredze. 2011. You are what you Tweet: Analyzing Twitter for public health.. In Proc of. AAAI ICWSM. 265–272. [31] Michael J Paul, Ryen W White, and Eric Horvitz. 2015. Diagnoses, decisions, and outcomes: Web search as decision support for cancer. In Proc. of WWW. ACM. [32] Andrew Perrin. 2015. Social media usage: 2005-2015. (2015). [33] Matthew Richardson. 2008. Learning about the world through long-term query logs. ACM Transactions on the Web 2, 4 (2008), 21. [34] Daniel E Rose and Danny Levinson. 
2004. Understanding user goals in web search. In Proceedings of the 13th Intl. conference on World Wide Web. ACM. [35] Donald B Rubin. 2005. Causal inference using potential outcomes: Design, mod- eling, decisions. J. Amer. Statist. Assoc. 100, 469 (2005), 322–331. [36] Abigail J Sellen, Rachel Murphy, and Kate L Shaw. 2002. How knowledge work- ers use the web. In Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, 227–234. [37] Diana Tabatabai and Bruce M Shore. 2005. How experts and novices search the Web. Library & information science research 27, 2 (2005), 222–248. [38] Rannie Teodoro and Mor Naaman. 2013. Fitter with Twitter: Understanding Personal Health and Fitness Activity in Social Media. In AAAI Conf. on Weblogs and Social Media.