Opinion Mapping: Information Visualization Approaches for Comparative Sentiment Analysis

William H. Hsu
Kansas State University
bhsu@ksu.edu
+1 785 236 8247

Praveen Koduru
IQ Gateway, Inc.
Praveen.Koduru@gmail.com

ABSTRACT
In this position paper, we discuss the problem of extracting information about chronic diseases from the large volume of text written in health blogs, mailing lists, forums, and other electronic venues, then making this information accessible via structured queries, while analyzing it to map out patterns among the opinions and demographics of users. Information retrieval systems exist for spatially-referenced demographic data about diseases such as diabetes and their therapies, but as in the case of communicable diseases, the databases that contain such data are manually populated. For example, in information portals such as HealthMap.org, which are searchable by location and disease, the data are user-reported and collaboratively maintained, but not automatically extracted from text. Furthermore, there is as yet no automated means of relating sentiments expressed by users in their text postings to their semi-structured profile data. This is because the primary sources for this kind of information have been statistical surveys such as opinion polls, where text responses are often human-interpreted and demographic analysis is done post hoc, rather than as part of an information retrieval and extraction task. These limitations indicate a present need for text summarization techniques that integrate quantitative information extraction – which captures symptoms, diseases, and complications of diseases – with opinion summarization.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information search and retrieval – clustering, relevance feedback, selection process; H.2.8 [Database Management]: Database applications – data mining, spatial databases and GIS

General Terms
Algorithms, Experimentation, Human Factors

Keywords
sentiment analysis, social networks, geoinformatics, opinion mining, subjectivity, information extraction, information visualization, human-computer interaction

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. EuroHCIR 2012. Copyright 2012 ACM

1. INTRODUCTION
In this paper, we address the problem of information retrieval and information extraction in subjective domains, with applications to visualization of opinions – specifically, thematic mapping of opinions. At present, there is a dearth of methods for integrating user profile data for social networks with blog posts, tweets, and other content from the associated social media. These limitations present an integrative challenge for human-computer interaction (HCI) and information retrieval (IR). Towards this end, the specific aims of the research espoused in this position paper are as follows:

1. Aim 1. Extend known algorithms for named entity recognition and relationship extraction, to produce basic summaries of diseases and treatments mentioned in texts. The technical objective is to tag where basic entities and opinions are mentioned in freely available text (including both user posts and profiles), then map these tagged elements in space, time, and by topic, to acceptable levels of precision and recall.

2. Aim 2. Adapt basic known techniques to the domain of type 2 diabetes – specifically, extracting data from text discussions of diabetes that are archived from health blogs and forums using web crawlers. This entails developing a means of handling entities and quantitative data that have not previously been extracted from text, such as information concerning insulin and oral anti-diabetic drug dosage, HbA1c levels, etc. Another functional requirement is some mechanism for entity reference resolution, e.g., abbreviations and synonyms, for known terms. Finally, a domain-specific ontology of relevant symptoms, disease attributes, complications, and treatments is proposed. For type 2 diabetes, this includes topics frequently discussed in health blogs and forums: food groups, meal plans, nutritional constraints, and conditions such as obesity that are linked to diabetes. This shall facilitate information retrieval applications such as question answering about meal plans recommended by primary care physicians and specialists.

3. Aim 3. Develop methods for sentiment analysis and improve existing ones, to summarize opinions and discover patterns. The technical objective is to relate demographic data extracted from text and profiles to qualitative data – namely, the polarity of text at the document, sentence, or aspect level, aggregated across demographic categories such as geographic region of residence. Objects of interest for sentiment analysis include prescribed therapies and specifically side effects, but can extend to disease aspects and complications.

The overall goal of this approach is to develop an integrative technology for summarizing online text about chronic diseases, capturing opinions from users' posts and demographic data from a combination of their posts and profiles, and finally using these to discover global patterns indicated by the set of all text documents. The central hypothesis of this work is that a combination of entity and relationship extraction, driven by a domain-specific ontology of terms, will result in more precise and accurate summarization of opinions. This will increase the usefulness of free-form text, written by users of social media, in understanding patterns that are reflected in the opinions and demographics of chronic disease patients.

2. SIGNIFICANCE
2.1 Information Extraction from Health Blogs
The chief potential impact of the research framework and test bed proposed in Section 1 is to provide assistive technologies to public health analysts and health services analysts who are using blogs, microblogs (e.g., Twitter), and other social media to explore user opinions about chronic disease issues. As an example, in the application domain of type 2 diabetes, these include dietary treatments such as carbohydrate control, complications such as gastroparesis induced by diabetes that may pose digestive constraints, and recommendations of primary care physicians, therapists, endocrinologists, nutritionists, etc.

The availability of mailing lists, blogs, wikis, and other electronic media for content management and dissemination has resulted in rapid growth in the volume of online text data containing voluntarily expressed public opinions about health issues. While general-purpose metadata tools exist for annotating this text, the opinions themselves remain a largely unexplored source of information about how chronic diseases affect populations. Meanwhile, the task of relating content from these various self-publishing media to semi-structured profile data from their users has not yet been effectively automated.

We advocate development of application test beds and experimental systems aimed at improving techniques for information extraction, ontology development and mapping, and text mining to identify opinion patterns. The potential progress in these areas is due in part to the approach of combining information extraction to discover disease mentions with sentiment analysis to establish opinions, and in part to the application of this approach to a new source of data: free-form text describing user demographics, attributes of the chronic disease of interest and its related entities, and opinions and semi-structured profile data.

To help public health researchers tap into these freely available but unexplored sources of opinions, we propose to develop information extraction (IE) and summarization methods geared toward health blog postings and similar text. Such postings contain not only opinions, and attribution information that can be used to link them to the users who expressed them, but also factual data about the posters and their opinions. This data can help place opinions in a comparative context with population statistics, such as the reporting frequency of symptoms, side effects, and complications.

2.2 Ontology Development
The research approach centers around using information extraction to obtain structured data in the form of records about chronic disease references in text, which are then linked to users via relational data extracted from their profiles. However, the body of relevant concepts in the healthcare domain and in the clinical domain theory of each chronic disease is much broader. Currently there exist pre-clinical (genomic and proteomic) and clinical translational ontologies that contain information relevant to diabetes, but they do not provide the requisite concepts for mining free-form text written by lay users who are discussing diabetes online. We propose to develop an ontology for text mining in diabetes, and the mappings from extracted entities and relationships into this ontology.

2.3 Opinion Mining (Sentiment Analysis)
This aspect of the proposed work focuses on a basic research problem: sentiment analysis from text, also known as opinion mining, whose objective is to determine from analysis of a written document what the author's attitude towards an identifiable topic is. This attitude can be subjective or objective; it can be identified as an evaluation (positive or negative), a declaration of the author's emotional attitude, or an expression intended to evoke an emotional response in the reader. Subjects of interest include chronic diseases, their features or aspects including symptoms, complications, and treatments, and related health services.

2.4 Current State of the Field
Figure 1. Prototype event search based on a previous IR system for veterinary epidemiology.

Figure 1 depicts a simple search interface for an existing IR system developed by the principal investigator's research group. This system was designed for event extraction in the domain of viral zoonoses, but uses general-purpose software for web crawling and ranking (the latter is developed using Lucene Java). One marker is displayed on both the thematic map and the timeline for each returned page, but the only features extracted by this system are the disease name, formatted dates and times given in each article, and locations mentioned in the article.

The thematic map suggests several interactive functions related to opinion mining. One is content-based filtering of articles using the first type of thematic data, demographic and biostatistical attributes; another, collaborative filtering using the second type, polarity scores. Both of these use associations that can be learned from data: a user can search for articles by entering queries that express certain sentiments. In the first case, entities and attributes (e.g., symptoms, complications, and treatments) mentioned in the query may match frequent patterns in the data; in the second, polarity scores themselves can be used to retrieve their "nearest neighbors in opinion space".

3. PROPOSED DIRECTIONS
3.1 Mining Social Media (Blogs, Lists, Wikis)
As mentioned above in Section 2, our approach applies text mining to blogs and social media, a new source of information that is beginning to be studied for opinion and trending topic data, but has not been analyzed for disease-related information that can be related to these data. The novelty of our approach is that it extends named entity recognition and relationship extraction to the domain of understanding free-form text about aspects of chronic diseases (specifically, opinions about type 2 diabetes, its complications, dietary recommendations, and drug treatments). It further develops methods for mapping these new entities and relationships to the terms of an ontology for text mining, and finally leverages the text contained in many online sources to produce integrative summaries of disease mentions and associated opinions.

3.2 New Theory and Methodology
3.2.1 IR (Search Query-Driven) Workflow
In IR applications of automatic text summarization, a user enters a free-form search query and views returned hits that are summarized by topic and aspect – in this case, author opinion. These hits may be organized by space and time. For example, consider the case of a clinical health services analyst, public health analyst, doctor, patient, or other concerned individual who is interested in some aspect of a chronic disease. Such a user typically enters a query into a general-purpose search engine and is either directed to a domain-specialized web portal, also called a vertical portal, or browses through documents housed in one.

We seek to advance the state of the field by supporting structured queries, in which a user specifies fields and constraints in addition to traditional search keywords. This is achieved by combining quantitative text summarization (extraction of attribute values) with recognition of entities and relationships. The collection of documents may include some that are dynamically crawled from the web in response to the query. The output consists of structured tuples that are ranked by relevance to the query, filtered to remove hits deemed insufficiently relevant, and finally visualized in a map or timeline view. This view allows the user to more freely explore information by performing interactive manipulations such as online analytical processing or editing the set of constraints.

3.2.2 IE and Summarization (Push) Workflow
IE applications of the proposed summarization technology can be viewed as a more passive variant of the IR application described above, from the user's point of view. No initial query is supplied by the user, but there is an implicit domain of interest from which records should be displayed, corresponding to a combined set of search terms and relevance criteria. When a small set of search terms is known, the IE application can be formalized as a general case of the IR application where "every possible query" is enumerated, multiple crawls are conducted in advance, and the union of all resulting hits is ranked and filtered.

3.2.3 Improved Access through Structured Queries and Opinion Pattern Mining
This workflow is designed to provide analysts with better access to spatiotemporal data. First, it supports approximate range queries, such as: "return records of persons with fasting blood glucose levels close to the non-diabetic range of < 126 mg/dL". Second, it uses measures of semantic relatedness or similarity, e.g., "return posts about adverse effects of Metformin whose expressed sentiments are closest to those in this post". Third, it extracts information in Steps 1 – 3 that in Step 4 can be used to generate thematic maps, which portray specific aspects of a geographic region. In this research, the themes fall into two categories: the first, demographic attributes and biostatistics specified by the ontology – some disease-independent, and some disease-specific; the second, quantized measures of opinion polarity (i.e., degree of positive or negative sentiment). The increased support for flexible queries and thematic map generation, compared to IR without relationship extraction and sentiment analysis, will help reveal patterns in the data through interactive investigation.

4. TECHNICAL FOCUS AREAS
4.1 Improvements and Refinements to Theory
The following generic methods are applied in order to meet the functional requirements presented in the preceding section. We refer to them as cross-cutting because they are used in service to all of the technical aims: entity and relationship extraction, ontology development, and sentiment analysis.

4.1.1 Focused Crawling
In previous work on IE, applied to news summarization in the domain of veterinary epidemiology, we used a combination of topical and focused crawling. Topical crawling prioritizes pages to be crawled based on user-provided terms (i.e., topics) and seeds (i.e., links to initial pages), while focused crawling uses both terms and pages labeled as positive or negative examples of relevant documents. Once tag-formatted web documents (HTML or XML) are crawled, text must be extracted from them.

The on-demand IR system described above functions by passing the user query to a built-in web crawler that fetches hits from a commercial search engine (in this case, Yahoo). The results are combined with previously crawled documents, if any, and ranked and indexed as a whole.

4.1.2 Information Extraction
The state of the field in IE for web articles describing disease consists of: payload extraction (of text from HTML), baseline named entity recognition (NER), and extraction of dates, times, and locations in order to localize putative events. In addition to general open natural language processing problems such as co-reference resolution (in particular, pronouns and other anaphora), word sense disambiguation, and canonicalization of dates, other IE problems that remain unsolved include: resolving alternative abbreviations and synonyms for diseases, disambiguation of place names, associating quantities of persons affected with diseases mentioned, and deduplication of reports. The foundation of our proposed work consists of tasks known to be feasible, but for which general-purpose solutions are still being manually adapted to new domains in current practice: automated named entity recognition and topic categorization. Typically, information extraction is restricted to named entities (Person, Organization, Location, and in our domain, Disease), but attributes such as "causative agent" are not always extracted. Neither are dates, times, quantities, and place names that support the extraction of full tuples of a relationship set. This open problem is of critical significance and is therefore the first of our specific aims.

4.1.3 Web 2.0
The term Web 2.0 describes an eclectic set of technologies for online interoperability and collaboration. While it includes search, hyperlinking, collaborative authorship and tagging, web services, and syndication, our IE approach focuses on the authorship, tagging, and syndication aspects.
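As a rough illustration of the syndication aspect, the sketch below reads an RSS 2.0 feed with Python's standard library and returns only the items that were not seen on an earlier poll. The feed contents, the forum.example.org URLs, and the new_posts helper are hypothetical and introduced only for this example; the prototype system is not described at this level of detail, so this should not be read as its implementation.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feed contents, invented for illustration only.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example diabetes forum</title>
    <item>
      <title>Humalog vs. Novolog?</title>
      <link>http://forum.example.org/t/101</link>
      <guid>http://forum.example.org/t/101</guid>
      <pubDate>Mon, 02 Jan 2012 10:15:00 GMT</pubDate>
      <description>Does anyone have insight on how they compare?</description>
    </item>
    <item>
      <title>Pen needle compatibility</title>
      <link>http://forum.example.org/t/102</link>
      <guid>http://forum.example.org/t/102</guid>
      <pubDate>Tue, 03 Jan 2012 08:00:00 GMT</pubDate>
      <description>Which needles fit which pens?</description>
    </item>
  </channel>
</rss>"""


def new_posts(feed_xml, seen_ids):
    """Return feed items whose id is not in seen_ids, in feed order.

    seen_ids is updated in place so that a later poll of the same feed
    skips items that were already returned.
    """
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        # Prefer the guid element; fall back to the link if it is absent.
        post_id = item.findtext("guid") or item.findtext("link")
        if post_id is None or post_id in seen_ids:
            continue
        seen_ids.add(post_id)
        raw_date = item.findtext("pubDate")
        posts.append({
            "id": post_id,
            "title": item.findtext("title", default=""),
            # RSS 2.0 dates follow the RFC 822 format; may be missing.
            "published": parsedate_to_datetime(raw_date) if raw_date else None,
            "text": item.findtext("description", default=""),
        })
    return posts


if __name__ == "__main__":
    seen = set()
    for post in new_posts(SAMPLE_FEED, seen):
        print(post["published"], post["title"])
    # A second poll of an unchanged feed yields nothing new.
    print(len(new_posts(SAMPLE_FEED, seen)))
```

Polling a feed in this way touches one document per source rather than re-fetching every page, which is the efficiency argument for syndication over periodic crawls. Atom feeds, or feeds whose items carry neither a guid nor a link, would need additional handling.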
Collaborative authorship and editing are mainstays of specialized wikis, but many forums also provide tools for collaboration, from discussion threading and editing history to user profiles, our main source of demographic information besides posts. We will crawl or aggregate profile data, which in some social network and blogging systems (e.g., LiveJournal) is published as a publicly available feed. Another source of relational data is the link structure expressed by collaborative tagging, especially annotation by other users (cf. Wikipedia), social bookmarking (cf. Delicious), social citation (cf. CiteULike), and collaborative recommendation (cf. Digg, Reddit, and StumbleUpon). We intend to make use of available content management functionality in health wikis and electronic groups. Syndication provides a modern mechanism for refreshing content that is generally more efficient than periodic crawls. We will make use of these three categories of Web 2.0 features and other available content management functionality to assist in the extraction of relational tuples from free text writings online, and in their validation and ranking.

4.1.4 Map and Timeline Visualization
Finally, the generation of views as shown in Figure 1 is a key application of our second primary aim, to build a domain ontology for text mining in diabetes blogs and then develop automated mappings from entity recognition systems to this ontology, and our third, to extract the objects and polarity of opinions.

Thematic maps, including opinion maps, help reveal global patterns and trends that may have been previously hidden. By visualizing the attributes and related entities of a disease and depicting their variation across space and time, they allow the user to interactively discover these trends. Most previous approaches to construction of thematic maps have been based on electronic medical records and reports compiled by medical providers. The value added by IE operations that automatically populate databases and thematic maps is that they can be applied to the large volume of text that is voluntarily submitted on a daily basis to venues listed at the beginning of this section.

5. EXAMPLE: HEALTH BLOGS
The primary value added in adapting the IR and IE workflows described above is an increased capability to explore patterns and trends expressed by an entire collection of health blog posts. As a running example, consider the public health analyst who is interested in charting trends in the use of fast-acting insulin by diabetics. Users often share information about the brands of insulin they use and post opinions about their effectiveness. The following is a post archived on diabetesforums.com which is marked up (with a color coding to distinguish entity types and relationship types):

    Since I have found out in a previous thread I posted I can use most pen needles with the Novolin 4 pen I got(Haven't used it yet since I have only Humalog for rapids so far), I am here with another question.

    I have only used Humalog for a rapid... Does anyone have any insight as to how it compares to Novolog?

In this post, the user is requesting information comparing two drug products (brands of fast-acting insulin), referring to specific delivery mechanisms (pen syringes), eliciting opinions from fellow users, and specifying a requested comparison between named products. Opinions voiced by respondents to this post then discuss how heat-tolerant each brand is, how quickly it acts, and other aspects that we refer to as facets. Achievement of our primary aims will allow the analyst to chart reported biostatistics and opinions, not only about products but about trends, such as the number of units of postprandial insulin taken per gram of carbohydrate (CHO).

6. ACKNOWLEDGMENTS
Thanks to Chengxiang Zhai for fruitful discussions about the IR framework. Thanks also to Tim Weninger, Svitlana Volkova, Surya Teja Kallumadi, Wesam Elshamy, and Andrew Berggren for development work on an early prototype of the information extraction system.

7. REFERENCES
Aljandal, W., Hsu, W. H., Bahirwani, V., & Caragea, D. (2009). Ontology-Aware Classification and Association Rule Mining for Interest and Link Prediction in Social Networks. Proceedings of the AAAI 2009 Spring Symposium on the Social Semantic Web. Menlo Park, CA, USA: AAAI Press.

Brownstein, J., & Freifeld, C. (2007). HealthMap – Global Disease Alert Mapping System. Retrieved January 25, 2010, from http://www.healthmap.org.

Craven, M., & Kumlien, J. (1999). Constructing biological knowledge bases by extracting information from text sources. Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (pp. 77-86). Menlo Park, CA, USA: AAAI Press.

Jiang, J., & Zhai, C. (2006). Exploiting domain structure for named entity recognition. Proceedings of the Human Language Technology Conference/the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2006) (pp. 74-81).

Jiang, J., & Zhai, C. (2007). Instance Weighting for Domain Adaptation in NLP. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007) (pp. 264-271).

Kim, H. D., & Zhai, C. (2009). Generating Comparative Summaries of Contradictory Opinions in Text. Proceedings of the 18th ACM International Conference on Information and Knowledge Management (CIKM 2009) (pp. 385-394).

Ling, X., Mei, Q., Zhai, C., & Schatz, B. R. (2008). Mining multi-faceted overviews of arbitrary topics in a text collection. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008) (pp. 497-505).

Yangarber, R., Steinberger, R., Best, C., von Etter, P., Fuart, F., & Horby, D. (2007). Combining Information Retrieval and Information Extraction for Medical Intelligence. NATO Advanced Study Institute on Mining Massive Data Sets for Security.

Yue, L., & Zhai, C. (2008). Opinion Integration Through Semi-supervised Topic Modeling. Proceedings of the 17th International World Wide Web Conference (WWW 2008).