Opinion Mapping: Information Visualization Approaches
            for Comparative Sentiment Analysis
                       William H. Hsu                                                      Praveen Koduru
                       bhsu@ksu.edu                                                   Praveen.Koduru@gmail.com
                   Kansas State University                                                 IQ Gateway, Inc.
                      +1 785 236 8247

ABSTRACT
In this position paper, we discuss the problem of extracting
information about chronic diseases from the large volume of text
written in health blogs, mailing lists, forums, and other electronic
venues, then making this information accessible via structured
queries, while analyzing it to map out patterns among the
opinions and demographics of users. Information retrieval
systems exist for spatially-referenced demographic data about
diseases such as diabetes and their therapies, but as in the case of
communicable diseases, the databases that contain such data are
manually populated. For example, in information portals such as
HealthMap.org, which are searchable by location and disease, the
data are user-reported and collaboratively maintained, but not
automatically extracted from text. Furthermore, there is as yet no
automated means of relating sentiments expressed by users in
their text postings to their semi-structured profile data. This is
because the primary sources for this kind of information have
been statistical surveys such as opinion polls, where text
responses are often human-interpreted and demographic analysis
is done post hoc, rather than as part of an information retrieval
and extraction task. These limitations indicate a present need for
text summarization techniques that integrate quantitative
information extraction – which captures symptoms, diseases, and
complications of diseases – with opinion summarization.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information search
and retrieval – clustering, relevance feedback, selection process;
H.2.8 [Database Management]: Database applications – data
mining, spatial databases and GIS

General Terms
Algorithms, Experimentation, Human Factors

Keywords
sentiment analysis, social networks, geoinformatics, opinion
mining, subjectivity, information extraction, information
visualization, human-computer interaction

1. INTRODUCTION
In this paper, we address the problem of information retrieval and
information extraction in subjective domains, with applications to
visualization of opinions – specifically, thematic mapping of
opinions. At present, there is a dearth of methods for integrating
user profile data for social networks with blog posts, tweets, and
other content from the associated social media. These limitations
present an integrative challenge for human-computer interaction
(HCI) and information retrieval (IR). Towards this end, the
specific aims of the research proposed in this position paper are
as follows:

1.   Aim 1. Extend known algorithms for named entity
     recognition and relationship extraction, to produce basic
     summaries of diseases and treatments mentioned in texts.
     The technical objective is to tag where basic entities and
     opinions are mentioned in freely available text (including
     both user posts and profiles), then map these tagged elements
     in space, time, and by topic, to acceptable levels of precision
     and recall.

2.   Aim 2. Adapt basic known techniques to the domain of type
     2 diabetes – specifically, extracting data from text
     discussions of diabetes that are archived from health blogs
     and forums using web crawlers. This entails developing a
     means of handling entities and quantitative data that have not
     previously been extracted from text, such as information
     concerning insulin and oral anti-diabetic drug dosage, HbA1c
     levels, etc. Another functional requirement is some
     mechanism for entity reference resolution, e.g., abbreviations
     and synonyms, for known terms. Finally, a domain-specific
     ontology of relevant symptoms, disease attributes,
     complications, and treatments is proposed. For type 2
     diabetes, this includes topics frequently discussed in health
     blogs and forums: food groups, meal plans, nutritional
     constraints, and conditions such as obesity that are linked to
     diabetes. This shall facilitate information retrieval
     applications such as question answering about meal plans
     recommended by primary care physicians and specialists.

3.   Aim 3. Develop methods for sentiment analysis and
     improve existing ones, to summarize opinions and
     discover patterns. The technical objective is to relate
     demographic data extracted from text and profiles to
     qualitative data – namely, the polarity of text at the
     document, sentence, or aspect level, aggregated across
     demographic categories such as geographic region of
     residence. Objects of interest for sentiment analysis
     include prescribed therapies and specifically side effects,
     but can extend to disease aspects and complications.
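
As a rough illustration of the aggregation described in Aim 3, the following minimal Python sketch averages polarity scores for each pair of geographic region and opinion target. The record layout, the function name mean_polarity_by_group, and all values are hypothetical; they are assumptions made for illustration and are not taken from the proposed system or from any data set.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (region of residence, opinion target, polarity in [-1, 1]).
# All values are invented for illustration only.
posts = [
    ("Kansas", "metformin", -0.5),
    ("Kansas", "metformin", 0.25),
    ("Kansas", "insulin", 0.75),
    ("Ohio", "metformin", -0.75),
]


def mean_polarity_by_group(records):
    """Return the average polarity for each (region, target) pair."""
    grouped = defaultdict(list)
    for region, target, polarity in records:
        grouped[(region, target)].append(polarity)
    return {key: mean(scores) for key, scores in grouped.items()}


summary = mean_polarity_by_group(posts)
for (region, target), score in sorted(summary.items()):
    print(f"{region:8s} {target:10s} {score:+.3f}")
```

A per-region mean is only one possible summary; the same grouping step could feed counts, medians, or any other statistic that a thematic map might display.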

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
EuroHCIR 2012.
Copyright 2012 ACM

The overall goal of this approach is to develop an integrative
technology for summarizing online text about chronic diseases,
capturing opinions from users’ posts and demographic data from
a combination of their posts and profiles, and finally using these
to discover global patterns indicated by the set of all text
documents. The central hypothesis of this work is that a
combination of entity and relationship extraction, driven by a
domain-specific ontology of terms, will result in more precise
and accurate summarization of opinions. This will increase the
usefulness of free-form text, written by users of social media, in
understanding patterns that are reflected in the opinions and
demographics of chronic disease patients.

2. SIGNIFICANCE
2.1 Information Extraction from Health Blogs
The chief potential impact of the research framework and test bed
proposed in Section 1 is to provide assistive technologies to
public health analysts and health services analysts who are using
blogs, microblogs (e.g., Twitter), and other social media to
explore user opinions about chronic disease issues. As an
example, in the application domain of type 2 diabetes, these
include dietary treatments such as carbohydrate control,
complications such as gastroparesis induced by diabetes that may
pose digestive constraints, and recommendations of primary care
physicians, therapists, endocrinologists, nutritionists, etc.

The availability of mailing lists, blogs, wikis, and other electronic
media for content management and dissemination has resulted in
rapid growth in the volume of online text data containing
voluntarily expressed public opinions about health issues. While
general-purpose metadata tools exist for annotating this text, the
opinions themselves remain a largely unexplored source of
information about how chronic diseases affect populations.
Meanwhile, the task of relating content from these various
self-publishing media to semi-structured profile data from their
users has not yet been effectively automated.

We advocate development of application test beds and
experimental systems aimed at improving techniques for
information extraction, ontology development and mapping, and
text mining to identify opinion patterns. The potential progress in
these areas is due in part to the approach of combining
information extraction to discover disease mentions with
sentiment analysis to establish opinions, and in part to the
application of this approach to a new source of data: free-form
text describing user demographics, attributes of the chronic
disease of interest and its related entities, and opinions and
semi-structured profile data.

To help public health researchers tap into these freely available
but unexplored sources of opinions, we propose to develop
information extraction (IE) and summarization methods geared
toward health blog postings and similar text. Such postings contain
not only opinions, and attribution information that can be used to
link them to the users who expressed them, but also factual data
about the posters and their opinions. This data can help place
opinions in a comparative context with population statistics, such
as the reporting frequency of symptoms, side effects, and
complications.

2.2 Ontology Development
The research approach centers around using information
extraction to obtain structured data in the form of records about
chronic disease references in text, which are then linked to users
via relational data extracted from their profiles. However, the
body of relevant concepts in the healthcare domain and in the
clinical domain theory of each chronic disease is much broader.
Currently there exist pre-clinical (genomic and proteomic) and
clinical translational ontologies that contain information relevant
to diabetes, but they do not provide the requisite concepts for
mining free-form text written by lay users who are discussing
diabetes online. We propose to develop an ontology for text
mining in diabetes, and the mappings from extracted entities and
relationships into this ontology.

2.3 Opinion Mining (Sentiment Analysis)
This aspect of the proposed work focuses on a basic research
problem: sentiment analysis from text, also known as opinion
mining, whose objective is to determine from analysis of a written
document what the author’s attitude towards an identifiable topic
is. This attitude can be subjective or objective; it can be
identified as an evaluation (positive or negative), a declaration of
the author’s emotional attitude, or an expression intended to evoke
an emotional response in the reader. Subjects of interest include
chronic diseases, their features or aspects including symptoms,
complications, and treatments, and related health services.

2.4 Current State of the Field

Figure 1. Prototype event search based on a previous IR
system for veterinary epidemiology.

Figure 1 depicts a simple search interface for an existing IR
system developed by the principal investigator’s research group.
This system was designed for event extraction in the domain of
viral zoonoses, but uses general-purpose software for web
crawling and ranking (the latter is developed using Lucene Java).
One marker is displayed on both the thematic map and the
timeline for each returned page, but the only features extracted by
this system are the disease name, formatted dates and times given
in each article, and locations mentioned in the article.

The thematic map suggests several interactive functions related to
opinion mining. One is content-based filtering of articles using
the first type of thematic data, demographic and biostatistical
attributes; another, collaborative filtering using the second type,
polarity scores. Both of these use associations that can be learned
from data: a user can search for articles by entering queries that
express certain sentiments. In the first case, entities and attributes
(e.g., symptoms, complications, and treatments) mentioned in the
query may match frequent patterns in the data; in the second,
polarity scores themselves can be used to retrieve their “nearest
neighbors in opinion space”.

3. PROPOSED DIRECTIONS
3.1 Mining Social Media (Blogs, Lists, Wikis)
As mentioned above in Section 2, our approach applies text
mining to blogs and social media, a new source of information
that is beginning to be studied for opinion and trending topic data,
but has not been analyzed for disease-related information that can
be related to these data. The novelty of our approach is that it
extends named entity recognition and relationship extraction to
the domain of understanding free-form text about aspects of
chronic diseases (specifically, opinions about type 2 diabetes, its
complications, dietary recommendations, and drug treatments). It
further develops methods for mapping these new entities and
relationships to the terms of an ontology for text mining, and
finally leverages the text contained in many online sources to
produce integrative summaries of disease mentions and associated
opinions.

3.2 New Theory and Methodology
3.2.1 IR (Search Query-Driven) Workflow
In IR applications of automatic text summarization, a user enters a
free-form search query and views returned hits that are
summarized by topic and aspect – in this case, author opinion.
These hits may be organized by space and time. For example,
consider the case of a clinical health services analyst, public
health analyst, doctor, patient, or other concerned individual who
is interested in some aspect of a chronic disease. Such a user
typically enters a query into a general-purpose search engine and
is either directed to a domain-specialized web portal, also called a
vertical portal, or browses through documents housed in one.

We seek to advance the state of the field by supporting structured
queries, in which a user specifies fields and constraints in addition
to traditional search keywords. This is achieved by combining
quantitative text summarization (extraction of attribute values)
with recognition of entities and relationships. The collection of
documents may include some that are dynamically crawled from
the web in response to the query. The output consists of
structured tuples that are ranked by relevance to the query,
filtered to remove hits deemed insufficiently relevant, and finally
visualized in a map or timeline view. This view allows the user to
more freely explore information by performing interactive
manipulations such as online analytical processing or editing the
set of constraints.

3.2.2 IE and Summarization (Push) Workflow
IE applications of the proposed summarization technology can be
viewed as a more passive variant of the IR application described
above, from the user’s point of view. No initial query is supplied
by the user, but there is an implicit domain of interest from which
records should be displayed, corresponding to a combined set of
search terms and relevance criteria. When a small set of search
terms is known, the IE application can be formalized as a general
case of the IR application where “every possible query” is
enumerated, multiple crawls are conducted in advance, and the
union of all resulting hits is ranked and filtered.

3.2.3 Improved Access through Structured Queries and Opinion Pattern Mining
This workflow is designed to provide analysts with better access
to spatiotemporal data. First, it supports approximate range
queries, such as: “return records of persons with fasting blood
glucose levels close to the non-diabetic range of < 126 mg/dL”.
Second, it uses measures of semantic relatedness or similarity,
e.g., “return posts about adverse effects of Metformin whose
expressed sentiments are closest to those in this post”. Third, it
extracts information in Steps 1 – 3 that in Step 4 can be used to
generate thematic maps, which portray specific aspects of a
geographic region. In this research, the themes fall into two
categories: the first, demographic attributes and biostatistics
specified by the ontology – some disease-independent, and some
disease-specific; the second, quantized measures of opinion
polarity (i.e., degree of positive or negative sentiment). The
increased support for flexible queries and thematic map
generation, compared to IR without relationship extraction and
sentiment analysis, will help reveal patterns in the data through
interactive investigation.

4. TECHNICAL FOCUS AREAS
4.1 Improvements and Refinements to Theory
The following generic methods are applied in order to meet the
functional requirements presented in the preceding section. We
refer to them as cross-cutting because they are used in service to
all of the technical aims: entity and relationship extraction,
ontology development, and sentiment analysis.

4.1.1 Focused Crawling
In previous work on IE, applied to news summarization in the
domain of veterinary epidemiology, we used a combination of
topical and focused crawling. Topical crawling prioritizes pages
to be crawled based on user-provided terms (i.e., topics) and seeds
(i.e., links to initial pages), while focused crawling uses both
terms and pages labeled as positive or negative examples of
relevant documents. Once tag-formatted web documents (HTML
or XML) are crawled, text must be extracted from them.

The on-demand IR system described above functions by passing
the user query to a built-in web crawler that fetches hits from a
commercial search engine (in this case, Yahoo). The results are
combined with previously crawled documents, if any, and ranked
and indexed as a whole.

4.1.2 Information Extraction
The state of the field in IE for web articles describing disease
consists of: payload extraction (of text from HTML), baseline
named entity recognition (NER), and extraction of dates, times,
and locations in order to localize putative events. In addition to
general open natural language processing problems such as
co-reference resolution (in particular, pronouns and other anaphora),
word sense disambiguation, and canonicalization of dates, other
IE problems that remain unsolved include: resolving alternative
abbreviations and synonyms for diseases, disambiguation of place
names, associating quantities of persons affected with diseases
mentioned, and deduplication of reports. The foundation of our
proposed work consists of tasks known to be feasible, but for
which general-purpose solutions are still being manually adapted
to new domains in current practice: automated named entity
recognition and topic categorization. Typically, information
extraction is restricted to named entities (Person, Organization,
Location, and in our domain, Disease), but attributes such as
“causative agent” are not always extracted. Neither are dates,
times, quantities, and place names that support the extraction of
full tuples of a relationship set. This open problem is of critical
significance and is therefore the first of our specific aims.

4.1.3 Web 2.0
The term Web 2.0 describes an eclectic set of technologies for
online interoperability and collaboration. While it includes search,
hyperlinking, collaborative authorship and tagging, web services,
and syndication, our IE approach focuses on the authorship,
tagging, and syndication aspects. Collaborative authorship and
editing are mainstays of specialized wikis, but many forums also
provide tools for collaboration, from discussion threading and
editing history to user profiles, our main source of demographic
information besides posts. We will crawl or aggregate profile
data, which in some social network and blogging systems (e.g.,
LiveJournal) is published as a publicly available feed. Another
source of relational data is the link structure expressed by
collaborative tagging, especially annotation by other users (cf.
Wikipedia), social bookmarking (cf. Delicious), social citation (cf.
CiteULike), and collaborative recommendation (cf. Digg, Reddit,
and StumbleUpon). We intend to make use of available content
management functionality in health wikis and electronic groups.
Syndication provides a modern mechanism for refreshing content
that is generally more efficient than periodic crawls. We will
make use of these three categories of Web 2.0 features and other
available content management functionality to assist in the
extraction of relational tuples from free text writings online, and
in their validation and ranking.

4.1.4 Map and Timeline Visualization
Finally, the generation of views as shown in Figure 1 is a key
application of our second primary aim, to build a domain
ontology for text mining in diabetes blogs and then develop
automated mappings from entity recognition systems to this
ontology, and our third, to extract the objects and polarity of
opinions.

Thematic maps, including opinion maps, help reveal global
patterns and trends that may have been previously hidden. By
visualizing the attributes and related entities of a disease and
depicting their variation across space and time, they allow the
user to interactively discover these trends. Most previous
approaches to construction of thematic maps have been based on
electronic medical records and reports compiled by medical
providers. The value added by IE operations that automatically
populate databases and thematic maps is that they can be applied
to the large volume of text that is voluntarily submitted on a daily
basis to venues listed at the beginning of this section.

5. EXAMPLE: HEALTH BLOGS
The primary value added in adapting the IR and IE workflows
described above is an increased capability to explore patterns and
trends expressed by an entire collection of health blog posts. As a
running example, consider the public health analyst who is
interested in charting trends in the use of fast-acting insulin by
diabetics. Users often share information about the brands of
insulin they use and post opinions about their effectiveness. The
following is a post archived on diabetesforums.com which is
marked up (with a color coding to distinguish entity types and
relationship types):

     Since I have found out in a previous thread I posted I
     can use most pen needles with the Novolin 4 pen I
     got(Haven't used it yet since I have only Humalog for
     rapids so far), I am here with another question.
     I have only used Humalog for a rapid... Does anyone
     have any insight as to how it compares to Novolog?

In this post, the user is requesting information comparing two
drug products (brands of fast-acting insulin, referring to specific
delivery mechanisms (pen syringes), eliciting opinions from
fellow users, and specifying a requested comparison between
named products. Opinions voiced by respondents to this post
then discuss how heat-tolerant each brand is, how quickly it acts,
and other aspects that we refer to as facets. Achievement of our
primary aims will allow the analyst to chart reported biostatistics
and opinions, not only about products but about trends, such as
the number of units of postprandial insulin taken per gram of
CHO.

6. ACKNOWLEDGMENTS
Thanks to Chengxiang Zhai for fruitful discussions about the IR
framework. Thanks also to Tim Weninger, Svitlana Volkova,
Surya Teja Kallumadi, Wesam Elshamy, and Andrew Berggren
for development work on an early prototype of the information
extraction system.

7. REFERENCES
Aljandal, W., Hsu, W. H., Bahirwani, V., & Caragea, D. (2009).
Ontology-Aware Classification and Association Rule Mining for
Interest and Link Prediction in Social Networks. Proceedings of
the AAAI 2009 Spring Symposium on the Social Semantic Web.
Menlo Park, CA, USA: AAAI Press.

Brownstein, J., & Freifeld, C. (2007). HealthMap – Global
Disease Alert Mapping System. Retrieved January 25, 2010, from
http://www.healthmap.org.

Craven, M., & Kumlien, J. (1999). Constructing biological
knowledge bases by extracting information from text sources.
Proceedings of the 7th International Conference on Intelligent
Systems for Molecular Biology (pp. 77-86). Menlo Park, CA,
USA: AAAI Press.

Jiang, J., & Zhai, C. (2006). Exploiting domain structure for
named entity recognition. Proceedings of the Human Language
Technology Conference/the North American Chapter of the
Association for Computational Linguistics (HLT-NAACL 2006)
(pp. 74-81).

Jiang, J., & Zhai, C. (2007). Instance Weighting for Domain
Adaptation in NLP. Proceedings of the 45th Annual Meeting of the
Association for Computational Linguistics (ACL 2007)
(pp. 264-271).

Kim, H. D., & Zhai, C. (2009). Generating Comparative
Summaries of Contradictory Opinions in Text. Proceedings of the
18th ACM International Conference on Information and
Knowledge Management (CIKM 2009) (pp. 385-394).

Ling, X., Mei, Q., Zhai, C., & Schatz, B. R. (2008). Mining
multi-faceted overviews of arbitrary topics in a text collection.
Proceedings of the 14th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining (KDD 2008)
(pp. 497-505).

Yangarber, R., Steinberger, R., Best, C., von Etter, P., Fuart, F., &
Horby, D. (2007). Combining Information Retrieval and
Information Extraction for Medical Intelligence. NATO Advanced
Study Institute on Mining Massive Data Sets for Security.

Lu, Y., & Zhai, C. (2008). Opinion Integration Through
Semi-supervised Topic Modeling. Proceedings of the 17th International
World Wide Web Conference (WWW 2008).