=Paper=
{{Paper
|id=None
|storemode=property
|title=Data in Context: Aiding News Consumers while Taming Dataspaces
|pdfUrl=https://ceur-ws.org/Vol-1025/vision3.pdf
|volume=Vol-1025
|dblpUrl=https://dblp.org/rec/conf/dbcrowd/00020M13
}}
==Data in Context: Aiding News Consumers while Taming Dataspaces==
DBCrowd 2013: First VLDB Workshop on Databases and Crowdsourcing
Adam Marcus∗, Eugene Wu, Sam Madden
MIT CSAIL
{marcua, sirrice, madden}@csail.mit.edu
∗Eugene and Adam contributed to this paper equally.

Copyright © 2013 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

...were it left to me to decide whether we should have a government without newspapers, or newspapers without a government, I should not hesitate a moment to prefer the latter.
— Thomas Jefferson

ABSTRACT
We present MuckRaker, a tool that provides news consumers with datasets and visualizations that contextualize facts and figures in the articles they read. MuckRaker takes advantage of data integration techniques to identify matching datasets, and makes use of data and schema extraction algorithms to identify data points of interest in articles. It presents the output of these algorithms to users requesting additional context, and allows users to further refine these outputs. In doing so, MuckRaker creates a synergistic relationship between news consumers and the database research community, providing training data to improve existing algorithms, and a grand challenge for the next generation of dataspace management research.

1. INTRODUCTION
One of the basic mechanisms through which end-users consume and interact with data is by reading a news source. Many articles are based on one or a few data points, be it the earnings of a company, new unemployment numbers for a country, or the number of people at a political rally. Infographics compactly present a larger set of points, typically through aggregate statistics grouped temporally or by category. Investigative efforts often uncover and join several datasets to deliver a story. In business reporting, the data is even more apparent: companies like Reuters and Bloomberg are successful in part because they generate and serve enormous amounts of data. In each of these examples, the end product—an article or a graphic—is conceptually a view over a dataset. For example, when reporting the earnings of a specific company, the article presents a view of a dataset of all company earnings in the same quarter, or of the company's earnings during previous quarters.

In this light, an article's background includes tuples outside of the existing view; access to this extra information would allow the reader to better understand the article's context. However, articles may miss key contextual cues for many reasons: 1) a lack of space or time, as is common in minute-by-minute reporting; 2) the article is a segment in a multi-part series; 3) the reader doesn't have the assumed background knowledge; 4) the newsroom is resource-limited and cannot do additional analysis in-house; 5) the writer's agenda is better served through the lack of context; or 6) the context is not materialized in a convenient place (e.g., there is no readily accessible table of historical earnings). In some cases, the missing data is accessible (e.g., on Wikipedia), and with enough effort, an enterprising reader can usually analyze or visualize it themselves. Ideally, all news consumers would have tools to simplify this task.

Many database research results could aid readers, particularly those related to dataspace management. Matching records from articles with those in a relation is an entity resolution problem; aggregating contextual information from multiple sources is a schema matching and data integration problem; searching for the missing data is a deep web problem; extracting relations from web pages and text is addressed by projects like TextRunner [9] and WebTables [3]. While these efforts have pushed the limits of automated solutions, recent human-assisted approaches present new opportunities: existing automated algorithms can be made semi-automatic by asking humans to vet algorithmic results, iteratively improving the algorithms over time. We believe news is an ideal match for this approach.

We believe that, given the promise of contextualized articles, readers would be willing to answer a small number of simple questions. For example, we might ask a user to highlight the data in question, identify related datasets, ensure the data in an article properly maps to a dataset, or select a visualization. Fortuitously, each small task generates inputs for semi-automated dataspace management algorithms. Readers can also benefit from previous users' answers and view previously generated context without additional effort.

We view data in context as a grand challenge in dataspace management. If we design algorithms that are good enough to be guided by an average reader and provide the context they lack, then both society and the database community benefit. In addition, journalists can use the same tools to proactively identify relevant datasets or visualizations.

We envision a proof-of-concept user interface for an article contextualization service called MuckRaker. It serves as a front-end to the numerous database problems mentioned above. We have developed it as a bookmarklet for web browsers that allows users to select a data point on a webpage, answer a few questions about its context and origin, and see a visualization of a dataset that contextualizes their
reading experience. In this paper we outline:
1. A user interface for interacting with data items on the
web that contextualizes them,
2. A study of the difficult dataspace management prob-
lems average news consumers can help solve, and
3. A collection of the challenges posed by data embedded
in articles that span the web.
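The end-to-end flow behind these contributions (extract a record from highlighted text, match it against candidate tables, choose a visualization) can be sketched as a small pipeline. Everything below is illustrative: the function names, the `Record` type, and the toy corpus are hypothetical stand-ins for the components described in the rest of the paper, not MuckRaker's actual code.

```python
from dataclasses import dataclass

# Hypothetical sketch of the MuckRaker request flow. Each stage is a
# placeholder for the extraction, matching, and visualization algorithms
# discussed in this paper.

@dataclass
class Record:
    values: dict  # attribute name -> extracted value

def extract_record(highlight: str) -> Record:
    """Stage 1: pull entities and values out of user-highlighted text."""
    # Real extraction would use NLP; here we just wrap the raw text.
    return Record(values={"text": highlight})

def match_datasets(record: Record, corpus: list) -> list:
    """Stage 2: rank candidate tables by naive attribute overlap."""
    def score(table):
        return len(set(record.values) & set(table["schema"]))
    return sorted(corpus, key=score, reverse=True)

def choose_visualization(table: dict) -> str:
    """Stage 3: a trivial default -- timeseries if a date column exists."""
    return "timeseries" if "date" in table["schema"] else "bar"

# Wiring the stages together on a one-table toy corpus:
corpus = [{"name": "bombings", "schema": ["date", "deaths", "text"]}]
record = extract_record("Twin car bombs ... killing at least 19 people.")
best = match_datasets(record, corpus)[0]
print(choose_visualization(best))  # timeseries, since the table has a date
```

The interesting research, of course, lives inside each stage; the sections below expand on them.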
2. TALE OF TWO COMMUNITIES
The problem of contextualizing data lies at the intersection of two communities—news consumers and the database community—and can benefit both. As a social endeavor, we believe it encourages the general population to interact with data. MuckRaker not only helps curious readers better understand the news, but can also help users spot biased attempts to generalize a single outlier data value as the norm. It serves data management systems by helping clean, normalize, and aggregate the data that consumers care about.

(a) MuckRaker bookmarklet.

The unfettered MuckRaker vision encompasses several open problems facing the database community. We believe variants of the problem are tractable and can be solved with careful application of existing approaches. In the rest of this section, we illustrate how a constrained instance of this problem can be solved for an example article that centers on a car bombing in Iraq1. We first describe MuckRaker from a user's perspective, then explore a viable system implementation, and finally explore extensions that increase MuckRaker's utility and introduce interesting research problems.

(b) User selects data in text.
2.1 A Tool for News Consumers
Suppose a user reads an article about a car bomb in Iraq and wants to know the scale of similar attacks. She clicks the MuckRaker bookmarklet (Figure 1a), which asks her to select the region of text that she would like contextualized. The user highlights the sentence "Twin car bombs exploded in central Baghdad on Tuesday, killing at least 19 people." MuckRaker extracts key entities and values around the highlighted region, and presents the data to the user (Figure 1b). MuckRaker could not identify attribute names for the first and last columns, so it prompts the user to fill them in. The attribute name in the third column is not precise enough, so the user can click it to edit the name. When the user is satisfied, she clicks "Find Context."
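The extraction step in this walkthrough can be approximated with simple pattern matching. The sketch below is a stand-in for the NLP techniques the backend would actually use (Section 2.2.1); the regular expressions and the gazetteer of known places are hypothetical.

```python
import re

# A toy extractor for the highlighted sentence. Real entity extraction
# would use algorithms like KnowItAll [6]; the gazetteer here is a
# hypothetical stand-in for a table of known entities.
KNOWN_PLACES = {"Baghdad", "Iraq"}

def extract(sentence: str) -> dict:
    record = {}
    # Numbers preceded by "killing at least" become a death count.
    m = re.search(r"killing at least (\d+)", sentence)
    if m:
        record["deaths"] = int(m.group(1))
    # Capitalized tokens that appear in the gazetteer become locations.
    for token in re.findall(r"[A-Z][a-z]+", sentence):
        if token in KNOWN_PLACES:
            record["location"] = token
    # Weekday mentions are kept for the user to resolve into a date.
    m = re.search(r"\b(Monday|Tuesday|Wednesday|Thursday|Friday|"
                  r"Saturday|Sunday)\b", sentence)
    if m:
        record["day"] = m.group(1)
    return record

sentence = ("Twin car bombs exploded in central Baghdad on Tuesday, "
            "killing at least 19 people.")
print(extract(sentence))
# {'deaths': 19, 'location': 'Baghdad', 'day': 'Tuesday'}
```

Note how brittle such patterns are on their own; this is exactly why the interface asks the user to vet and complete the extracted record.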
MuckRaker receives a request containing the highlights and extracted data. MuckRaker finds contextual information about the article using article type classification; fact, entity, and numerical extraction; and other natural language processing techniques. This information, weighted by its distance to the selected text, is fed to the algorithms described in Section 2.2. The information and the tables ranked most relevant by the algorithms are presented to the user (Figure 1c). The user can select the desired table, or update this information and re-execute the matching algorithms.

(c) Tables and article context.

In this example, the first table is a list of mass bombings2 and the second is of civilian casualties during the first two weeks of the Iraq war3. If a matching row exists, MuckRaker will attempt to populate empty fields with new data values.
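The ranking step just described can be sketched as scoring each candidate table by how well the extracted record overlaps its schema and contents; the distance weighting and schema-matching machinery of Section 2.2 would refine this considerably. All names and data below are illustrative, not from the real system.

```python
# Illustrative ranking of candidate tables: score each table by schema
# overlap with user-supplied attribute names plus how many extracted
# values appear in its rows. A stand-in for the algorithms of Section 2.2.

def rank_tables(record: dict, tables: list) -> list:
    def score(table):
        schema_hits = len(set(record) & set(table["schema"]))
        value_hits = sum(
            1 for row in table["rows"]
            for v in record.values() if v in row
        )
        return schema_hits + value_hits
    return sorted(tables, key=score, reverse=True)

record = {"location": "Baghdad", "deaths": 19}
tables = [
    {"name": "mass bombings", "schema": ["date", "location", "deaths"],
     "rows": [["2012-08-01", "Baghdad", 19]]},
    {"name": "war casualties", "schema": ["week", "casualties"],
     "rows": [["2003-03-20", 752]]},
]
print(rank_tables(record, tables)[0]["name"])  # mass bombings
```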
1 http://www.nytimes.com/2012/08/01/world/middleeast/twin-car-bombs-hit-central-baghdad.html
2 http://cursor.org/stories/iraq.html
3 http://en.wikipedia.org/wiki/List_of_mass_car_bombings
(d) Visualizations.

Figure 1: The MuckRaker user interface.
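MuckRaker bootstraps its default chart from the visualization configurations that previous users stored for a dataset. A minimal sketch of that bootstrapping logic follows; the configuration format and function name are hypothetical.

```python
from collections import Counter

# Hypothetical sketch of visualization bootstrapping: default to the
# chart type most often chosen by previous users of a dataset.

def default_chart(past_configs: list, fallback: str = "bar") -> str:
    if not past_configs:
        return fallback
    counts = Counter(cfg["chart"] for cfg in past_configs)
    return counts.most_common(1)[0][0]

past = [
    {"chart": "timeseries", "x": "date", "y": "deaths"},
    {"chart": "map", "x": "location", "y": "deaths"},
    {"chart": "timeseries", "x": "date", "y": "deaths"},
]
print(default_chart(past))  # timeseries
```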
In this case, neither table contains the user-selected data point, so MuckRaker inserts the data into both tables, and fills in as many of the matched fields as possible. The row is highlighted for the user so that she can optionally fill in missing fields. She is interested in mass bombings, so she selects the first table, corrects the date field, and fills in the type column with "car bomb," which is autocompleted. When she clicks "Use this table," the updated row is sent to MuckRaker, and she is shown candidate visualizations.

MuckRaker selects an initial visualization using the selected dataset. Previous users commonly created timeseries and maps, and Figure 1d shows these top plots: deaths by date and deaths by location. The new data point is highlighted to distinguish it from the contextual data (the red points). The user can specify that she is interested in different columns and construct her own chart through a chart builder, where she can specify chart types, axes, filters, and facets. MuckRaker stores the visualization configuration that the user picked, and uses it to bootstrap future requests.

2.2 Database Community
We believe that data in context can be implemented at a basic but useful level using existing technology. We hope the database community can use the core system as a starting point for numerous research problems. In the rest of this section, we sketch a basic implementation, and then describe how user interactions can both improve existing algorithms with training data and introduce new challenges that must be addressed by future data contextualization systems.

2.2.1 Core Implementation
The MuckRaker interface is implemented as a browser bookmarklet (JavaScript that operates on the current web page). We assume that we start with a reasonable collection of tables that are clean, deduplicated, and complete. In addition, table metadata includes the surrounding text (e.g., Wikipedia article text). This data can be bootstrapped from sites like Wikipedia and groups such as the World Bank, or through techniques like those in work by Cafarella et al. [3].

User-selected text is sent to a backend server, which extracts numbers and their units (e.g., $, miles), dates, and known entities. Entity extraction algorithms such as KnowItAll [6] can identify the key nouns and topics. We can ask the user to clean and highlight the extracted values.

The set of possible tables is filtered by clustering the article text with table metadata. For example, an article related to the Iraq war will match tables extracted from Wikipedia articles about Iraq. We can further reduce and rank tables by canonicalizing user-specified attribute names using techniques similar to those used by Das Sarma et al. [5] to perform schema ranking. A final ranking comes from comparing the table values with those in the user-extracted record.

2.2.2 Research Problems
The user interaction provides a number of strong signals that make for interesting research problems. We describe some problems in the context of interactions in the extraction, selection, and visualization phases of the user workflow.

Data Extraction and Integration. The user-selected text explicitly defines record boundaries. The collection of all user highlights can be used to train classifiers to detect strings that contain records, and to focus the analysis that data record extractors like TextRunner [9] need to perform. The user can facilitate record extraction, but values may be named, formatted, or scaled inconsistently with existing tables. With a tighter human-in-the-loop training cycle, we have more hope of correcting such extraction anomalies.

Another classification challenge lies in identifying the type of context that the user is interested in. She may want to see an IBM earnings report in the context of historical earnings instead of similar technology companies (select the appropriate attributes). Alternatively, a European reader may prefer to see European companies rather than American companies (select the best records). Subdividing context automatically according to user preferences is a key challenge.

Structured Search. Das Sarma et al. recently studied related table search [5], where a user specifies a table and the search engine identifies others that extend the table either horizontally or vertically. MuckRaker requires a similar search, but uses partial examples extracted from article text. In addition, identifying the table is not enough. To be useful, tables must be transformed (e.g., currency conversion), projected, filtered (e.g., to identify a small number of representative rows), and aggregated (e.g., to roll county statistics up to state granularity). Learning these steps is another research challenge.

Visualization. Automated visualization selection is difficult because it is both a dimensionality reduction problem and a design problem. Historical earnings are best plotted as a time series, while the comparative earnings of similar companies are better represented with a bar chart. A human in the loop would better assist and train these decisions. MuckRaker can gather large volumes of training data on user-preferred columns based on the final visualizations that users select. One project that has facilitated user-driven visualization construction is the ManyEyes project [8], and we can use its findings as a basis for our design.

Data Management. The projects that have most closely integrated these individual research problems into a larger data integration and search system are TextRunner [9] and WebTables [3]. To the extent that these projects have been evaluated by how much deep web data can be added to web search indices, we think that the grand challenge raised by contextualizing data serves as a higher bar. While indexing websites by the data they store is useful, being able to retrieve datasets that are relevant to a user's current context would be even more powerful. The WebTables authors realize this as well: Fusion Tables [7] surfaces the data found in web tables and other datasets directly in search results, suggesting that search-based structured data retrieval is a meaningful measure of the effectiveness of these techniques.

3. GENERALIZING MUCKRAKER
In the interface described above, all user actions succeeded: the user found the correct dataset, the new record was reasonably extracted, there were no duplicate records in the table, and roughly one relevant record was extracted from the article. We now consider more challenging cases, and how the user interface can be augmented to handle them.

No matching datasets. Consider a situation where the user highlights some text, clicks "Search," and no useful datasets are returned. In these situations, the user can utilize the search bar in Figure 1c to enter her own keywords, and potentially find the dataset. Failing that, she can click on "I can't find a dataset," and be prompted to either: 1)
Point at a webpage containing the dataset, or 2) Specify column headings (the schema) and a row that can be extracted from this document in a spreadsheet interface. In scenario 1, an automated extractor can present the user with the newly extracted dataset, and in scenario 2, MuckRaker can search again for a matching dataset with the given schema and data point. Should this final search fail, the user will be invited to add more entries to the spreadsheet.

Incorrect extraction. In situations where a dataset is correctly identified but records are extracted incorrectly, the user can edit the row in the familiar spreadsheet interface of Figure 1c. It is possible that a user will incorrectly add field values, but MuckRaker aggregates multiple user corrections before trusting any one user's input.

Duplicate data. Duplicate rows within a table can arise if multiple users submit different articles about the same event. We can handle these by calculating a similarity measure between rows. For any newly added row that is above some threshold similarity to an existing row, we can ask a user to verify that she indeed means to add a new data point. Data duplicated across tables requires more care. We wish to know when a table should be merged with another table, which might happen when enough rows between two tables are similar. In this situation, we can ask a user during dataset search (Figure 1c) whether the table they selected is the same as another one. If the user indicates that it is, they are then presented with the columns of both tables aligned by a schema mapping algorithm, and invited to re-arrange the mapping as they see fit. The system can merge two tables if enough users mark them as merge candidates.

Article-embedded datasets. So far, our user has highlighted a sentence that roughly translates to a single record in a table. It is often the case that an article discusses more than one data point. For example, an article that describes a trend essentially embeds multiple points from a timeseries into a dataset. Alternatively, an article summarizing a study that compares multiple groups of people would embed data about each group. Summarizing all of the extracted points in a table might be cumbersome for the user. It might be simpler to summarize the extracted data in a visualization, allowing the user to drag the points of a timeseries to match a trend, or move the bars in a bar graph to represent the relative differences between groups.

Uncertain facts. It is often the case that the news covers facts that contradict one another (e.g., "Prior link between cancer and fruit juice challenged in latest research"). Other facts might simply expire over time. For example, records that refer to "The President, aged 51" refer to a different president or an incorrect age depending on the date of the article. A user overseeing the data extraction who knew good schema design practices (e.g., storing the President's date of birth rather than an age) could have avoided some of these issues, but MuckRaker does not leave schema design to expert database designers. To handle these types of expiration and uncertainty, attaching a source and date to extracted data may help, as would periodically asking users whether certain records in a table are still valid.

4. CONCLUSION
The core contribution of MuckRaker is to utilize a mixed-initiative interface to improve dataspace management operations as a byproduct of contextualizing the news. It would be interesting to see where else the insertion of a lightweight user interface can act as a boon to database research while benefiting another community.

There have been other calls to arms in the database community to assist the journalism process. Most prominently, Cohen et al. outlined many ways in which computational journalism can be aided by database research in areas such as fact checking and hypothesis finding [4]. The PANDA project [1] aims to provide centralized dataset storage and search functionality within newsrooms. DataPress makes it easier to embed structured data and visualizations into blog posts [2]. MuckRaker approaches the journalism-data interface from a different perspective: it seeks to aid news consumers in situations where the journalism process has left them with an incomplete picture of the world. It can also help journalists and editors preempt this problem by helping them find contextualizing datasets and visualizations.

A key question in designing the MuckRaker experience is whether the interface we are designing is lightweight enough, or whether we are asking too much from any one user. In exchange for the context behind an article, we believe users are willing to answer a few small questions, mostly through point-and-click interfaces. If it turns out that we are asking too much from each user, however, we can design interfaces that load-balance the data integration, extraction, and visualization tasks across users, especially in scenarios where multiple users are reading the same article.

In presenting MuckRaker, we hope to bridge the gap between end-users and deep data exploration. We hope that the database community is excited to improve on its algorithms with help from the average news consumer.

5. REFERENCES
[1] The PANDA project, August 2012. http://pandaproject.net/.
[2] E. Benson, A. Marcus, F. Howahl, and D. Karger. Talking about data: Sharing richly structured information through blogs and wikis. In ISWC, 2010.
[3] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. WebTables: exploring the power of tables on the web. Proc. VLDB Endow., 2008.
[4] S. Cohen, C. Li, J. Yang, and C. Yu. Computational journalism: A call to arms to database researchers. In CIDR, 2011.
[5] A. Das Sarma, L. Fang, N. Gupta, A. Halevy, H. Lee, F. Wu, R. Xin, and C. Yu. Finding related tables. In SIGMOD, 2012.
[6] O. Etzioni, M. Cafarella, D. Downey, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell., 165(1):91–134, June 2005.
[7] H. Gonzalez, A. Y. Halevy, C. S. Jensen, A. Langen, J. Madhavan, R. Shapley, W. Shen, and J. Goldberg-Kidon. Google fusion tables: web-centered data management and collaboration. In SIGMOD, 2010.
[8] F. B. Viégas, M. Wattenberg, F. van Ham, J. Kriss, and M. M. McKeon. ManyEyes: a site for visualization at internet scale. IEEE Trans. Vis. Comput. Graph., 2007.
[9] A. Yates, M. Cafarella, M. Banko, O. Etzioni, M. Broadhead, and S. Soderland. TextRunner: open information extraction on the web. In ACL, 2007.