Exploring Datasets via Cell-Centric Indexing
Jeff Heflin1 , Brian D. Davison1 and Haiyan Jia2
1 Computer Science & Engineering, Lehigh University, 113 Research Dr., Bethlehem, PA, 18015, USA
2 Journalism and Communication, Lehigh University, 33 Coppee Dr., Bethlehem, PA, 18015, USA


Abstract
We present a novel approach to dataset search and exploration. Cell-centric indexing is a unique indexing strategy that enables a powerful, new interface. The strategy treats individual cells of a table as the indexed unit, and combining this with a number of structure-specific fields enables queries that cannot be answered by a traditional indexing approach. Our interface provides users with an overview of a dataset repository, and allows them to efficiently use various facets to explore the collection and identify datasets that match their interests.

Keywords
cell-centric indexing, dataset search, exploratory interface



DESIRES 2021 – 2nd International Conference on Design of Experimental Search & Information REtrieval Systems, September 15–18, 2021, Padua, Italy
heflin@cse.lehigh.edu (J. Heflin); davison@cse.lehigh.edu (B. D. Davison); haiyan.jia@lehigh.edu (H. Jia)
http://www.cse.lehigh.edu/~heflin/ (J. Heflin); http://www.cse.lehigh.edu/~brian/ (B. D. Davison); https://journalism.cas.lehigh.edu/content/haiyan-jia (H. Jia)
ORCID: 0000-0002-7290-1495 (J. Heflin); 0000-0002-9326-3648 (B. D. Davison); 0000-0002-8388-7860 (H. Jia)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


1. Introduction

The twenty-first century has experienced an information explosion; data is growing exponentially and users' information retrieval needs are becoming much more complicated [1]. Given people's increasing interest in datasets, there is a need for user-friendly search services for data journalists, scientists, decision makers, and the general public to locate datasets that can meet their data needs.

Even though users, under many circumstances, are not experts in the domain in which they search, they should be able to easily use such an application; the query process should be responsive and efficient. The result should provide a general picture of what the dataset is about, and offer enough information for the searcher to know how likely the dataset is to contain the data they are looking for.

Traditional database management systems group data by tables and then organize this data into rows and columns. When users are aware of the database schema, they can construct queries, but what if users are simply trying to find which tables in a large data lake are relevant to their needs? One approach is to simply index information about the table in various fields: e.g., title, description, columns, etc. While this approach may be sufficient for some queries, in many cases the user will not be able to determine if the table is a perfect match until they have downloaded the (potentially large) table.

Studies have shown that understanding what is inside the content of a dataset, rather than simply the dataset descriptions and metadata, could be critical for users' evaluation of whether any of the search results sufficiently matches the search need, especially for non-expert users. For instance, a recent user study [2] has revealed that query refinement, as a result of unsatisfying search results, is negatively associated with user experience with dataset search tools. What reduces the need for query refinement is a preview of the dataset content, which helps users gauge the relevance of the datasets. Similarly, an experimental study that explores novel dataset search engine prototypes has found that interfaces with a content preview feature were perceived as more usable. In particular, non-expert users reported greater benefits from the content preview than expert users, as they rated the interfaces with higher levels of usefulness, ease of use, usability, and technology adoption intention [3]. These findings indicate a strong need for understanding the actual content of datasets, even at the cell level.

To enable sufficient query refinement for schema-optional queries, we present the novel concept of cell-centric indexing. The key idea is that we use individual cells of a table as the fundamental unit and build inverted indices on these cells. These indices provide different fields that index both the content of the cell and its context. For our purposes, the context includes other cell values in the same row, the name of the column (if available), and metadata about the containing dataset. This approach allows us to refine our search by row descriptors, column descriptors, or both at the same time. In essence we free the data from how it is structured, and schema information, when available, is merely one of the many ways to locate the data of interest. Thus, we take the view that fundamentally, users are searching for specific data (i.e., particular cells or collections thereof), and
the tables are merely artifacts of how the data is stored.

We recognize that this approach also has downsides. In particular, an index of cells (and their contexts) will incur substantial storage overhead in comparison to an index of dataset metadata. Moreover, if the desired search result is one or more datasets, at run-time there will be additional processing to assemble the cell-specific results to enable retrieval and ranking at that level of granularity. However, our cell-centric approach gives us some additional flexibility, and we believe that good system design, appropriate data structures, and efficient algorithms can ameliorate the costs.

This paper incorporates material previously presented in a poster [4] and a workshop paper [5]. The contributions of this paper are:

• We propose cell-centric indexing as an innovative approach to an information retrieval system. A cell-centric index enables a user to find data without having to know the pre-existing structure of each table;
• We describe the mechanisms of one implementation of a cell-centric dataset search engine. We describe the structure and method of data storage and querying of our server; and,
• We describe a novel prototype interface that leverages cell-centric indexing in order to give users summaries of a dataset repository in terms of titles, content, and column names. The user can filter on any of these facets to generate more specific summaries.

The rest of the paper is organized as follows: we first discuss related work, briefly describe the idea of cell-centric indexing and its advantages and disadvantages, introduce the structure of our server and the methodology involved in querying, and finally describe a prototype interface.

2. Related Work

Scholars have investigated exploratory search to help searchers succeed in an unfamiliar area by proposing novel information retrieval algorithms and systems; some of them propose innovative user interfaces, while others try to predict the user's information need and to use the prediction to better facilitate the subsequent interaction. Chapman et al. [6] have reviewed different approaches to dataset search. Google's dataset search [7] is an example of a traditional approach to indexing web datasets: the system crawls the Web and indexes datasets that have metadata expressed in the schema.org (or a related) format. The only required properties are name and description.

Derthick et al. [8] describe a visual query language that dynamically links queries and visualizations. The system helps a user to locate information in a multi-object database, illustrates the complicated relationships between attributes of multiple objects, and assists the user to clearly express their information retrieval needs in their queries. Similarly, Yogev et al. [9] demonstrate an exploratory search approach for entity-relationship data, combining an expressive query language, exploratory search, and entity-relationship graph navigation. Their work enables people with little to no query language expertise to query rich entity-relationship data.

In the domain of web search, Koh et al. [10] devise a user interface that supports creativity in education and research. The model allows users to send their query to their desired commercial search engine or social platform in iterations. As the system goes through each iteration, it combines the text and image results into a composition space. Addressing a similar problem, Bozzon et al. [11] design an interactive user interface that employs exploratory search and Yahoo! Query Language (YQL) to empower users to iteratively investigate results across multiple sources.

A tag cloud is a common and useful visualization of data that represents relative importance or frequency via size. Some researchers have adapted this idea to visualize query results. Fan et al. [12] focus on designing an interactive user interface with image clouds. The interface enables users to comprehend their latent query intentions and direct the system to form their personalized image recommendations. Dunaiski et al. [13] design and evaluate a search engine that incorporates exploratory search to ease researchers' scouting for academic publications and citation data. Its user interface unites concept lattices and tag clouds to present the query result in a readable composition to promote further exploratory search. On the other hand, Zhang et al. focus their work on knowledge graph data [14]. They combine faceted browsing with contextual tag clouds to create a system that allows users to rapidly explore graphs with billions of edges by visualizing conditional dependencies between selected classes and other data. Although they don't use tag clouds, Singh et al. [15] also display conditional dependencies in their data outline tool. For a given pivot attribute and a set of automatically determined compare attributes, they show conditional values, grouped into clusters of interaction units.

Other scholars have investigated query languages and models. Ianina et al. [1] concentrate on developing an exploratory search system that lets the user pose long text queries while minimizing the risk of returning empty results, since the iterative "query–browse–refine" process [16] may be time-consuming and require expertise. Meanwhile, Ferré and Hermann [17] focus more on the query language, LISQL,
and they offer a search system that integrates LISQL and faceted search. The system helps users to build complex queries and enlightens users about their position in the data navigation process.

Yet another approach is to predict the user's search intent so that better search results can be presented. Peltonen et al. [18] utilize negative relevance feedback in an interactive intent model to direct the search. Negative relevance feedback predicts the most relevant keywords, which are later arranged in a radar graph, where the center denotes the user, to represent the user's intent. Likewise, Ruotsalo et al. [19] propose a similar intent radar model that predicts a user's next query in an interactive loop. The model uses reinforcement learning to control the exploration and exploitation of the results.

3. Cell-Centric Indexing

We define a table as 𝑇 = ⟨𝑙, 𝐻, 𝑉⟩ where 𝑙 is the label of the table, 𝐻 = ⟨ℎ1, ℎ2, ..., ℎ𝑛⟩ is a list of the column headers, and 𝑉 is an 𝑚 × 𝑛 matrix of the values contained in the table. 𝑉𝑖,𝑗 refers to the value in the 𝑖-th row and the 𝑗-th column, which has the heading ℎ𝑗. We note that this model can be easily extended to include other metadata, as appropriate.

A naïve approach to indexing a collection of datasets would be to simply treat each table as a document, and have separate fields for the label, column headings, and (possibly) values. When terms are used consistently and the user is familiar with the terminology, this may work well. However, this approach has several weaknesses:

• Any query on values has lost the context of what column the value appears in and what identifying information might be present elsewhere in the same row. For example, a table that contains capitals like (Paris, France) and (Austin, Texas) is unlikely to be relevant to a query about “Paris Texas” but would otherwise match.
• It is difficult to determine which new terms can be used to refine the query. Users would need to download some of the datasets and choose distinctive terms from the most relevant ones.
• A user's constraint could be represented in different tables in very different ways. If the user is looking for “California Housing Prices”, there may be a table with some variant of that name, there may be a “Real Estate Prices” table with rows specific to California, or there may be a “Housing Prices” table that has a column for each state, including California. A user should be able to explore the collection to see how the data is organized and what terminology is used.

We have proposed cell-centric indexing as a novel way to address the problems above. Rather than treating the table as the indexed object, each datum (cell in the table) is an indexed object. In its simplest form, we propose four fields: content, the value of the cell; title, the label of the dataset the cell appears in; columnName, the header of the column the cell appears in; and rowContext, the values in all cells in the same row as the indexed cell. Formally, a cell value 𝑉𝑖,𝑗 from table 𝑇 = ⟨𝑙, 𝐻, 𝑉⟩ can be indexed with: content = 𝑉𝑖,𝑗, title = 𝑙, columnName = ℎ𝑗, and rowContext = 𝑉𝑖,1 ∪ 𝑉𝑖,2 ∪ · · · ∪ 𝑉𝑖,𝑛. This index would allow users to find all cells that have a column header token in common regardless of dataset, or all cells that appear in the same row as some identifying token, or look for the occurrence of specific values in specific columns.
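To make this concrete, the following minimal sketch (illustrative only, not the system's actual code; the example table, title, and values are hypothetical) shows how a single cell 𝑉𝑖,𝑗 could be mapped to a document with these four fields.

```python
# Minimal sketch of cell-centric indexing for one cell V[i][j] of a table
# T = (label l, headers H, values V); illustrative only, not the authors' code.
def cell_document(label, headers, values, i, j):
    return {
        "content": values[i][j],             # the cell value itself
        "title": label,                      # l: label of the containing dataset
        "columnName": headers[j],            # h_j: header of the cell's column
        "rowContext": " ".join(values[i]),   # all values in row i
    }

# Hypothetical example: the cell "Kenya" keeps its row and column context.
doc = cell_document("2004 Summer Olympics medal table",
                    ["Rank", "Nation", "Gold"],
                    [["71", "Kenya", "1"]], i=0, j=1)
# {'content': 'Kenya', 'title': '2004 Summer Olympics medal table',
#  'columnName': 'Nation', 'rowContext': '71 Kenya 1'}
```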
However, in this form, users still need to know which keywords to use and which fields to use them in. A cell-centric index alone is not helpful to a user who is not already familiar with the collection of datasets. In order to support the user in exploring the data, we propose the abstraction of conditional frequency vectors (CFVs). Let 𝐼 be a set of items, 𝐷 be a set of descriptors (e.g., tags that describe the items), and 𝐹 ⊆ 𝐼 × 𝐷 be a set of item and descriptor pairs ⟨𝑥𝑖, 𝑑𝑖⟩. Let 𝑄 be a query, where 𝑄(𝐹) ⊆ 𝐹 represents the pairs for only those items that match 𝑄. Then a CFV for 𝑄 and 𝐹 is a set of descriptor-frequency pairs where the frequency is the number of times that the corresponding descriptor occurs within 𝑄(𝐹): {⟨𝑑, 𝑓⟩ | 𝑓 = #{⟨𝑥, 𝑑⟩ | ⟨𝑥, 𝑑⟩ ∈ 𝑄(𝐹)}}. For cell-centric indexing, the items 𝐼 are the set of all cells regardless of source dataset, and 𝐹𝑖 pairs cells with terms from the 𝑖-th field. For example, if a cell 𝑐5 was in a column titled “Real Estate Price,” then 𝐹columnName includes the pairs ⟨𝑐5, real⟩, ⟨𝑐5, estate⟩, and ⟨𝑐5, price⟩. Typically, we sort the CFV in descending order of frequency.
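The CFV abstraction itself is small enough to sketch directly. The snippet below (with illustrative names and data, not drawn from the system) counts descriptor frequencies for the items matching a query and sorts them in descending order.

```python
from collections import Counter

# Sketch of a conditional frequency vector (CFV): given item/descriptor pairs F
# and a predicate Q over items, count descriptors of matching items and sort by
# descending frequency. Names and data are illustrative only.
def conditional_frequency_vector(F, Q):
    counts = Counter(d for x, d in F if Q(x))
    return counts.most_common()              # [(descriptor, frequency), ...]

# e.g., F_columnName pairs cell ids with tokens from their column headers
F_columnName = [("c5", "real"), ("c5", "estate"), ("c5", "price"),
                ("c9", "price"), ("c9", "housing")]
print(conditional_frequency_vector(F_columnName, lambda cell: True))
# [('price', 2), ('real', 1), ('estate', 1), ('housing', 1)]
```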
4. System Architecture

The architecture of the system is depicted in Figure 1. At the core of our system is an Elasticsearch server. Elasticsearch [20] is a scalable, distributed search engine that also supports complex analytics. Our system has two main functions: 1) parse collections of datasets, map them into the fields of a cell-centric index, and send indexing requests to Elasticsearch; and 2) given a user query, issue a series of queries to Elasticsearch and construct histograms (CFVs) for each field. The Query Processor translates our high-level query API into specific Elasticsearch queries and assembles the results into CFVs.

4.1. Index Definition

In Elasticsearch, a mapping defines how a document will be indexed: what fields will be used and how they will be processed. In cell-centric indexing the cell is the
document, and our index must have fields that describe cells. Our mapping is summarized in Table 1. In addition to the four fields mentioned in Section 3, we have fields for the fullTitle (used to identify which specific datasets match the query) and metadata such as tags, notes, organization, and setId. The setId allows us to distinguish between different datasets with the same title, and to get an accurate count of how many datasets match a query. Note that content is divided into two fields: content and contentNumeric, for reasons that will be described below. For each field, we give its type and, if applicable, the analyzer used to process text from the field.

Figure 1: High-level system architecture

We use three types of fields: text, keyword, and double. Text type fields are tokenized and processed by word analyzers, whereas keyword type fields are indexed as is (without tokenization or further processing). Double type fields are used to store 64-bit floating point numbers. Most of our fields are text fields, but contentNumeric is a double field, which allows it to store both integer and real numeric values, and both fullTitle and setid are keyword fields, since we want users to be able to view the complete name of the dataset in the result, and there is no need to parse setIds.

All text fields require an analyzer, which determines how to tokenize the field and whether any additional processing is required. We use two built-in Elasticsearch analyzers: the stop analyzer divides text at all non-letter characters and removes 33 stop words (such as “a”, “the”, “to”, etc.). For most fields, we use the stop analyzer, but we use the wordDelimiter analyzer for the columnName field. In addition to dividing text at all non-letter characters, it also divides text at letter case transitions (e.g., “birthDate” is tokenized to “birth” and “date”). This analyzer does not remove stop words.

    Field            Type      Analyzer
    columnName       text      wordDelimiter
    content          text      stop
    contentNumeric   double    N/A
    rowContext       text      stop
    title            text      stop
    fullTitle        keyword   N/A
    tags             text      stop
    notes            text      stop
    organization     text      stop
    setid            keyword   N/A

Table 1: Elasticsearch mappings used to implement cell-centric indexing
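The mapping in Table 1 could be declared roughly as follows. This is a sketch under assumptions: the wordDelimiter analyzer is shown as a custom analyzer built from Elasticsearch's word_delimiter token filter, since its exact definition is not given here, and the index name and client call are illustrative.

```python
from elasticsearch import Elasticsearch

# Sketch of an index definition matching Table 1; the custom "wordDelimiter"
# analyzer below is an assumption about how such an analyzer could be built.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "wordDelimiter": {                 # splits on non-letters and case
                    "type": "custom",              # changes; keeps stop words
                    "tokenizer": "whitespace",
                    "filter": ["word_delimiter", "lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "columnName":     {"type": "text", "analyzer": "wordDelimiter"},
            "content":        {"type": "text", "analyzer": "stop"},
            "contentNumeric": {"type": "double"},
            "rowContext":     {"type": "text", "analyzer": "stop"},
            "title":          {"type": "text", "analyzer": "stop"},
            "fullTitle":      {"type": "keyword"},
            "tags":           {"type": "text", "analyzer": "stop"},
            "notes":          {"type": "text", "analyzer": "stop"},
            "organization":   {"type": "text", "analyzer": "stop"},
            "setid":          {"type": "keyword"},
        }
    },
}

es = Elasticsearch("http://localhost:9200")              # adjust to your cluster
es.indices.create(index="cells", body=index_body)        # elasticsearch-py 7.x style
```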
4.2. Indexing a Dataset

The system loads each dataset using the following process (a code sketch follows the list):

1. Read the metadata, which can include title, tags, notes and organization. If the original table is formatted as CSV, then this data might be contained in a separate file in the same directory, or as a row in a repository index file. If the table is formatted using JSON, the metadata may be specified along with the content, and there may be many datasets described in a single file.
2. Read the column headings ⟨ℎ1, ℎ2, ..., ℎ𝑛⟩.
3. For each row in the dataset:
   a) Read the row values ⟨𝑣1, 𝑣2, ..., 𝑣𝑛⟩.
   b) Create rowContext by concatenating the values in the row. Note that, to avoid creating a different large context string for each value in the row, we create a single rowContext. This means that each value is also part of its own row context. This decision helps make the system more efficient. An additional efficiency consideration is that each value included in rowContext is truncated to its first 100 characters.
   c) Build an index request for each cell value 𝑣𝑖. If the content is numeric (integer or real), it will be indexed in the contentNumeric field; otherwise it is indexed in the content field. The columnName field will be indexed with the corresponding header ℎ𝑖. The title is indexed twice, once as a tokenized field that can be used in queries, and again as a keyword field that preserves the order of the title and can be used to precisely identify the dataset the cell originated from. All other metadata fields are indexed in a straightforward way.
   d) For efficiency, index requests are batched and sent to the server in bulk. Synchronization is disabled in Elasticsearch during bulk loading to avoid excessive delays.
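The loop in step 3 might look roughly like the sketch below; it assumes a plain CSV table whose metadata has already been read, and the file name, index name, and setId are hypothetical. It illustrates the process above rather than reproducing the system's code.

```python
import csv
from elasticsearch import Elasticsearch, helpers

# Illustrative sketch of steps 2-3 for a CSV table; field names follow Table 1,
# but the file layout, index name, and error handling are assumptions.
def cell_actions(csv_path, title, set_id):
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        headers = next(reader)                              # step 2: column headings
        for row in reader:                                  # step 3
            row_context = " ".join(v[:100] for v in row)    # step 3b: one truncated rowContext per row
            for header, value in zip(headers, row):         # step 3c: one request per cell
                doc = {"title": title, "fullTitle": title, "setid": set_id,
                       "columnName": header, "rowContext": row_context}
                try:
                    doc["contentNumeric"] = float(value)    # numeric cells
                except ValueError:
                    doc["content"] = value                  # textual cells
                yield {"_index": "cells", "_source": doc}

es = Elasticsearch("http://localhost:9200")
helpers.bulk(es, cell_actions("housing.csv", "California Housing Prices", "set-001"))  # step d
```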
Figure 2: Initial pre-query histograms

4.3. Querying the Index

Our Query Processor takes a conjunctive, fielded query and returns a histogram for each response field. The response fields are fields that contain information that helps the user understand the characteristics of cells that match the query. Currently, the response fields are title, columnName, content, rowContext, and fullTitle. Given a query 𝑞, the query process is:

1. Issue query 𝑞 requesting term aggregations for title, columnName, content, rowContext and fullTitle. Term aggregations are a feature of Elasticsearch that return a list of terms that appear in the selected documents, along with their frequencies, i.e., CFVs for 𝑞.
2. Calculate the min 𝑎 and max 𝑏 of the matching contentNumeric data.
3. Select a representative set 𝑁 of matching numeric content by issuing a percentile query against the contentNumeric field that excludes the top and bottom 5 percent of the data.
4. Calculate the mean 𝜇 and standard deviation 𝜎 of set 𝑁.
5. If 𝑎 < 𝜇 − 1.5𝜎 and 𝑏 > 𝜇 + 1.5𝜎 (i.e., the numeric data is not particularly skewed), build an aggregation query for contentNumeric data using ranges calculated from 𝜇 and 𝜎: the lowest range is [𝑎, 𝜇 − 1.5𝜎] and the highest is [𝜇 + 1.5𝜎, 𝑏], with 3 intermediate ranges of uniform size whose middle range is [𝜇 − 0.5𝜎, 𝜇 + 0.5𝜎]. If the data are skewed, the ranges are shifted appropriately.
6. Issue an Elasticsearch histogram aggregation query with the calculated ranges. Treat each range as a content term, and insert these terms and their frequencies into the content CFV.
7. Return CFVs for each response field.

Much of the processing above allows the system to dynamically determine buckets for numeric content that provide a useful picture of its distribution. Unlike textual terms, numeric terms exhibit greater variability. Histograms built using distinct numeric strings are unlikely to have significant value. For example, “135”, “135.0” and “1.35E+2” are all equivalent, while many users might consider “135.0001” to be close enough. To address this, we create ranges over numeric values. Our approach computes the mean and standard deviation over the middle 90% of the data, thus removing the influence of outliers, and then specifies the buckets to have a width of one standard deviation with one bucket centered over the mean. Once the histogram of numeric ranges is created, its data is merged with the content histogram to produce a single histogram that shows frequencies of textual terms and
numbers within ranges that depend both on the dataset and the query.
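To illustrate, the aggregation requests behind steps 1-6 might look roughly like the sketch below. It assumes the indexed text fields are configured to permit terms aggregations and that 𝜇 and 𝜎 have already been computed from the representative set 𝑁; the query shown (title=olympics) is just an example.

```python
# Sketch of the aggregation requests behind Sect. 4.3; field configuration
# details and the example query are assumptions, not the exact implementation.
RESPONSE_FIELDS = ["title", "columnName", "content", "rowContext", "fullTitle"]

query = {"bool": {"must": [{"match": {"title": "olympics"}}]}}       # conjunctive, fielded q

# Step 1: one terms aggregation (a CFV) per response field.
term_aggs = {"size": 0, "query": query,
             "aggs": {f: {"terms": {"field": f, "size": 25}} for f in RESPONSE_FIELDS}}

# Steps 2-3: min/max plus the bounds of the middle 90% of contentNumeric values.
numeric_stats = {"size": 0, "query": query,
                 "aggs": {"minmax": {"stats": {"field": "contentNumeric"}},
                          "middle": {"percentiles": {"field": "contentNumeric",
                                                     "percents": [5, 95]}}}}

# Steps 5-6: given mu and sigma, bucket contentNumeric into five ranges.
def numeric_range_agg(a, b, mu, sigma):
    edges = [a, mu - 1.5 * sigma, mu - 0.5 * sigma, mu + 0.5 * sigma, mu + 1.5 * sigma, b]
    ranges = [{"from": lo, "to": hi} for lo, hi in zip(edges, edges[1:])]
    return {"size": 0, "query": query,
            "aggs": {"buckets": {"range": {"field": "contentNumeric", "ranges": ranges}}}}
```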
Figure 3: Search results with query: title=olympics

5. Prototype User Interface

An example of a typical use case is demonstrated using Figures 2-4. In this specific case, the user wants to find a dataset containing data on Kenya's performance in the 2004 Athens Olympics. Initially, the user is presented with the graphs in Fig. 2. These histograms show the most frequent title and column terms in the collection of indexed datasets. However, the example histograms do not initially show anything regarding the Olympics. By using the “More Items” button at the bottom of the title histogram, the user can find the term Olympics and add it directly to their query. After this term is added, the screen changes to that shown in Fig. 3. The user can now look through all 4 histograms and decide which term best helps them get to their desired data. Once again using the “More Items” button, the term Athens can be found in the content histogram. Once this term is added, the user might turn to the full title histogram shown in Fig. 4. There the user can find a dataset titled “Kenya at the Olympics Medalists”. To gain access to the dataset, the user must add the full title to their query. Once a full title is in the query, a button appears that performs a Google query of the full title. Since this specific dataset is from WikiTables, the Google query will provide a link to the Wikipedia page containing the table. We now discuss specific interface components in more detail.

5.1. Pre-query Histograms

Before any search parameters are set, the user is shown two pre-query histograms that return up to the 50 most frequent title and column name tokens within the current repository (see Fig. 2). Column name and title histograms provide a good overview and are vital in allowing the user to explore the datasets without prior knowledge of their contents. The pre-query histograms are presented to the user when there are no active queries, such as when
the page is initially loaded or when all queries have been deleted. Clicking on a histogram bar will automatically add the corresponding term to the query and generate the standard set of histograms.

5.2. Results Histograms

The standard screen displays the user's current query and five histograms. Each histogram is associated with a field, and tokens are sorted in descending frequency of co-occurrence with the query. The length of a bar indicates how many cells match the query. As with the pre-query histograms, clicking on a bar adds the associated term to the query and generates a new result histogram. Below each histogram is an option to show more results for that histogram. Initially, each histogram presents the top 10 results; however, the top 25 results are pre-fetched, which allows newly requested results to be added to the histogram automatically.

Due to the connection between the count of matched cells and bar length, there is the possibility that the first bar will be significantly larger than all remaining bars, making them difficult to see or select. To combat this, we compare the counts of the two most frequent results. If the first result contains 10 times more hits than the second most frequent, we change the scale of the histograms to logarithmic, thus making it easier to visualize distinctions in skewed distributions.
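A minimal sketch of this scaling heuristic (the function name and pixel width are illustrative, not taken from the prototype):

```python
import math

# Switch to a logarithmic scale when the top bar would dwarf the others
# (count ratio of at least 10x), as described above. Names are illustrative.
def bar_lengths(counts, max_px=300):
    use_log = len(counts) > 1 and counts[0] >= 10 * counts[1]
    scale = (lambda c: math.log10(c + 1)) if use_log else (lambda c: float(c))
    longest = scale(counts[0]) or 1.0
    return [max_px * scale(c) / longest for c in counts]

print(bar_lengths([5000, 120, 80]))   # log scale kicks in: bars stay distinguishable
print(bar_lengths([500, 300, 100]))   # linear scale: [300.0, 180.0, 60.0]
```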
Figure 3 shows the response of our prototype interface to the query with title=“olympics”. It displays a CFV for each field as a histogram; the longer the red bar, the more frequently that term co-occurs with the query. As we can see, 318 datasets contain matches, and after “olympics,” the most common title word is “summer.” The most frequently-occurring terms in the column names of matching cells are “RANK” and “attempts”. The content histogram combines terms with numeric ranges. In particular, the first, second, and fifth rows were all inserted by the numeric range processing (as described previously in Sect. 4.3). For this query, there are many cells with values between 0 and 4, and slightly fewer with values between 4 and 21. The next most common content values are the terms “olympics” and “summer.” Note that the figure does not show the histogram for full titles that corresponds to this query (but it is still part of the prototype interface). As discussed in the next paragraph, this histogram indicates how many matching cells are in each dataset.

The user can refine their query and create new histograms by clicking on any terms in the result. For example, if the user clicks on “athens” in the content histogram (after scrolling down), the system will display a new set of histograms summarizing the datasets that have “athens” in a content field and “olympics” in the title; in other words, the query will be title=“olympics” and content=“athens”. For this refinement, we show the Full Title histogram (see Fig. 4). In this histogram, the bars represent the number of cells in a dataset that match the user's query. The user can add this bar to the query to get specific information about the distribution of terms in the chosen dataset. Additionally, this enables the option to search for the dataset, which is accomplished using a Google query of the dataset's full title.¹ The user can continue to explore the dataset collection by adding terms to and removing terms from the query.

Figure 4: Results of Full Title Histogram with query: title=olympics, content=athens

6. Conclusion

We have proposed cell-centric indexing as an innovative approach to information retrieval over tabular datasets. Such indices support richer queries about tables that do not require the user to know the pre-existing structure of each table. They also provide the potential for new exploratory interfaces, and we describe one that gives users summaries of a dataset repository in terms of titles, content, and column names. The user can filter on any of these facets to generate more specific summaries. Future work will test the effectiveness of this novel approach in facilitating dataset search, especially amongst non-expert users.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. III-1816325. Lixuan Qiu and Drake Johnson contributed to early drafts of this paper. We thank Alex Johnson, dePaul Miller, Keith Register, and Xuewei Wang for contributions to the system implementation.

¹ Many of our dataset collections do not have a URL recorded, which is why we do not simply link to the dataset as a result.
References

[1] A. Ianina, L. Golitsyn, K. Vorontsov, Multi-objective topic modeling for exploratory search in tech news, in: A. Filchenkov, L. Pivovarova, J. Žižka (Eds.), Artificial Intelligence and Natural Language, Springer, 2017, pp. 181–193. Communications in Computer and Information Science, vol. 789.
[2] H. Borchart, Effects of content preview on query refinement in dataset search, Senior Project Report, Cognitive Science Program, Lehigh University, Bethlehem, PA, 2021.
[3] L. Miller, Facilitating dataset search of non-expert users through heuristic and systematic information processing, Honors Thesis, Cognitive Science Program, Lehigh University, Bethlehem, PA, 2020.
[4] D. Johnson, K. Register, B. D. Davison, J. Heflin, An exploratory interface for dataset repositories using cell-centric indexing, in: Proceedings of the 2020 IEEE International Conference on Big Data (IEEE BigData 2020), 2020, pp. 5716–5718. Poster paper.
[5] L. Qiu, H. Jia, B. D. Davison, J. Heflin, An architecture for cell-centric indexing of datasets, in: Proceedings of PROFILES'20: 7th International Workshop on Dataset PROFILing and Search, 2020, pp. 82–96. Held with ISWC 2020.
[6] A. Chapman, E. Simperl, L. Koesten, G. Konstantinidis, L.-D. Ibáñez, E. Kacprzak, P. Groth, Dataset search: a survey, The VLDB Journal 29 (2020) 251–272.
[7] N. Noy, M. Burgess, D. Brickley, Google dataset search: Building a search engine for datasets in an open Web ecosystem, in: Proceedings of The Web Conference, 2019, pp. 1365–1375.
[8] M. Derthick, J. Kolojejchick, S. F. Roth, An interactive visual query environment for exploring data, in: Proceedings of the 10th Annual ACM Symposium on User Interface Software and Technology, UIST '97, Association for Computing Machinery, New York, NY, USA, 1997, pp. 189–198. doi:10.1145/263407.263545.
[9] S. Yogev, H. Roitman, D. Carmel, N. Zwerdling, Towards expressive exploratory search over entity-relationship data, in: Proceedings of the 21st International Conference on World Wide Web, WWW '12 Companion, Association for Computing Machinery, New York, NY, USA, 2012, pp. 83–92. doi:10.1145/2187980.2187990.
[10] E. Koh, A. Kerne, R. Hill, Creativity support: Information discovery and exploratory search, in: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '07, Association for Computing Machinery, New York, NY, USA, 2007, pp. 895–896. doi:10.1145/1277741.1277963.
[11] A. Bozzon, M. Brambilla, S. Ceri, P. Fraternali, Liquid query: Multi-domain exploratory search on the web, in: Proceedings of the 19th International Conference on World Wide Web, WWW '10, Association for Computing Machinery, New York, NY, USA, 2010, pp. 161–170. doi:10.1145/1772690.1772708.
[12] J. Fan, D. A. Keim, Y. Gao, H. Luo, Z. Li, JustClick: Personalized image recommendation via exploratory search from large-scale Flickr images, IEEE Transactions on Circuits and Systems for Video Technology 19 (2008) 273–288.
[13] M. Dunaiski, G. J. Greene, B. Fischer, Exploratory search of academic publication and citation data using interactive tag cloud visualizations, Scientometrics 110 (2017) 1539–1571.
[14] X. Zhang, D. Song, S. Priya, J. Heflin, Infrastructure for efficient exploration of large scale linked data via contextual tag clouds, in: International Semantic Web Conference, Springer, 2013, pp. 687–702.
[15] M. Singh, M. J. Cafarella, H. V. Jagadish, DBExplorer: Exploratory search in databases, in: E. Pitoura, S. Maabout, G. Koutrika, A. Marian, L. Tanca, I. Manolescu, K. Stefanidis (Eds.), Proceedings of the 19th International Conference on Extending Database Technology, EDBT 2016, Bordeaux, France, March 15–16, 2016, OpenProceedings.org, 2016, pp. 89–100. doi:10.5441/002/edbt.2016.11.
[16] R. W. White, R. A. Roth, Exploratory Search: Beyond the Query-Response Paradigm, Synthesis Lectures on Information Concepts, Retrieval, and Services, Morgan & Claypool Publishers, 2009. doi:10.2200/S00174ED1V01Y200901ICR003.
[17] S. Ferré, A. Hermann, Semantic search: Reconciling expressive querying and exploratory search, in: International Semantic Web Conference, Springer, 2011, pp. 177–192.
[18] J. Peltonen, J. Strahl, P. Floréen, Negative relevance feedback for exploratory search with visual interactive intent modeling, in: Proceedings of the 22nd International Conference on Intelligent User Interfaces, IUI '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 149–159. doi:10.1145/3025171.3025222.
[19] T. Ruotsalo, J. Peltonen, M. J. A. Eugster, D. Głowacka, P. Floréen, P. Myllymäki, G. Jacucci, S. Kaski, Interactive intent modeling for exploratory search, ACM Trans. Inf. Syst. 36 (2018). doi:10.1145/3231593.
[20] C. Gormley, Z. Tong, Elasticsearch: the definitive guide: a distributed real-time search and analytics engine, O'Reilly Media, Inc., 2015.