                      An Architecture for
                Cell-Centric Indexing of Datasets

         Lixuan Qiu¹,², Haiyan Jia³, Brian D. Davison², and Jeff Heflin²
           ¹ Dept. of Computer Science, University of Southern California
           ² Dept. of Computer Science and Engineering, Lehigh University
           ³ Dept. of Journalism and Communication, Lehigh University
                    lixuanqi@usc.edu, {haj616,bdd3,jeh3}@lehigh.edu

    Copyright © 2020 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).


        Abstract. Increasingly, large collections of datasets are made available
        to the public via the Web, ranging from government-curated datasets
        like those of data.gov to communally-sourced datasets such as Wikipedia
        tables. It has become clear that traditional search techniques are insuf-
        ficient for such sources, especially when the user is unfamiliar with the
        terminology used by the creators of the relevant datasets. We propose
        to address this problem by elevating the datum to a first-class object
        that is indexed, thereby making it less dependent on how a dataset is
        structured. In a data table, a cell contains a value for a particular row as
        described by a particular column. In our cell-centric indexing approach,
        we index the metadata of each cell, so that information about its dataset
        and column simply become metadata rather than constraining concepts.
        In this paper we define cell-centric indexing and present a system ar-
        chitecture that supports its use in exploring datasets. We describe how
        cell-centric indexing can be implemented using traditional information
        retrieval technology and evaluate the scalability of the architecture.


1     Introduction

  The twenty-first century has experienced an information explosion; data is
growing exponentially and users’ information retrieval needs are becoming much
more complicated [7]. Given people’s increasing interests in datasets, there is
a need for user-friendly search services for data journalists, scientists, decision
makers, and the general public to locate datasets that can help them answer
their questions.
    Even though users, under many circumstances, are not experts in the domain
in which they search, they should be able to easily use such an application; the
query process should be responsive and efficient, the result should provide a
general picture of what the dataset is about, and offer enough information for
the users to know how likely the dataset will contain data that they need.
    We present the novel concept of cell-centric indexing as the solution for data
storage and querying. Traditional database management systems group data by
tables and then organize this data into rows and columns. Even column-oriented
databases still assume this model, although data is stored by column instead of
by row as in a relational database management system. These approaches are
suitable when the users want to construct queries based on their knowledge of the
schema, but what if they want to find out how specific data is stored? To enable
flexible, schema-optional queries, we build inverted indices using individual cells
as the fundamental unit.
    These indices provide different fields that index both the content of the cell
and its context. For our purposes, the context includes other cell values in the
same row, the name of the column (if available), and metadata about the con-
taining dataset. This approach allows us to refine our search by row descriptors,
column descriptors or both at the same time. In essence we free the data from
how it is structured, and schema information, when available, is merely one of
the many ways to locate the data of interest. Thus, we take the view that funda-
mentally, users are searching for specific data (i.e., particular cells or collections
thereof), and the tables are merely artifacts of how the data is stored.
    We recognize that this approach also has downsides. In particular, an index
of cells (and their contexts) will incur substantial storage overhead in compar-
ison to an index of datasets. Moreover, if the desired search result is one or
more datasets, at run-time there will be additional processing to assemble the
cell-specific results to enable the retrieval and ranking at that level of granular-
ity. However, our cell-centric approach gives us some additional flexibility and
we believe that good system design, appropriate data structures, and efficient
algorithms can ameliorate the costs.
    The contributions of this paper are: (1) We propose cell-centric indexing as
an innovative approach to an information retrieval system. A cell-centric index
enables a user to find data without having to know the pre-existing structure of
each table; (2) We introduce the mechanisms of our dataset search engine. We
describe the structure and method of data storage and querying of our server;
(3) We examine the efficiency of indexing and querying datasets.
    The rest of the paper is organized as follows: we first discuss related work,
briefly describe the idea of cell centric indexing and its advantages and disad-
vantages, introduce the structure of our server and the methodology involved
in querying, and finally provide experimental results about building the indices
and querying them.


2    Related Work

Scholars have investigated exploratory search to help searchers succeed in an
unfamiliar area by proposing novel information retrieval algorithms and systems;
some of them propose innovative user interfaces, while others try to predict
the user’s information need and to use the prediction to better facilitate the
subsequent interaction.
   Derthick et al. [3] describe a visual query language that dynamically links
queries and visualizations. The system helps a user to locate information in
a multi-object database, illustrates the complicated relationship between at-
tributes of multiple objects, and assists the user to clearly express their infor-
mation retrieval needs in their queries. Similarly, Yogev et al. [13] demonstrate
an exploratory search approach for entity-relationship data. The approach com-
bines an expressive query language, exploratory search, and entity-relationship
graph navigation. Their work enables people with little to no query language
expertise to query rich entity-relationship data.
    In the domain of web search, Koh et al. [8] devise a user interface that sup-
ports creativity in education and research. The model allows users to send their
query to their desired commercial search engine or social platform in iterations.
As the system goes through each iteration, it will combine the text and image
results into a composition space. Addressing a similar problem, Bozzon et al. [1]
design an interactive user interface that employs exploratory search and Yahoo!
Query Language (YQL) to empower users to iteratively investigate results across
multiple sources.
    A tag cloud is a common and useful visualization of data that represents
relative importance or frequency via size. Some researchers have adapted this
idea to visualize query results. Fan et al. [5] focus on designing an interactive
user interface with image clouds. The interface enables users to comprehend
their latent query intentions and direct the system to form their personalized
image recommendations. Dunaiski et al. [4] design and evaluate a search engine
that incorporates exploratory search to ease researchers’ scouting for academic
publications and citation data. Its user interface unites concept lattices and tag
clouds to present the query result in a readable composition to promote further
exploratory search. On the other hand, Zhang et al. focus their work on graph
data [14]. They combine faceted browsing with contextual tag clouds to create
a system that allows users to rapidly explore knowledge graphs with billions
of edges by visualizing conditional dependencies between selected classes and
other data. Although they don’t use tag clouds, Singh et al. [11] also display
conditional dependencies in their data outline tool. For a given pivot attribute
and set of automatically determined compare attributes, they show conditional
values, grouped into clusters of interaction units.
    In addition, scholars investigate query languages and models. Ianina et al. [7]
concentrate on developing an exploratory search system that facilitates the user
having a way to conduct long text queries, while minimizing the risk of return-
ing empty results, since the iterative “query–browse–refine” process [12] may
be time-consuming and require expertise. Meanwhile, Ferré and Hermann [6]
focus more on the query language, LISQL, and they offer a search system that
integrates LISQL and faceted search. The system helps users to build complex
queries and enlightens users about their position in the data navigation process.
    Other researchers try to think ahead of the user: i.e., predict the user’s search
intent so that they can better direct the search result. Peltonen et al. [9] utilize
negative relevance feedback in an interactive intent model to direct the search.
Negative relevance feedback predicts the most relevant keywords, which are later
arranged in a radar graph where the center denotes the user, to represent the
user’s intent. Likewise, Ruotsalo et al. [10] propose a similar intent radar model
that predicts a user’s next query in an interactive loop. The model uses rein-
forcement learning to control the exploration and exploitation of the results.


3   Cell-Centric Indexing

We define a table as T = ⟨l, H, V⟩ where l is the label of the table, H =
⟨h_1, h_2, ..., h_n⟩ is a list of the column headers, and V is an m × n matrix of
the values contained in the table. V_{i,j} refers to the value in the i-th row and the
j-th column, which has the heading h_j.
     Inverted indices are an important tool in information retrieval. An inverted
index consists of a lexicon and a posting list for each term. The lexicon is simply
an index of the unique terms contained in the document collection. The posting
list for a term w is a sorted list of documents (in reality, document identifiers)
that contain w.
     Given a conjunctive query Q = {q_1, q_2, ...}, the set of matching documents
can be found by simply intersecting the posting lists for each query term q_i. If
the lexicon is sorted, then a binary search can find each posting list in O(log n)
time, and since the posting lists are sorted, a parallel walk through these lists
can find the intersection in linear time. This allows for very fast response times,
even when many documents are indexed.
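
    To make the intersection step concrete, the sketch below evaluates a conjunctive
query by a parallel walk over sorted posting lists. The tiny in-memory index and all
names here are illustrative only, not part of our system.

# Minimal sketch of conjunctive query evaluation over an inverted index.
# The index maps each term to a sorted list of document identifiers.
index = {
    "athens":   [2, 5, 9, 14, 21],
    "olympics": [1, 2, 9, 13, 14, 30],
}

def intersect(a, b):
    # Parallel walk through two sorted posting lists in O(len(a) + len(b)) time.
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def conjunctive_query(terms):
    # Intersect the posting lists of all query terms, shortest list first.
    postings = sorted((index.get(t, []) for t in terms), key=len)
    result = postings[0]
    for p in postings[1:]:
        result = intersect(result, p)
    return result

print(conjunctive_query(["athens", "olympics"]))  # -> [2, 9, 14]
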
     Typically, an information retrieval system allows indexing in different fields.
For example, the index for a collection of documents might have one field for
title, another for subheadings, and another for body text. Among other things,
fields can be used to differentiate the importance of a word based on where
it appears: a word that appears in the title of a document is probably more
meaningful than one that only appears once in its body.
     A naïve approach to indexing a collection of datasets would be to simply
treat each table as a document, and have separate fields for the label, column
headings, and (possibly) values. When terms are used consistently and the user
is familiar with the terminology, this will often work well. However, this approach
has several weaknesses:
 1. Any query on values has lost context of what column the value appears in
    and what identifying information might be present elsewhere in the same
    row. For example, a table that contains capitals like (Paris, France) and
    (Austin, Texas) is unlikely to be relevant to a query about “Paris Texas”.
 2. It is difficult to determine which new terms can be used to refine the query.
    Users would need to download some of the datasets and choose distinctive
    terms from the most relevant ones.
 3. A user’s constraint could be represented in different tables in very different
    ways. If the user is looking for “California Housing Prices”, there may be a
    table with some variant of that name, there may be a “Real Estate Prices”
    table with rows specific to California, or there may be a “Housing Prices”
    table that has a column for each state, including California. A user should
    be able to explore the collection to see how the data is organized and what
    terminology is used.
    We have proposed cell-centric indexing as a novel way to address the problems
above. Rather than treating the table as the indexed object, each datum (cell
in the table) is an indexed object. In its simplest form, we propose four fields:
content: the value of the cell, title: the label of the dataset the cell appears in,
columnName: the header of the column the cell appears in, and rowContext:
the values in all cells in the same row as the indexed cell. Formally, a cell value
V_{i,j} from table T = ⟨l, H, V⟩ can be indexed with: content = V_{i,j}, title = l,
columnName = h_j, and rowContext = ∪_{k=1}^{n} V_{i,k}. This index would allow users to
find all cells that have a column header token in common regardless of dataset,
or all cells that appear in the same row as some identifying token, or look for
the occurrence of specific values in specific columns.
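
    As a concrete illustration of these fields, consider a hypothetical table titled
“European Capitals” with headers (city, country) and a row (Paris, France); the cell
holding “Paris” would be indexed as the following document (the table and its
values are made up for illustration):

# Illustrative cell-centric document for the cell "Paris" of a hypothetical
# table titled "European Capitals" with headers (city, country).
cell_document = {
    "content":    "Paris",              # the cell value V_{i,j}
    "title":      "European Capitals",  # label l of the containing table
    "columnName": "city",               # header h_j of the cell's column
    "rowContext": "Paris France",       # all values in row i, concatenated
}
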
    However, in this form, users still need to know which keywords to use and
which fields to use them in. A cell-centric index alone is not helpful to a user
who is not already familiar with the collection of datasets. In order to support
the user in exploring the data, we propose the abstraction conditional frequency
vectors (CFVs). Let I be a set of items, D be a set of descriptors (e.g., tags that
describe the items), and F ⊆ I × D be a set of item and descriptor pairs ⟨x_i, d_i⟩.
Let Q be a query, where Q(F) ⊆ F represents the pairs for only those items that
match Q. Then a CFV for Q and F is a set of descriptor-frequency pairs where
the frequency is the number of times that the corresponding descriptor occurs
within Q(F): {⟨d, f⟩ | f = #{⟨x, d⟩ | ⟨x, d⟩ ∈ Q(F)}}. For cell-centric indexing,
the items I are the set of all cells regardless of source dataset, and F_i pairs cells
with terms from the i-th field. Typically, we sort the CFV in terms of descending
frequency.
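
    Stated in code, a CFV is simply a frequency count over the descriptors of the
matching items. The sketch below computes one over an in-memory list of pairs; it
is purely illustrative, since the deployed system obtains CFVs from Elasticsearch
aggregations as described in Section 4.

from collections import Counter

def cfv(pairs, matches_query):
    # pairs: iterable of (item, descriptor); matches_query: item -> bool.
    # Returns descriptor-frequency pairs over the items matching the query,
    # sorted by descending frequency.
    counts = Counter(d for x, d in pairs if matches_query(x))
    return sorted(counts.items(), key=lambda df: -df[1])

# Example: descriptors are title tokens of the dataset each cell belongs to.
F_title = [("cell1", "olympics"), ("cell1", "medalists"),
           ("cell2", "olympics"), ("cell3", "list")]
print(cfv(F_title, lambda cell: cell in {"cell1", "cell2"}))
# -> [('olympics', 2), ('medalists', 1)]
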
    Figure 1 shows the response of our prototype interface to the query with ti-
tle=“olympics” and content=“athens”. It displays a CFV for each field as a his-
togram; the longer the red bar, the more frequently that term co-occurs with the
query. The CFV for the title field is ⟨⟨olympics, 32⟩, ⟨medalists, 21⟩, ⟨list, 14⟩, ...⟩.
As we can see, 15 datasets contain matches, most of which have “medalists” in
the title and “games” in the column name. The user can refine this query and
create new histograms by clicking on any terms in the result. For example, if the
user clicks on “kenya” in the title histogram, the system will display a new set
of histograms summarizing the datasets that have “athens” as a content field
and both “olympics” and “kenya” in the title. The user explores the dataset
collection by adding terms to and removing terms from the query.


4   System Architecture

The architecture of the system is depicted in Figure 2. At the core of our system
is an Elasticsearch server. Elasticsearch is a scalable, distributed search engine
that also supports complex analytics. Our system has two main functions: 1)
parse collections of datasets, map them into the fields of a cell-centric index,
and send indexing requests to Elasticsearch and 2) given a user query, issue a
series of queries to Elasticsearch and construct histograms (CFVs) for each field.




            Fig. 1: Results for the Query {title:olympics,content:athens}

4.1   Index Definition

In Elasticsearch, a mapping defines how a document will be indexed: what fields
will be used and how they will be processed. In cell-centric indexing the cell is
the document, and our index must have fields that describe cells. Our mapping
is summarized in Table 1. In addition to the four fields mentioned in Section 3,
we have fields for the fullTitle (used to identify which specific datasets match the
query) and metadata such as tags, notes, and organization. Note that content
is divided into two fields: content and contentNumeric, for reasons that will be
described below. For each field, we give its type and, if applicable, the analyzer
used to process text from the field.
    We use three types of fields: text, keyword, and double. Text type fields are
tokenized and processed by word analyzers, whereas keyword type fields are
indexed as is (without tokenization or further processing). Double type fields are
used to store 64-bit floating point numbers. Most of our fields are text fields, but
contentNumeric is a double field, which allows it to store both integer and real
numeric values, and fullTitle is a keyword field, since we want users to be able
to view the complete name of the dataset in the result.
    All text fields require an analyzer which determines how to tokenize the field
and if any additional processing is required. The stop analyzer divides text at
all non-letter characters and removes 33 stop words (such as “a”, “the”, “to”,
etc.). For most fields, we use the stop analyzer, but we use the wordDelimiter
analyzer for the columnName field. In addition to dividing text at all non-letter
characters, it also divides text at letter case transitions (e.g., “birthDate” is
tokenized to “birth” and “date”). This analyzer does not remove stop words.


         Fig. 2: Flowchart that shows the general work flow of the project


                            Field       Type     Analyzer
                       columnName        text wordDelimiter
                          content        text      stop
                     contentNumeric double         N/A
                        rowContext       text      stop
                            title        text      stop
                          fullTitle   keyword      N/A
                            tags         text      stop
                            notes        text      stop
                       organization      text      stop
      Tab. 1: Elasticsearch mappings used to implement cell-centric indexing
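
    The mapping in Table 1 can be created with a single index-creation request. The
sketch below uses the Python Elasticsearch client; the wordDelimiter analyzer
configuration, the fielddata flags (which permit terms aggregations over text
fields), and the index name are illustrative assumptions rather than the exact
configuration of our deployment.

# Sketch: creating the cell index with the mapping of Table 1.
# Note: Elasticsearch 6.x additionally nests "properties" under a document type.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

text_field = {"type": "text", "analyzer": "stop", "fielddata": True}
body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "wordDelimiter": {   # assumed: split on non-letters and case changes
                    "tokenizer": "letter",
                    "filter": ["word_delimiter", "lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "columnName":     {"type": "text", "analyzer": "wordDelimiter",
                               "fielddata": True},
            "content":        text_field,
            "contentNumeric": {"type": "double"},
            "rowContext":     text_field,
            "title":          text_field,
            "fullTitle":      {"type": "keyword"},
            "tags":           text_field,
            "notes":          text_field,
            "organization":   text_field,
        }
    },
}

es.indices.create(index="cells", body=body)
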

4.2   Indexing a Dataset

The system loads each dataset using the following process:

1. Read the metadata, which can include title, tags, notes and organization. If
   the original table is formatted as CSV, then this data might be contained in
   a separate file in the same directory, or as a row in a repository index file.
   If the table is formatted using JSON, the metadata may be specified along
   with the content, and there may be many datasets described in a single file.
2. Read the column headings of the data: ⟨h_1, h_2, ..., h_n⟩
3. For each row in the dataset:
   (a) Read the row values: ⟨v_1, v_2, ..., v_n⟩
   (b) Create rowContext by concatenating the values in the row. Note, to
        avoid creating different large context strings for each value in the row,
            we create a single rowContext. This means that each value is also part
            of its own row context. This decision helps make the system more effi-
            cient. An additional efficiency consideration is that each value included
            in rowContext is truncated to the first 100 characters.
        (c) Build an index request for each cell value vi . If the content is numeric
            (integer or real), it will be indexed in the contentNumeric field; otherwise
            it is indexed in the content field. The columnName field will be indexed
            with the corresponding header hi . The title is indexed twice, once as a
            tokenized field that can be used in queries, and again as a keyword field
            that preserves the order of the title and can be used to precisely identify
            the dataset the cell originated from. All other metadata fields are indexed
            in a straightforward way.
        (d) For efficiency, index requests are batched and sent to the server in bulk.
            Synchronization is disabled during bulk loading to avoid excessive delays.
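
    The sketch below illustrates steps (a)-(d) for a single CSV table, using the bulk
helper of the Python Elasticsearch client. The helper names, CSV handling, and
metadata values are illustrative; only the field names from Table 1 are fixed.

import csv
from elasticsearch import Elasticsearch, helpers

def is_numeric(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

def cell_actions(csv_path, title, full_title, metadata):
    # Generate one bulk action per cell, following steps (a)-(d) above.
    # (On Elasticsearch 6.x each action would also carry a document type.)
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        headers = next(reader)                       # step 2: column headings
        for row in reader:                           # step 3: each row
            # (b) one shared rowContext; each value truncated to 100 characters
            row_context = " ".join(v[:100] for v in row)
            for header, value in zip(headers, row):  # (c) one request per cell
                doc = {"columnName": header, "rowContext": row_context,
                       "title": title, "fullTitle": full_title, **metadata}
                if is_numeric(value):
                    doc["contentNumeric"] = float(value)
                else:
                    doc["content"] = value
                yield {"_index": "cells", "_source": doc}

es = Elasticsearch("http://localhost:9200")
# (d) the bulk helper batches the index requests and sends them in bulk
helpers.bulk(es, cell_actions("housing.csv", "housing prices",
                              "California Housing Prices",
                              {"organization": "example org"}))
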

4.3   Querying the Index
Algorithm 1 illustrates how a query is processed and how the results are com-
posed. We can get a CFV for a particular field and query using Elasticsearch’s
term aggregation feature. This feature returns a list of terms that appear in the
selected documents along with their frequency. Since we want users to be able to
more accurately locate data that matches their interests, each query generates
histograms for the title, columnName, content, rowContext, and fullTitle fields.
    Unlike textual terms, numeric terms exhibit greater variability. Histograms
built using distinct numeric strings are unlikely to have significant value. For
example, “135”, “135.0” and “1.35E+2” are all equivalent, while some users
might consider “135.0001” to be close enough. To address this, we create ranges
over numeric values. Elasticsearch supports a histogram aggregation which can
be used to build these distributions. The algorithm evaluated in this paper builds
10 buckets of equal size, although our team is currently experimenting with more
sophisticated algorithms, including one in which we remove outliers, compute the
mean and standard deviation over the remaining values, and then specify the
buckets to have a width of one standard deviation with one bucket centered over
the mean. Once the numeric content histogram is created, this is merged with
the content histogram.
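
    A minimal sketch of this experimental bucketing variant follows, assuming
outliers are trimmed to the 1st-99th percentile range (the exact outlier criterion is
still being explored); the result can be passed as the interval and offset of a
histogram aggregation.

import statistics

def bucket_spec(values):
    # Drop outliers (assumed: values outside the 1st-99th percentiles), then use
    # buckets one standard deviation wide with one bucket centered on the mean.
    ordered = sorted(values)
    lo = ordered[len(ordered) // 100]
    hi = ordered[(99 * len(ordered)) // 100]
    trimmed = [v for v in ordered if lo <= v <= hi]
    mean = statistics.mean(trimmed)
    width = statistics.stdev(trimmed)
    # Offsetting by (mean - width / 2) centers one bucket on the mean.
    return {"interval": width, "offset": mean - width / 2}
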

    Algorithm 1: Process of Query Execution
      Data: Query q
    1 Issue query q requesting term aggregations for title, columnName, content,
        rowContext and fullTitle;
    2 Build CFVs from the query response ;
    3 if results contain contentNumeric data then
    4     Reissue q requesting histogram aggregation on contentNumeric data ;
    5     Merge results with content CFV ;
    6   Return columnName, rowContext, title, content, and fullTitle CFVs ;
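
    The sketch below expresses Algorithm 1 with the Python Elasticsearch client.
The terms and histogram aggregations are standard Elasticsearch features; the
query form, the way the bucket interval is derived from a stats aggregation, and
the index name are illustrative assumptions.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
TEXT_FIELDS = ["title", "columnName", "content", "rowContext", "fullTitle"]

def execute_query(field_terms):
    # field_terms: list of (field, term) pairs, e.g. [("title", "olympics")].
    query = {"bool": {"must": [{"match": {f: t}} for f, t in field_terms]}}
    aggs = {f: {"terms": {"field": f, "size": 20}} for f in TEXT_FIELDS}
    aggs["numericStats"] = {"stats": {"field": "contentNumeric"}}
    resp = es.search(index="cells", body={"size": 0, "query": query, "aggs": aggs})

    # Lines 1-2: build one CFV per text field from the terms aggregations.
    cfvs = {f: [(b["key"], b["doc_count"])
                for b in resp["aggregations"][f]["buckets"]]
            for f in TEXT_FIELDS}

    # Lines 3-5: if matching cells include numbers, add range buckets.
    stats = resp["aggregations"]["numericStats"]
    if stats["count"] > 0:
        interval = max((stats["max"] - stats["min"]) / 10, 1e-9)  # ~10 buckets
        hist = {"ranges": {"histogram": {"field": "contentNumeric",
                                         "interval": interval}}}
        resp2 = es.search(index="cells",
                          body={"size": 0, "query": query, "aggs": hist})
        ranges = [(b["key"], b["doc_count"])
                  for b in resp2["aggregations"]["ranges"]["buckets"]]
        cfvs["content"] = sorted(cfvs["content"] + ranges,
                                 key=lambda df: -df[1])
    return cfvs  # line 6

print(execute_query([("title", "olympics"), ("content", "athens")]))
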

5     Experiments

In this paper, we evaluate whether the proposed system can be scaled to real-
world collections of datasets while providing acceptable response times to user
queries. All of our experiments are conducted using a Lehigh-hosted server. The
server has one Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz, with 8 cores (16
threads) and a total of 96GB RAM. It runs Java 11.0.7 and Elasticsearch 6.8.8.

5.1    Scalability of Loading Datasets

To evaluate the scalability of the system with respect to data loading, we load
two different collections of tables. The WikiTables corpus⁴ contains 1.6 million
tables extracted from Wikipedia. We created five different subsets of the data,
where each subsequent set contains all of the data in the previous set. The
data.gov corpus [2] contains 7,485 tables downloaded from data.gov, the open
data portal of the United States federal government. We created three subsets
of this data, where each subsequent subset contained the prior subset. In both
cases, we identify the subsets by size of the raw data files. Due to the variability
of length in the values of different datasets, the number of cells is not necessarily
proportional to the directory size. Also, the WikiTables corpus is stored in JSON,
which leads to larger sizes than the CSV files of the data.gov corpus.
     We loaded each subset and recorded the resulting index size in MB and the
time to load the data in minutes. Table 2 displays the results for the WikiTables
corpus. On average, the system can index 1 million table cells per minute and
the resulting index is 100 bytes per cell. Table 3 shows the same experiment
conducted on three different sizes of data.gov data. Here, the system only loads
∼650,000 cells per minute and index size is more variable, although the largest
set requires ∼110 bytes per cell. We hypothesize that the slower throughput of
data.gov is due to structural differences in the tables: data.gov tables are likely
to have more columns and longer values in those columns, leading to significantly
longer rowContext. However, due to the efficiencies of inverted indices, this cost
is borne in indexing time (due to longer indexing messages), and the index size
is not affected as much. Despite the difference in throughput, in both cases the
load time scales linearly with the number of cells, as depicted in Figure 3 and
Figure 4.


Fig. 3: Load time vs. number of cells for the WikiTables corpus

Fig. 4: Load time vs. number of cells for the data.gov corpus. For comparison, the
relevant range of WikiTables is also plotted.
     We note that the resulting Elasticsearch index is always larger than the
original input data. For WikiTables, this is about 2.5x larger, while for data.gov
it is 1.5x to 1.7x larger⁵. We have experimented with an alternative design that
uses two indices: one for indexing cell-specific information and another for in-
dexing table-specific information. The alternate design reduced index storage,
but increased overall query time. Given that disk space is relatively cheap, and
Elasticsearch makes it easy to add additional machines to scale up the size of a
cluster, we chose to optimize query speed.
⁴ http://websail-fe.cs.northwestern.edu/TabEL/tables.json.gz
⁵ For the smallest size, the difference is 6.3x larger, but we believe this is due to
  mandatory overhead.

     Directory Size (MB) Size of Index (MB) Num Cells Time (mins)
                         99.3                   249.0     2,432,826           2.4
                        500.3                 1,292.6 12,888,268             13.0
                        999.5                 2,614.3 26,133,840             26.6
                      2,002.0                 5,255.6 52,326,690             53.5
                      3,940.0                10,277.5 102,926,958          106.3
  Tab. 2: Result of uploading various sizes of folders of JSON files from WikiTables
       Directory Size (MB) Size of Index (MB) Num Cells Time (mins)
                           26.0                  163.3    1,835,238           2.9
                          466.3                  694.0    5,398,841           8.7
                          756.4                 1295.2 11,868,845            17.9
     Tab. 3: Result of uploading various sizes of folders of CSV files from data.gov
5.2     Extensive Query Simulation
To test the relationship between index size and query efficiency, we created three
data collections from the WikiTables corpus, with source files of size 0.8 GB, 2.4 GB,
and 4.0 GB, where each smaller collection is a subset of each larger one. Table 4
shows basic statistics for each resulting index.
              Source Folder Size (GB) Number of Cells in Index Index Size (GB)
                                  0.8               20,381,595             2.1
                                  2.4               61,602,873             6.2
                                  4.0              102,926,958            10.3
           Tab. 4: Statistics on WikiTable datasets used for query experiment
    In order to evaluate the system’s performance over a wide variety of pos-
sible queries, we designed a random query model. The first step of user query
simulation is to create seed lists containing query terms that we can use as the
“user’s” starting filter. To do so, we build four CFVs using the empty query: title,
columnName, content, and rowContext. Then we choose an appropriate range of
frequency to perform our test; high frequency seeds are not desired because they
do not sufficiently filter the results to be useful in exploration (in effect, they
are analogous to stop words in traditional search); low frequency seeds are not
suitable either, since we are simulating real users’ choices and they are less likely
to query for an uncommon word. Given these criteria, we chose to select terms
with a frequency between 10,000 and 50,000. Then we create four lists, one each
for seeds from title, columnName, rowContext, and content. Notably, content’s
seed list will only contain textual terms, since numerical terms in a dataset’s
content field are indexed separately by the contentNumeric field. We also gen-
erate a random-category seed list that has all the seeds from the lists above,
so that when we arbitrarily select seeds from this list, each seed’s category is
randomized. We expect that users will rarely query on a specific number, and for
many numbers, their frequency would be too small anyway, so only textual seeds
are included in the seed list. Although some numbers like years might appear in
many user queries, we feel this simplification keeps our set of test queries clean.
    In order to approximate real-world testing conditions, we designed a model of
typical user behavior: we randomly choose a query from
the seed list, execute the query, randomly select one of the resulting histograms,
randomly select one of the top 20 terms from the resulting histogram, append
the new field:term pair to the existing query and repeat. This process stops after
the query has five terms, or a resulting histogram has two or fewer terms.
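
    The sketch below illustrates this simulation loop, reusing the execute_query
helper from the previous sketch; the seed-list structure and the exact stopping
check are simplifications of the procedure described above.

import random

CATEGORIES = ["title", "columnName", "content", "rowContext"]

def simulate_user(seed_lists, start_category):
    # seed_lists: dict mapping a category to its list of seed terms.
    query = [(start_category, random.choice(seed_lists[start_category]))]
    while len(query) < 5:                          # stop at five terms
        cfvs = execute_query(query)
        field = random.choice(CATEGORIES)          # pick one result histogram
        top_terms = [t for t, _ in cfvs[field][:20]]
        if len(top_terms) <= 2:                    # histogram too small: stop
            break
        query.append((field, random.choice(top_terms)))
    return query
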
    We then analyze query performance by comparing different starting categories
on the same index, or the same starting category on three different index sizes
(0.8 GB, 2.4 GB, and 4.0 GB). In this process, timers are set at different
points to measure the time consumed. We run one thousand iterations of the
process, recording the queries and their times to a log file.
    Since the final query of each iteration has a maximum length of five terms,
we can derive at most five queries of different lengths from that original query.
To illustrate this, suppose that after an iteration the final query is
a&b&c&d&e, where each letter represents a one-term query, e.g., title:football. We
can obtain five different queries: a, a&b, a&b&c, a&b&c&d, and a&b&c&d&e.
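
    In code, this derivation simply takes every prefix of the final query (an
illustrative snippet):

def prefix_queries(final_query):
    # final_query: list of (field, term) pairs, e.g. the a&b&c&d&e query above.
    return [final_query[:k] for k in range(1, len(final_query) + 1)]
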
    Five thousand query processes have been performed on each index, one thou-
sand for each category (title, rowContext, content, columnName, and random).
Generally, one thousand one-term queries are guaranteed, but as queries become
longer, the number of queries of each length decreases.
Queries with more terms are more restrictive, leading to fewer results. Thus, the
likelihood of satisfying the stopping criteria of too few results becomes greater.
Table 5 displays the final number of queries based on the number of terms and
the starting category on the 4.0 GB index.
                column name content title row context random total
        1-term           1,000     1,000 1,000         1,000    1,000 5,000
        2-term             994       999 994           1,000      999 4,986
        3-term             911       975 930             976      953 4,745
        4-term             706       857 743             835      786 3,927
        5-term             508       662 499             607      597 2,873
         total           4,119     4,493 4,166         4,418    4,335 21,531
Tab. 5: Total number of queries of each length for each starting category on the 4.0
GB index

   Figure 5 summarizes the performance of the system, assuming the first query
term comes from the title field. Here we show the average query time (in millisec-
onds) for each index size for each length of query. As the queries grow in length,
execution time decreases. The gaps between 1-term and 2-term queries and between
2-term and 3-term queries are the largest, since it is much harder to further improve
efficiency once query times are already around the 50 ms level. It can also be observed
that, although queries on larger indexes take more time, as the size of the index
increases linearly, the time consumed grows sublinearly. For example, one-term
queries on the largest index are only 1.7x slower than those on the smallest,
even though this index is 5x larger. This trend holds regardless of the length of
queries or the choice of the query’s starting category. Figure 6 illustrates this
by showing a similar pattern when the starting term is selected from the four
categories at random. Together, these results imply that the system can scale to
much larger indices with only a small degradation in performance.




Fig. 5: Queries with first term from title field: average time (ms) grouped by query
length and index size




Fig. 6: Queries with first term from random field: average time (ms) grouped by query
length and index size
    Next, we investigate whether the type of the starting field has any significant
impact on query performance. Table 6 shows the average and standard deviation
of time consumed by the five different categories of one-term queries when issued
against the 4.0GB collection. For the last row, i.e. random category, the query
fields are randomized to give a more general sense of the overall performance of
the system. In addition to total query time, we break time down into textual
query time (generating the standard histograms for text fields) and numeric
query time (generating the numeric range histograms). Here, we see that title
queries have the greatest variation in overall time, and that for all fields except
content, the numeric range query processing can take a significant fraction of
the overall query time (30–50%). Note that content queries always have a numeric
query time of 0. This is because all of our queries are seeded with text terms and
numeric range processing is only computed when the filtered cells are numeric.
However, since the query term is textual content, there can be no numeric cells
in the result, and thus no numeric range processing is computed.

                        Textual Query     Numeric Query     Total Query
                          Time (ms)         Time (ms)        Time (ms)
           Category     Avg. Std. dev     Avg. Std. dev     Avg. Std. dev
              title      139      112       93        18     232      114
            content      147       27        0         0     147       27
         rowContext      122       21      121        23     243       36
         columnName      126       31      111        32     237       49
             random      157       86       87        47     244       92
Tab. 6: Average and standard deviation of time consumed by textual and numeric
query processing for one-term queries




Fig. 7: Queries from 4.0 GB collection: average time (ms) grouped by query length and
first term field
    Using the largest index, Figure 7 displays the average query times across
the five different field types on three different query lengths. For most types of
queries, the one-term query takes 230-250 ms. However, queries starting with a
content term take less time: only 146.9 ms on average. Recall from Table 6 that content
queries do not need numeric range processing, and are thus faster. Furthermore,
it can also be observed that queries starting with rowContext will often take more
time than other types of queries. These queries have the longest average times
for both three-term and five-term queries (100.8 ms and 80.6 ms, respectively).
We hypothesize this is because many rowContext terms have a large inverted
index; that is, they are associated with many cells.
    One oddity of Figure 7 is that for one-term queries, rowContext is actually
faster on average than for random queries. In order to understand if outliers play
a role in this observation, we create a box plot of the same data. From Figure 8,
we can see that certain types of queries have more variability in execution times
than others. In particular, one-term queries starting with a random category
have more outliers above the third quartile than rowContext one-term queries.
We also see that their median is slightly below the median of rowContext one-term
queries. Thus, it is these outliers that lead to the higher average query time.
Since both title queries and columnName queries have many outliers, we assume
that these are the types of queries that contribute the most to the
variability in the random query result. It can also be observed that, of over 20,000
random queries issued over the index, only three took longer than a second, and
only one of those was (slightly) over two seconds.




Fig. 8: Box plot of query times by query length and starting field. The boxes show the
first quartile, median and third quartile. Circles indicate outliers.

6    Conclusion
Cell-centric indexing is an innovative architecture that enables anyone to conduct
schema-less queries on datasets in an exploratory fashion. In this paper, we
discuss the scalability and query efficiency of our system. We were able to load
the entire 1.6 million table WikiTables corpus in 106 minutes and one-term
queries against this index averaged less than 250 ms each. Queries with more
terms took less time. In general, the server is able to parse and upload source
files in linear time; the index will be approximately 2.6 times as large as the size
of the source folder, which is an acceptable trade-off between inexpensive disk
space and running time.
     Although our test server is small and, as currently configured, cannot handle
a large number of concurrent requests, we believe the system can provide an easy
and efficient way of locating datasets of interest. Future work
includes system performance enhancements, more usability features and a thor-
ough user evaluation of the interface.

Acknowledgments
This material is based upon work supported by the National Science Foundation
under Grant No. III-1816325. Alex Johnson, Xuewei Wang, dePaul Miller, Drake
Johnson, and Keith Register contributed to the implementation of the system.

References
 1. Bozzon, A., Brambilla, M., Ceri, S., Fraternali, P.: Liquid query: Multi-domain
    exploratory search on the web. In: Proceedings of the 19th International Confer-
    ence on World Wide Web. p. 161–170. WWW ’10, Association for Computing
    Machinery, New York, NY, USA (2010). https://doi.org/10.1145/1772690.1772708
 2. Chen, Z., Jia, H., Heflin, J., Davison, B.D.: Generating schema labels through
    dataset content analysis. In: Companion Proceedings of the The Web Con-
    ference 2018. p. 1515–1522. WWW ’18, International World Wide Web Con-
    ferences Steering Committee, Republic and Canton of Geneva, CHE (2018).
    https://doi.org/10.1145/3184558.3191601
 3. Derthick, M., Kolojejchick, J., Roth, S.F.: An interactive visual query en-
    vironment for exploring data. In: Proceedings of the 10th Annual ACM
    Symposium on User Interface Software and Technology. p. 189–198. UIST
    ’97, Association for Computing Machinery, New York, NY, USA (1997).
    https://doi.org/10.1145/263407.263545
 4. Dunaiski, M., Greene, G.J., Fischer, B.: Exploratory search of academic publication
    and citation data using interactive tag cloud visualizations. Scientometrics 110(3),
    1539–1571 (2017)
 5. Fan, J., Keim, D.A., Gao, Y., Luo, H., Li, Z.: Justclick: Personalized image recom-
    mendation via exploratory search from large-scale flickr images. IEEE Transactions
    on Circuits and Systems for Video Technology 19(2), 273–288 (2008)
 6. Ferré, S., Hermann, A.: Semantic search: Reconciling expressive querying and ex-
    ploratory search. In: International Semantic Web Conference. pp. 177–192. Springer
    (2011)
 7. Ianina, A., Golitsyn, L., Vorontsov, K.: Multi-objective topic modeling for ex-
    ploratory search in tech news. In: Conference on Artificial Intelligence and Natural
    Language. pp. 181–193. Springer (2017)
 8. Koh, E., Kerne, A., Hill, R.: Creativity support: Information discovery and ex-
    ploratory search. In: Proceedings of the 30th Annual International ACM SIGIR
    Conference on Research and Development in Information Retrieval. p. 895–896.
    SIGIR ’07, Association for Computing Machinery, New York, NY, USA (2007).
    https://doi.org/10.1145/1277741.1277963
 9. Peltonen, J., Strahl, J., Floréen, P.: Negative relevance feedback for ex-
    ploratory search with visual interactive intent modeling. In: Proceedings of
    the 22nd International Conference on Intelligent User Interfaces. p. 149–159.
    IUI ’17, Association for Computing Machinery, New York, NY, USA (2017).
    https://doi.org/10.1145/3025171.3025222
10. Ruotsalo, T., Peltonen, J., Eugster, M.J.A., Glowacka, D., Floréen, P., Myllymäki,
    P., Jacucci, G., Kaski, S.: Interactive intent modeling for exploratory search. ACM
    Trans. Inf. Syst. 36(4) (Oct 2018). https://doi.org/10.1145/3231593
11. Singh, M., Cafarella, M.J., Jagadish, H.V.: Dbexplorer: Exploratory search in
    databases. In: Pitoura, E., Maabout, S., Koutrika, G., Marian, A., Tanca,
    L., Manolescu, I., Stefanidis, K. (eds.) Proceedings of the 19th Interna-
    tional Conference on Extending Database Technology, EDBT 2016, Bor-
    deaux, France, March 15-16, 2016. pp. 89–100. OpenProceedings.org (2016).
    https://doi.org/10.5441/002/edbt.2016.11
12. White, R.W., Roth, R.A.: Exploratory Search: Beyond the Query-Response
    Paradigm. Synthesis Lectures on Information Concepts, Retrieval, and Services,
    Morgan & Claypool Publishers (2009).
    https://doi.org/10.2200/S00174ED1V01Y200901ICR003
13. Yogev, S., Roitman, H., Carmel, D., Zwerdling, N.: Towards expressive ex-
    ploratory search over entity-relationship data. In: Proceedings of the 21st
    International Conference on World Wide Web. p. 83–92. WWW ’12 Com-
    panion, Association for Computing Machinery, New York, NY, USA (2012).
    https://doi.org/10.1145/2187980.2187990
14. Zhang, X., Song, D., Priya, S., Heflin, J.: Infrastructure for efficient exploration of
    large scale linked data via contextual tag clouds. In: International Semantic Web
    Conference. pp. 687–702. Springer (2013)