<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Architecture for Cell-Centric Indexing of Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lixuan Qiu</string-name>
          <email>lixuanqi@usc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haiyan Jia</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brian D. Davison</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Science and Engineering, Lehigh University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Computer Science, University of Southern California</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Dept. of Journalism and Communication, Lehigh University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Increasingly, large collections of datasets are made available to the public via the Web, ranging from government-curated datasets like those of data.gov to communally-sourced datasets such as Wikipedia tables. It has become clear that traditional search techniques are insufcient for such sources, especially when the user is unfamiliar with the terminology used by the creators of the relevant datasets. We propose to address this problem by elevating the datum to a rst-class object that is indexed, thereby making it less dependent on how a dataset is structured. In a data table, a cell contains a value for a particular row as described by a particular column. In our cell-centric indexing approach, we index the metadata of each cell, so that information about its dataset and column simply become metadata rather than constraining concepts. In this paper we de ne cell-centric indexing and present a system architecture that supports its use in exploring datasets. We describe how cell-centric indexing can be implemented using traditional information retrieval technology and evaluate the scalability of the architecture.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        The twenty- rst century has experienced an information explosion; data is
growing exponentially and users' information retrieval needs are becoming much
more complicated [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Given people's increasing interests in datasets, there is
a need for user-friendly search services for data journalists, scientists, decision
makers, and the general public to locate datasets that can help them answer
their questions.
      </p>
      <p>Even though users, under many circumstances, are not experts in the domain
in which they search, they should be able to easily use such an application; the
query process should be responsive and e cient, the result should provide a
general picture of what the dataset is about, and o er enough information for
the users to know how likely the dataset will contain data that they need.</p>
      <p>We present the novel concept of cell-centric indexing as the solution for data
storage and querying. Traditional database management systems group data by
Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
tables and then organize this data into rows and columns. Even column-oriented
databases still assume this model, although data is stored by column instead of
by row as in a relational database management system. These approaches are
suitable when the users want to construct queries based on their knowledge of the
schema, but what if they want to nd out how speci c data is stored? To enable
exible, schema-optional queries, we build inverted indices using individual cells
as the fundamental unit.</p>
      <p>These indices provide di erent elds that index both the content of the cell
and its context. For our purposes, the context includes other cell values in the
same row, the name of the column (if available), and metadata about the
containing dataset. This approach allows us to re ne our search by row descriptors,
column descriptors or both at the same time. In essence we free the data from
how it is structured, and schema information, when available, is merely one of
the many ways to locate the data of interest. Thus, we take the view that
fundamentally, users are searching for speci c data (i.e., particular cells or collections
thereof), and the tables are merely artifacts of how the data is stored.</p>
      <p>We recognize that this approach also has downsides. In particular, an index
of cells (and their contexts) will incur substantial storage overhead in
comparison to an index of datasets. Moreover, if the desired search result is one or
more datasets, at run-time there will be additional processing to assemble the
cell-speci c results to enable the retrieval and ranking at that level of
granularity. However, our cell-centric approach gives us some additional exibility and
we believe that good system design, appropriate data structures, and e cient
algorithms can ameliorate the costs.</p>
      <p>The contributions of this paper are: (1) We propose cell-centric indexing as
an innovative approach to an information retrieval system. A cell-centric index
enables a user to nd data without having to know the pre-existing structure of
each table; (2) We introduce the mechanisms of our dataset search engine. We
describe the structure and method of data storage and querying of our server;
(3) We examine the e ciency of indexing and querying datasets.</p>
      <p>The rest of the paper is organized as follows: we rst discuss related work,
brie y describe the idea of cell centric indexing and its advantages and
disadvantages, introduce the structure of our server and the methodology involved
in querying, and nally provide experimental results about building the indices
and querying them.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Scholars have investigated exploratory search to help searchers succeed in an
unfamiliar area by proposing novel information retrieval algorithms and systems;
some of them propose innovative user interfaces, while others try to predict
the user's information need and to use the prediction to better facilitate the
subsequent interaction.</p>
      <p>
        Derthick et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] describe a visual query language that dynamically links
queries and visualizations. The system helps a user to locate information in
a multi-object database, illustrates the complicated relationship between
attributes of multiple objects, and assists the user to clearly express their
information retrieval needs in their queries. Similarly, Yogev et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] demonstrate
an exploratory search approach for entity-relationship data. The approach
combines an expressive query language, exploratory search, and entity-relationship
graph navigation. Their work enables people with little to no query language
expertise to query rich entity-relationship data.
      </p>
      <p>
        In the domain of web search, Koh et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] devise a user interface that
supports creativity in education and research. The model allows users to send their
query to their desired commercial search engine or social platform in iterations.
As the system goes through each iteration, it will combine the text and image
results into a composition space. Addressing a similar problem, Bozzon et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
design an interactive user interface that employs exploratory search and Yahoo!
Query Language (YQL) to empower users to iteratively investigate results across
multiple sources.
      </p>
      <p>
        A tag cloud is a common and useful visualization of data that represents
relative importance or frequency via size. Some researchers have adapted this
idea to visualize query results. Fan et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] focus on designing an interactive
user interface with image clouds. The interface enables users to comprehend
their latent query intentions and direct the system to form their personalized
image recommendations. Dunaiski et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] design and evaluate a search engine
that incorporates exploratory search to ease researchers' scouting for academic
publications and citation data. Its user interface unites concept lattices and tag
clouds to present the query result in a readable composition to promote further
exploratory search. On the other hand, Zhang et al. focus their work on graph
data [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. They combine faceted browsing with contextual tag clouds to create
a system that allows users to rapidly explore knowledge graphs with billions
of edges by visualizing conditional dependencies between selected classes and
other data. Although they don't use tag clouds, Singh et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] also display
conditional dependencies in their data outline tool. For a given pivot attribute
and set of automatically determined compare attributes, they show conditional
values, grouped into clusters of interaction units.
      </p>
      <p>
        In addition, scholars investigate query languages and models. Ianina et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
concentrate on developing an exploratory search system that facilitates the user
having a way to conduct long text queries, while minimizing the risk of
returning empty results, since the iterative \query{browse{re ne" process [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] may
be time-consuming and require expertise. Meanwhile, Ferre and Hermann [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
focus more on the query language, LISQL, and they o er a search system that
integrates LISQL and faceted search. The system helps users to build complex
queries and enlightens users about their position in the data navigation process.
      </p>
      <p>
        Other researchers try to think ahead of the user: i.e., predict the user's search
intent so that they can better direct the search result. Peltonen et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] utilize
negative relevance feedback in an interactive intent model to direct the search.
Negative relevance feedback predicts the most relevant keywords, which are later
arranged in a radar graph where the center denotes the user, to represent the
user's intent. Likewise, Ruotsalo et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] propose a similar intent radar model
that predicts a user's next query in an interactive loop. The model uses
reinforcement learning to control the exploration and exploitation of the results.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Cell-Centric Indexing</title>
      <p>We de ne a table as T = hl; H; V i where l is the label of the table, H =
hh1; h2; :::; hni is a list of the column headers, and V is an m n matrix of
the values contained in the table. Vi;j refers to the value in the i-th row and the
j-th column, which has the heading hj .</p>
      <p>Inverted indices are an important tool in information retrieval. An inverted
index consists of a lexicon and a posting list for each term. The lexicon is simply
an index of the unique terms contained in the document collection. The posting
list for a term w is a sorted list of documents (in reality, document identi ers)
that contain w.</p>
      <p>Given a conjunctive query Q = fq1; q2; :::g, the set of matching documents
can be found by simply intersecting the posting lists for each query term qi. If
the lexicon is sorted, then a binary search can nd each posting list in O(log n)
time, and since the posting lists are sorted, a parallel walk through these lists
can nd the intersection in linear time. This allows for very fast response times,
even when many documents are indexed.</p>
      <p>Typically, an information retrieval system allows indexing in di erent elds.
For example, the index for a collection of documents might have one eld for
title, another for subheadings, and another for body text. Among other things,
elds can be used to di erentiate the importance of a word based on where
it appears: a word that appears in the title of a document is probably more
meaningful than one that only appears once in its body.</p>
      <p>A nave approach to indexing a collection of datasets would be to simply
treat each table as a document, and have separate elds for the label, column
headings, and (possibly) values. When terms are used consistently and the user
is familiar with the terminology, this will often work well. However, this approach
has several weaknesses:
1. Any query on values has lost context of what column the value appears in
and what identifying information might be present elsewhere in the same
row. For example, a table that contains capitals like (Paris, France) and
(Austin, Texas) is unlikely to be relevant to a query about \Paris Texas"
2. It is di cult to determine which new terms can be used to re ne the query.</p>
      <p>Users would need to download some of the datasets and choose distinctive
terms from the most relevant ones.
3. A user's constraint could be represented in di erent tables in very di erent
ways. If the user is looking for \California Housing Prices", there may be a
table with some variant of that name, there may be a \Real Estate Prices"
table with rows speci c to California, or there may be a \Housing Prices"
table that has a column for each state, including California. A user should
be able to explore the collection to see how the data is organized and what
terminology is used.</p>
      <p>We have proposed cell-centric indexing as a novel way to address the problems
above. Rather than treating the table as the indexed object, each datum (cell
in the table) is an indexed object. In its simplest form, we propose four elds:
content : the value of the cell, title: the label of the dataset the cell appears in,
columnName: the header of the column the cell appears in, and rowContext :
the values in all cells in the same row as the indexed cell. Formally, a cell value
Vi;j from table T = hl; H; V i can be indexed with: content = Vi;j , title = l,
columnName = hj , and rowcontext = Sn</p>
      <p>k=1 Vi;k. This index would allow users to
nd all cells that have a column header token in common regardless of dataset,
or all cells that appear in the same row as some identifying token, or look for
the occurrence of speci c values in speci c columns.</p>
      <p>However, in this form, users still need to know which keywords to use and
which elds to use them in. A cell-centric index alone is not helpful to a user
who is not already familiar with the collection of datasets. In order to support
the user in exploring the data, we propose the abstraction conditional frequency
vectors (CFVs). Let I be a set of items, D be a set of descriptors (e.g., tags that
describe the items), and F I D be a set of item and descriptor pairs hxi; dii.
Let Q be a query, where Q(F ) F represents the pairs for only those items that
match Q. Then a CFV for Q and F is a set of descriptor-frequency pairs where
the frequency is the number of times that the corresponding descriptor occurs
within Q(F ): fhd; f i j f = #fhx; dij hx; di 2 Q(F )gg. For cell-centric indexing,
the items I are the set of all cells regardless of source dataset, and Fi pairs cells
with terms from the i-th eld. Typically, we sort the CFV in terms of descending
frequency.</p>
      <p>Figure 1 shows the response of our prototype interface to the query with
title=\olympics" and content =\athens". It displays a CFV for each eld as a
histogram; the longer the red bar, the more frequently that term co-occurs with the
query. The CFV for the title eld is hholympics,32i, hmedalists,21i,hlist,14i,...i.
As we can see, 15 datasets contain matches, most of which have \medalists" in
the title and \games" in the column name. The user can re ne this query and
create new histograms by clicking on any terms in the result. For example, if the
user clicks on \kenya" in the title histogram, the system will display a new set
of histograms summarizing the datasets that have \athens" as a content eld
and both \olympics" and \kenya" in the title. The user explores the dataset
collection by adding terms to and removing terms from the query.
4</p>
    </sec>
    <sec id="sec-4">
      <title>System Architecture</title>
      <p>The architecture of the system is depicted in Figure 2. At the core of our system
is an Elasticsearch server. Elasticsearch is a scalable, distributed search engine
that also supports complex analytics. Our system has two main functions: 1)
parse collections of datasets, map them into the elds of a cell-centric index,
and send indexing requests to Elasticsearch and 2) given a user query, issue a
series of queries to Elasticsearch and construct histograms (CFVs) for each eld.
In Elasticsearch, a mapping de nes how a document will be indexed: what elds
will be used and how they will be processed. In cell-centric indexing the cell is
the document, and our index must have elds that describe cells. Our mapping
is summarized in Table 1. In addition to the four elds mentioned in Section 3,
we have elds for the fullTitle (used to identify which speci c datasets match the
query) and metadata such as tags, notes, and organization. Note, that content
is divided into two elds: content and contentNumeric, for reasons that will be
described below. For each eld, we give its type and, if applicable, the analyzer
used to process text from the eld.</p>
      <p>We use three types of elds: text, keyword, and double. Text type elds are
tokenized and processed by word analyzers, whereas keyword type elds are
indexed as is (without tokenization or further processing). Double type elds are
used to store 64-bit oating point numbers. Most of our elds are text elds, but
contentNumeric is a double eld, which allows it to store both integer and real
numeric values, and fullTitle is a keyword eld, since we want users to be able
to view the complete name of the dataset in the result.</p>
      <p>All text elds require an analyzer which determines how to tokenize the eld
and if any additional processing is required. The stop analyzer divides text at
all non-letter characters and removes 33 stop words (such as \a", \the", '\to",
etc.). For most elds, we use the stop analyzer, but we use the wordDelimiter
analyzer for the colunnName eld. In addition to dividing text at all non-letter
characters, it also divides text at letter case transitions (e.g., \birthDate" is
tokenized to \birth" and \date"). This analyzer does not remove stop words.
The system loads each dataset using the following process:
1. Read the metadata, which can include title, tags, notes and organization. If
the original table is formatted as CSV, then this data might be contained in
a separate le in the same directory, or as a row in a repository index le.
If the table is formatted using JSON, the metadata may be speci ed along
with the content, and there may be many datasets described in a single le.
2. Read the column headings of the data: hh1; h2; :::; hni
3. For each row in the dataset:
(a) Read the row values: hv1; v2; :::; vni
(b) Create rowContext by concatenating the values in the row. Note, to
avoid creating di erent large context strings for each value in the row,
we create a single rowContext. This means that each value is also part
of its own row context. This decision helps make the system more e
cient. An additional e ciency consideration is that each value included
in rowContext is truncated to the rst 100 characters.
(c) Build an index request for each cell value vi. If the content is numeric
(integer or real), it will be indexed in the contentNumeric eld; otherwise
it is indexed in the content eld. The columnName eld will be indexed
with the corresponding header hi. The title is indexed twice, once as a
tokenized eld that can be used in queries, and again as keyword eld
that preserves the order of the title and can be used to precisely identify
the dataset the cell originated from. All other metadata elds are indexed
in a straight-forward way.
(d) For e ciency, index requests are batched and sent to the server in bulk.</p>
      <p>Synchronization is disabled during bulk loading to avoid excessive delays.
4.3</p>
      <sec id="sec-4-1">
        <title>Querying the index</title>
        <p>Algorithm 1 illustrates how a query is processed and how the results are
composed. We can get a CFV for a particular eld and query using Elasticsearch's
term aggregation feature. This feature returns a list of terms that appear in the
selected documents along with their frequency. Since we want users to be able to
more accurately locate data that matches their interests, each query generates
histograms for the title, columnName, content, rowContext, and fullTitle elds.</p>
        <p>Unlike textual terms, numeric terms exhibit a greater variability. Histograms
built using distinct numeric strings are unlikely to have signi cant value. For
example, \135", \135.0" and \1.35E+2" are all equivalent, while some users
might consider \135.0001" to be close enough. To address this, we create ranges
over numeric values. Elasticsearch supports a histogram aggregation which can
be used to build these distributions. The algorithm evaluated in this paper builds
10 buckets of equal size, although our team is currently experimenting with more
sophisticated algorithms, including one in which we remove outliers, compute the
mean and standard deviation over the remaining values, and then specify the
buckets to have a width of one standard deviation with one bucket centered over
the mean. Once the numeric content histogram is created, this is merged with
the content histogram.</p>
        <p>Algorithm 1: Process of Query Execution</p>
        <p>Data: Query q
1 Issue query q requesting term aggregations for title, columnName, content,
rowContext and fullTitle;
2 Build CFVs from the query response ;
3 if results contain contentNumeric data then
4 Reissue q requesting histogram aggregation on contentNumeric data ;
5 Merge results with content CFV ;
6 Return columnName, rowContext, title, content, and fullTitle CFVs ;</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>In this paper, we evaluate whether the proposed system can be scaled to
realworld collections of datasets while providing acceptable response times to user
queries. All of our experiments are conducted using a Lehigh-hosted server. The
server has one Intel(R) Xeon(R)Silver 4110 CPU @ 2.10GHz, with 8 cores (16
threads) and a total of 96GB RAM. It runs Java 11.0.7 and Elasticsearch 6.8.8.
5.1</p>
      <sec id="sec-5-1">
        <title>Scalability of Loading Datasets</title>
        <p>
          To evaluate the scalability of the system with respect to data loading, we load
two di erent collections of tables. The WikiTables corpus4 contains 1.6 million
tables extracted from Wikipedia. We created ve di erent subsets of the data,
where each subsequent set contains all of the data in the previous set. The
data.gov corpus [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] contains 7,485 tables downloaded from data.gov, the open
data portal of the United States federal government. We created three subsets
of this data, where each subsequent subset contained the prior subset. In both
cases, we identify the subsets by size of the raw data les. Due to the variability
of length in the values of di erent datasets, the number of cells is not necessarily
proportional to the directory size. Also, the WikiTables corpus is stored in JSON,
which leads to larger sizes than the CSV les of the data.gov corpus.
        </p>
        <p>We loaded each subset and recorded the resulting index size in MB and the
time to load the data in minutes. Table 2 displays the results for the WikiTables
corpus. On average, the system can index 1 million table cells per minute and
the resulting index is 100 bytes per cell. Table 3 shows the same experiment
conducted on three di erent sizes of data.gov data. Here, the system only loads
650; 000 cells per minute and index size is more variable, although the largest
set requires 110 bytes per cell. We hypothesize that the slower throughput of
data.gov is due to structural di erences in the tables: data.gov tables are likely
to have more columns and longer values in those columns, leading to signi cantly
longer rowContext. However, due to the e ciencies of inverted indices, this cost
is borne in indexing time (due to longer indexing messages), and the index size
is not a ected as much. Despite the di erence in throughput, in both cases the
load time scales linearly with the number of the cells, as depicted in Figure 3
and Figure 4.</p>
        <p>We will note that the resulting Elasticsearch index is always larger than the
original input data. For WikiTables, this is about 2.5x larger, while for data.gov
it is 1.5x to 1.7x larger5. We have experimented with an alternative design that
uses two indices: one for indexing cell-speci c information and another for
indexing table-speci c information. The alternate design reduced index storage,
but increased overall query time. Given that disk space is relatively cheap, and
Elasticsearch makes it easy to add additional machines to scale up the size of a
cluster, we chose to optimize query speed.
4 http://websail-fe.cs.northwestern.edu/TabEL/tables.json.gz
5 For the smallest size, the di erence is 6.3x larger, but we believe this is due to
mandatory overhead</p>
        <p>Directory Size (MB) Size of Index (MB) Num Cells Time (mins)
99.3 249.0 2,432,826 2.4
500.3 1,292.6 12,888,268 13.0
999.5 2,614.3 26,133,840 26.6
2,002.0 5,255.6 52,326,690 53.5
3,940.0 10,277.5 102,926,958 106.3
Tab. 2: Result of uploading various sizes of folders of JSON les from WikiTables
Directory Size (MB) Size of Index (MB) Num Cells Time (mins)
26.0 163.3 1,835,238 2.9
466.3 694.0 5,398,841 8.7
756.4 1295.2 11,868,845 17.9</p>
        <p>Tab. 3: Result of Uploading Various Sizes of Folders of CSV Files from DataGov
5.2</p>
      </sec>
      <sec id="sec-5-2">
        <title>Extensive Query Simulation</title>
        <p>With the purpose of testing the relationship between size of the index and query
e ciency, we created three data collections from the WikiTables corpus. This
results in data les of size 0.8GB, 2.4GB, and 4GB; where each smaller index's
source is a subset of each larger index. Table 4 shows indexes' basic information.</p>
        <p>SoSuirzcee(FGoBld)er CNelulsmibneirndofex Index Size (GB)
0.8 20,381,595 2.1
2.4 61,602,873 6.2
4.0 102,926,958 10.3</p>
        <p>Tab. 4: Statistics on WikiTable datasets used for query experiment</p>
        <p>In order to evaluate the system's performance over a wide variety of
possible queries, we designed a random query model. The rst step of user query
simulation is to create seed lists containing query terms that we can use as the
\user's" starting lter. To do so, we build four CFVs using the empty query: title,
columnName, content, and rowContext. Then we choose an appropriate range of
frequency to perform our test; high frequency seeds are not desired because they
do not su ciently lter the results to be useful in exploration (in e ect, they
are analogous to stop words in traditional search); low frequency seeds are not
suitable either, since we are simulating real users' choices and they are less likely
to query for an uncommon word. Given these criteria, we chose to select terms
Fig. 4: Load time vs. number of cells for
data.gov corpus. For comparison, the
relevant range of WikiTables is also plotted.
with a frequency between 10,000 to 50,000. Then we create four lists, one each
for seeds from title, columnName, rowContext, and content. Notably, content's
seed list will only contain textual terms, since numerical terms in a dataset's
content eld are indexed separately by the contentNumeric eld. We also
generate a random-category seed list that has all the seeds from the lists above,
so that when we arbitrarily select seeds from this list, each seed's category is
randomized. We expect that users will rarely query on a speci c number, and for
many numbers, their frequency would be too small anyway, so only textual seeds
are included in the seed list. Although some numbers like years might appear in
many user queries, we feel this simpli cation keeps our set of test queries clean.</p>
        <p>In order to simulate real-world testing conditions, we have designed a model
that predicts the behavior of a normal user: we randomly choose a query from
the seed list, execute the query, randomly select one of the resulting histograms,
randomly select one of the top 20 terms from the resulting histogram, append
the new eld:term pair to the existing query and repeat. This process stops after
the query has ve terms, or a resulting histogram has two or fewer terms.</p>
        <p>We then analyze the query performance by comparing di erent starting
categories on the same index or same starting category on three di erent sizes
(0.8GB, 2.4GB, 4.0GB) of indexes. In this process, timers are set at di erent
points to measure the time consumed. We run one thousand iterations of the
process, recording the queries and their times to a log le.</p>
        <p>Since the nal query of each iteration has a maximum length of ve-terms,
we can derive at most ve di erent queries of di erent length from that original
query. To illustrate this, suppose after an iteration, the nal query looks like:
a&amp;b&amp;c&amp;d &amp;e, where each letter represents a one-term query, eg. title:football. We
can obtain ve di erent queries: a, a&amp;b, a&amp;b&amp;c, a&amp;b&amp;c&amp;d, and a&amp;b&amp;c&amp;d &amp;e.</p>
        <p>Five thousand query processes have been performed on each index, one
thousand for each category (title, rowContext, content, columnName, and random).
Generally, one thousand one-term queries are guaranteed, but as the query
becomes longer, the number of queries with corresponding length will be smaller.
Queries with more terms are more restrictive, leading to fewer results. Thus, the
likelihood of satisfying the stopping criteria of too few results becomes greater.
Table 5 displays the nal number of queries based on the number of terms and
the starting category on the 4.0 GB index.</p>
        <p>column name content title row context random total
1-term 1,000 1,000 1,000 1,000 1,000 5,000
2-term 994 999 994 1,000 999 4,986
3-term 911 975 930 976 953 4,745
4-term 706 857 743 835 786 3,927
5-term 508 662 499 607 597 2,873
total 4,119 4,493 4,166 4,418 4,335 21,531
Tab. 5: Total number of queries of each length for each starting category on the 4.0
GB index</p>
        <p>Figure 5 summarizes the performance of the system, assuming the rst query
term comes from the title eld. Here we show the average query time (in
milliseconds) for each index size for each length of query. As the queries grow in length,
execution time decreases. The gaps between 1-term and 2-term, 2-term and
3term are the largest, since it is much more challenging to further improve the
e ciency when the time usage is at around 50 ms level. It can also be observed
that, although queries on larger indexes take more time, as the size of the index
increases linearly, the time consumed grows sublinearly. For example, one-term
queries on the largest index are only 1.7x slower than those on the smallest,
even though this index is 5x larger. This trend holds regardless of the length of
queries or the choice of the query's starting category. Figure 6 illustrates this
by showing a similar pattern when the starting term is selected from the four
categories at random. Together, these results imply that the system can scale to
much larger indices with only a small degradation in performance.</p>
        <p>Next, we investigate whether the type of the starting eld has any signi cant
impact on query performance. Table 6 shows the average and standard deviation
of time consumed by the ve di erent categories of one-term queries when issued
against the 4.0GB collection. For the last row, i.e. random category, the query
elds are randomized to give a more general sense of the overall performance of
the system. In addition to total query time, we break time down into textual
query time (generating the standard histograms for text elds) and numeric
query time (generating the numeric range histograms). Here, we see that title
queries have the greatest variation in overall time, and that for all elds except
content, the numeric range query processing can take a signi cant fraction of
the overall query time (30 - 50%). Note, content queries always have a 0 numeric
query time. This is because all of our queries are seeded with text terms and
numeric range processing is only computed when the ltered cells are numeric.
However, since the query term is textual content, there can be no numeric cells
in the result, and thus no numeric range processing is computed.</p>
        <p>Textual Query Numeric Query Total Query</p>
        <p>Time (ms) Time (ms) Time (ms)
Category Avg. Std. dev Avg. Std. dev Avg. Std. dev</p>
        <p>title 139 112 93 18 232 114
content 147 27 0 0 147 27
rowContext 122 21 121 23 243 36
columnName 126 31 111 32 237 49</p>
        <p>random 157 86 87 47 244 92
Tab. 6: Average and standard deviation of time consumed by textual and numerical
query on one-term query</p>
        <p>Using the largest index, Figure 7 displays the average query times across
the ve di erent eld types on three di erent query lengths. For most types of
queries, the one-term query takes 230-250 ms. However, queries starting with a
content term take less: only 146.9 on average. Recall from Table 6 that content
queries do not need numeric range processing, and are thus faster. Furthermore,
it can also be observed that queries starting with rowContext will often take more
time than other types of queries. These queries have the longest average times
for both three-term and ve-term queries (100.8 ms and 80.6 ms, respectively).
We hypothesize this is because many rowContext terms have a large inverted
index; that is they are associated with many cells.</p>
        <p>One oddity of Figure 7 is that for one-terms queries, rowContext is actually
faster on average than for random queries. In order to understand if outliers play
a role in this observation, we create a box plot of the same data. From Figure 8,
we can see that certain types of queries have more variability in execution times
than others. In particular, one-term queries starting with a random category
have more outliers above the third quartile than rowContext one-term queries.
We also see that the median is slightly below the median of rowContext
oneterm queries. Thus, it is these outliers that lead to a higher average query time.</p>
        <p>Note, given that both title queries and columnName queries have many outliers,
we assume that these are the types of queries that contribute the most to the
variability in the random query result. It can also be observed that, of over 20,000
random queries issued over the index, only three took longer than a second, and
only one of those was (slightly) over two seconds.
Cell-centric indexing is an innovative architecture that enables anyone to conduct
schema-less queries on datasets in an exploratory fashion. In this paper, we
discuss the scalability and query e ciency of our system. We were able to load
the entire 1.6 million table WikiTables corpus in 106 minutes and one-term
queries against this index averaged less than 250 ms each. Queries with more
terms took less time. In general, the server is able to parse and upload source
les in linear time; the index will be approximately 2.6 times as large as the size
of the source folder, which is an acceptable trade o between inexpensive disk
space and running time.</p>
        <p>Despite the small size of our test server and being unable to handle a large
number of requests and responses as currently con gured, we believe it can
provide an easy and e cient way of locating a dataset of interest. Future work
includes system performance enhancements, more usability features and a
thorough user evaluation of the interface.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Acknowledgments</title>
        <p>This material is based upon work supported by the National Science Foundation
under Grant No. III-1816325. Alex Johnson, Xuewei Wang, dePaul Miller, Drake
Johnson, and Keith Register contributed to the implementation of the system.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bozzon</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brambilla</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ceri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fraternali</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Liquid query: Multi-domain exploratory search on the web</article-title>
          .
          <source>In: Proceedings of the 19th International Conference on World Wide Web</source>
          . p.
          <volume>161</volume>
          {
          <fpage>170</fpage>
          . WWW '
          <volume>10</volume>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA (
          <year>2010</year>
          ). https://doi.org/10.1145/1772690.1772708
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , He in, J.,
          <string-name>
            <surname>Davison</surname>
            ,
            <given-names>B.D.</given-names>
          </string-name>
          :
          <article-title>Generating schema labels through dataset content analysis</article-title>
          .
          <source>In: Companion Proceedings of the The Web Conference</source>
          <year>2018</year>
          . p.
          <volume>1515</volume>
          {
          <fpage>1522</fpage>
          . WWW '18,
          <string-name>
            <given-names>International</given-names>
            <surname>World Wide Web Conferences Steering Committee</surname>
          </string-name>
          , Republic and Canton of Geneva, CHE (
          <year>2018</year>
          ). https://doi.org/10.1145/3184558.3191601
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Derthick</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolojejchick</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>S.F.</given-names>
          </string-name>
          :
          <article-title>An interactive visual query environment for exploring data</article-title>
          .
          <source>In: Proceedings of the 10th Annual ACM Symposium on User Interface Software and Technology</source>
          . p.
          <volume>189</volume>
          {
          <fpage>198</fpage>
          . UIST '
          <volume>97</volume>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA (
          <year>1997</year>
          ). https://doi.org/10.1145/263407.263545
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dunaiski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Greene</surname>
            ,
            <given-names>G.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Exploratory search of academic publication and citation data using interactive tag cloud visualizations</article-title>
          .
          <source>Scientometrics</source>
          <volume>110</volume>
          (
          <issue>3</issue>
          ),
          <volume>1539</volume>
          {
          <fpage>1571</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keim</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Justclick: Personalized image recommendation via exploratory search from large-scale ickr images</article-title>
          .
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          <volume>19</volume>
          (
          <issue>2</issue>
          ),
          <volume>273</volume>
          {
          <fpage>288</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ferre</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Hermann, A.:
          <article-title>Semantic search: Reconciling expressive querying and exploratory search</article-title>
          . In: International semantic Web conference. pp.
          <volume>177</volume>
          {
          <fpage>192</fpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ianina</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Golitsyn</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vorontsov</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>: Multi-objective topic modeling for exploratory search in tech news</article-title>
          .
          <source>In: Conference on Arti cial Intelligence and Natural Language</source>
          . pp.
          <volume>181</volume>
          {
          <fpage>193</fpage>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Koh</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kerne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hill</surname>
          </string-name>
          , R.:
          <article-title>Creativity support: Information discovery and exploratory search</article-title>
          .
          <source>In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          . p.
          <volume>895</volume>
          {
          <fpage>896</fpage>
          . SIGIR '
          <volume>07</volume>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA (
          <year>2007</year>
          ). https://doi.org/10.1145/1277741.1277963
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Peltonen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strahl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Floreen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Negative relevance feedback for exploratory search with visual interactive intent modeling</article-title>
          .
          <source>In: Proceedings of the 22nd International Conference on Intelligent User Interfaces</source>
          . p.
          <volume>149</volume>
          {
          <fpage>159</fpage>
          . IUI '
          <volume>17</volume>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA (
          <year>2017</year>
          ). https://doi.org/10.1145/3025171.3025222
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Ruotsalo</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peltonen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eugster</surname>
            ,
            <given-names>M.J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glowacka</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Floreen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , Myllymaki,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Jacucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Kaski</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.:</surname>
          </string-name>
          <article-title>Interactive intent modeling for exploratory search</article-title>
          .
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>36</volume>
          (
          <issue>4</issue>
          ) (Oct
          <year>2018</year>
          ). https://doi.org/10.1145/3231593
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cafarella</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jagadish</surname>
            ,
            <given-names>H.V.</given-names>
          </string-name>
          :
          <article-title>Dbexplorer: Exploratory search in databases</article-title>
          . In: Pitoura,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Maabout</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Koutrika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Marian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Tanca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Manolescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            ,
            <surname>Stefanidis</surname>
          </string-name>
          ,
          <string-name>
            <surname>K</surname>
          </string-name>
          . (eds.)
          <source>Proceedings of the 19th International Conference on Extending Database Technology, EDBT</source>
          <year>2016</year>
          , Bordeaux, France, March 15-16,
          <year>2016</year>
          . pp.
          <volume>89</volume>
          {
          <fpage>100</fpage>
          . OpenProceedings.org (
          <year>2016</year>
          ). https://doi.org/10.5441/002/edbt.
          <year>2016</year>
          .11
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>White</surname>
            ,
            <given-names>R.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          :
          <article-title>Exploratory Search: Beyond the QueryResponse Paradigm</article-title>
          .
          <source>Synthesis Lectures on Information Concepts</source>
          , Retrieval, and Services, Morgan &amp; Claypool Publishers (
          <year>2009</year>
          ). https://doi.org/10.2200/S00174ED1V01Y200901ICR003
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Yogev</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roitman</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carmel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zwerdling</surname>
          </string-name>
          , N.:
          <article-title>Towards expressive exploratory search over entity-relationship data</article-title>
          .
          <source>In: Proceedings of the 21st International Conference on World Wide Web</source>
          . p.
          <volume>83</volume>
          {
          <fpage>92</fpage>
          . WWW '12 Companion, Association for Computing Machinery, New York, NY, USA (
          <year>2012</year>
          ). https://doi.org/10.1145/2187980.2187990
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Priya</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , He in, J.:
          <article-title>Infrastructure for e cient exploration of large scale linked data via contextual tag clouds</article-title>
          .
          <source>In: International Semantic Web Conference</source>
          . pp.
          <volume>687</volume>
          {
          <fpage>702</fpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>