<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Conference on Design of Experimental Search &amp; Information REtrieval Systems, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Exploring Datasets via Cell-Centric Indexing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jef Heflin</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brian D. Davison</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haiyan Jia</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science &amp; Engineering, Lehigh University</institution>
          ,
          <addr-line>113 Research Dr., Bethlehem, PA, 18015</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Journalism and Communication, Lehigh University</institution>
          ,
          <addr-line>33 Coppee Dr., Bethlehem, PA, 18015</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>1</volume>
      <fpage>5</fpage>
      <lpage>18</lpage>
      <abstract>
        <p>We present a novel approach to dataset search and exploration. Cell-centric indexing is a unique indexing strategy that enables a powerful, new interface. The strategy treats individual cells of a table as the indexed unit, and combining this with a number of structure-specific fields enables queries that cannot be answered by a traditional indexing approach. Our interface provides users with an overview of a dataset repository, and allows them to efficiently use various facets to explore the collection and identify datasets that match their interests.</p>
      </abstract>
      <kwd-group>
        <kwd>cell-centric indexing</kwd>
        <kwd>dataset search</kwd>
        <kwd>exploratory interface</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The twenty-first century has experienced an information explosion; data is growing exponentially and users’ information retrieval needs are becoming much more complicated [1]. Given people’s increasing interest in datasets, there is a need for user-friendly search services for data journalists, scientists, decision makers, and the general public to locate datasets that can meet their data needs.</p>
      <p>Even though users, under many circumstances, are not experts in the domain in which they search, they should be able to easily use such an application; the query process should be responsive and efficient. The result should provide a general picture of what the dataset is about, and offer enough information for the searcher to know how likely the dataset is to contain the data that they look for.</p>
      <p>Traditional database management systems group data by tables and then organize this data into rows and columns. When users are aware of the database schema, they can construct queries, but what if users are simply trying to find which tables in a large data lake are relevant to their needs? One approach is to simply index information about the table in various fields: e.g., title, description, columns, etc. While this approach may be sufficient for some queries, in many cases the user will not be able to determine if the table is a perfect match until they have downloaded the (potentially large) table.</p>
      <p>Studies have shown that understanding what is inside the content of a dataset, rather than simply the dataset descriptions and metadata, could be critical for users’ evaluation of whether any of the search results sufficiently matches the search need, especially for non-expert users. For instance, a recent user study [2] has revealed that query refinement, as a result of unsatisfying search results, is negatively associated with user experience with dataset search tools. What reduces the need for query refinement is a preview of the dataset content, which helps users gauge the relevance of the datasets. Similarly, an experimental study that explored novel dataset search engine prototypes found that interfaces with a content preview feature were perceived as more usable. In particular, non-expert users reported greater benefits from the content preview, as they rated the interfaces with higher levels of usefulness, ease of use, usability, and technology adoption intention than expert users did [3]. These findings indicate the strong need for understanding the actual content of datasets, even at the cell level.</p>
      <p>To enable sufficient query refinement for schema-optional queries, we present the novel concept of cell-centric indexing. The key idea is that we use individual cells of a table as the fundamental unit and build inverted indices on these cells. These indices provide different fields that index both the content of the cell and its context. For our purposes, the context includes other cell values in the same row, the name of the column (if available), and metadata about the containing dataset. This approach allows us to refine our search by row descriptors, column descriptors, or both at the same time. In essence we free the data from how it is structured, and schema information, when available, is merely one of the many ways to locate the data of interest. Thus, we take the view that, fundamentally, users are searching for specific data (i.e., particular cells or collections thereof), and the tables are merely artifacts of how the data is stored.</p>
      <p>We recognize that this approach also has downsides. In particular, an index of cells (and their contexts) will incur substantial storage overhead in comparison to an index of dataset metadata. Moreover, if the desired search result is one or more datasets, at run-time there will be additional processing to assemble the cell-specific results to enable retrieval and ranking at that level of granularity. However, our cell-centric approach gives us some additional flexibility, and we believe that good system design, appropriate data structures, and efficient algorithms can ameliorate the costs.</p>
      <p>This paper incorporates material previously presented in a poster [4] and a workshop [5]. The contributions of this paper are:</p>
      <p>• We propose cell-centric indexing as an innovative approach to an information retrieval system. A cell-centric index enables a user to find data without having to know the pre-existing structure of each table;</p>
      <p>• We describe the mechanisms of one implementation of a cell-centric dataset search engine, including the structure and method of data storage and querying of our server; and,</p>
      <p>• We describe a novel prototype interface that leverages cell-centric indexing in order to give users summaries of a dataset repository in terms of titles, content, and column names. The user can filter on any of these facets to generate more specific summaries.</p>
      <p>The rest of the paper is organized as follows: we first discuss related work, briefly describe the idea of cell-centric indexing and its advantages and disadvantages, introduce the structure of our server and the methodology involved in querying, and finally describe a prototype interface.</p>
      <sec id="sec-1-2">
        <title>2. Related Work</title>
        <p>Scholars have investigated exploratory search to help searchers succeed in an unfamiliar area by proposing novel information retrieval algorithms and systems; some of them propose innovative user interfaces, while others try to predict the user’s information need and use the prediction to better facilitate the subsequent interaction. Chapman et al. [6] have reviewed different approaches to dataset search. Google’s dataset search [7] is an example of a traditional approach to indexing web datasets: the system crawls the Web and indexes datasets that have metadata expressed in the schema.org (or a related) format. The only required properties are name and description.</p>
        <p>Derthick et al. [8] describe a visual query language that dynamically links queries and visualizations. The system helps a user to locate information in a multi-object database, illustrates the complicated relationships between attributes of multiple objects, and assists the user to clearly express their information retrieval needs in their queries. Similarly, Yogev et al. [9] demonstrate an exploratory search approach for entity-relationship data, combining an expressive query language, exploratory search, and entity-relationship graph navigation. Their work enables people with little to no query language expertise to query rich entity-relationship data.</p>
        <p>In the domain of web search, Koh et al. [10] devise a user interface that supports creativity in education and research. The model allows users to send their query to their desired commercial search engine or social platform in iterations. As the system goes through each iteration, it combines the text and image results into a composition space. Addressing a similar problem, Bozzon et al. [11] design an interactive user interface that employs exploratory search and Yahoo! Query Language (YQL) to empower users to iteratively investigate results across multiple sources.</p>
        <p>A tag cloud is a common and useful visualization of data that represents relative importance or frequency via size. Some researchers have adapted this idea to visualize query results. Fan et al. [12] focus on designing an interactive user interface with image clouds. The interface enables users to comprehend their latent query intentions and direct the system to form their personalized image recommendations. Dunaiski et al. [13] design and evaluate a search engine that incorporates exploratory search to ease researchers’ scouting for academic publications and citation data. Its user interface unites concept lattices and tag clouds to present the query result in a readable composition to promote further exploratory search. On the other hand, Zhang et al. focus their work on knowledge graph data [14]. They combine faceted browsing with contextual tag clouds to create a system that allows users to rapidly explore graphs with billions of edges by visualizing conditional dependencies between selected classes and other data. Although they don’t use tag clouds, Singh et al. [15] also display conditional dependencies in their data outline tool. For a given pivot attribute and set of automatically determined compare attributes, they show conditional values, grouped into clusters of interaction units.</p>
        <p>Other scholars have investigated query languages and models. Ianina et al. [1] concentrate on developing an exploratory search system that gives the user a way to conduct long text queries while minimizing the risk of returning empty results, since the iterative “query–browse–refine” process [16] may be time-consuming and require expertise. Meanwhile, Ferré and Hermann [17] focus more on the query language, LISQL, and offer a search system that integrates LISQL and faceted search. The system helps users to build complex queries and enlightens users about their position in the data navigation process.</p>
        <p>Yet another approach is to predict the user’s search intent so that better search results can be presented. Peltonen et al. [18] utilize negative relevance feedback in an interactive intent model to direct the search. Negative relevance feedback predicts the most relevant keywords, which are later arranged in a radar graph, where the center denotes the user, to represent the user’s intent. Likewise, Ruotsalo et al. [19] propose a similar intent radar model that predicts a user’s next query in an interactive loop. The model uses reinforcement learning to control the exploration and exploitation of the results.</p>
      </sec>
      <sec id="sec-1-3">
        <title>3. Cell-Centric Indexing</title>
        <p>We define a table as T = ⟨l, H, V⟩ where l is the label of the table, H = ⟨h_1, h_2, ..., h_m⟩ is a list of the column headers, and V is an n × m matrix of the values contained in the table. v_i,j refers to the value in the i-th row and the j-th column, which has the heading h_j. We note that this model can be easily extended to include other metadata, as appropriate.</p>
        <p>A naïve approach to indexing a collection of datasets would be to simply treat each table as a document, and have separate fields for the label, column headings, and (possibly) values. When terms are used consistently and the user is familiar with the terminology, this may work well. However, this approach has several weaknesses:</p>
        <p>• Any query on values has lost the context of what column the value appears in and what identifying information might be present elsewhere in the same row. For example, a table that contains capitals like (Paris, France) and (Austin, Texas) is unlikely to be relevant to a query about “Paris Texas” but would otherwise match.</p>
        <p>• It is difficult to determine which new terms can be used to refine the query. Users would need to download some of the datasets and choose distinctive terms from the most relevant ones.</p>
        <p>• A user’s constraint could be represented in different tables in very different ways. If the user is looking for “California Housing Prices”, there may be a table with some variant of that name, there may be a “Real Estate Prices” table with rows specific to California, or there may be a “Housing Prices” table that has a column for each state, including California. A user should be able to explore the collection to see how the data is organized and what terminology is used.</p>
        <p>We have proposed cell-centric indexing as a novel way to address the problems above. Rather than treating the table as the indexed object, each datum (cell in the table) is an indexed object. In its simplest form, we propose four fields: content: the value of the cell; title: the label of the dataset the cell appears in; columnName: the header of the column the cell appears in; and rowContext: the values in all cells in the same row as the indexed cell. Formally, a cell value v_i,j from table T = ⟨l, H, V⟩ can be indexed with: content = v_i,j, title = l, columnName = h_j, and rowContext = v_i,1 ∪ v_i,2 ∪ ... ∪ v_i,m. This index would allow users to find all cells that have a column header token in common regardless of dataset, or all cells that appear in the same row as some identifying token, or to look for the occurrence of specific values in specific columns.</p>
        <p>However, in this form, users still need to know which keywords to use and which fields to use them in. A cell-centric index alone is not helpful to a user who is not already familiar with the collection of datasets. In order to support the user in exploring the data, we propose the abstraction of conditional frequency vectors (CFVs). Let I be a set of items, D be a set of descriptors (e.g., tags that describe the items), and P ⊆ I × D be a set of item and descriptor pairs ⟨i, d⟩. Let Q be a query, where P(Q) ⊆ P represents the pairs for only those items that match Q. Then a CFV for Q and P is a set of descriptor-frequency pairs where the frequency is the number of times that the corresponding descriptor occurs within P(Q): {⟨d, f⟩ | f = #{⟨i, d⟩ | ⟨i, d⟩ ∈ P(Q)}}. For cell-centric indexing, the items I are the set of all cells regardless of source dataset, and P_j pairs cells with terms from the j-th field. For example, if a cell c_5 was in a column titled “Real Estate Price”, then P_columnName includes the pairs ⟨c_5, real⟩, ⟨c_5, estate⟩, and ⟨c_5, price⟩. Typically, we sort the CFV in order of descending frequency.</p>
      </sec>
      <sec id="sec-1-4">
        <title>4. System Architecture</title>
        <p>The architecture of the system is depicted in Figure 1. At the core of our system is an Elasticsearch server. Elasticsearch [20] is a scalable, distributed search engine that also supports complex analytics. Our system has two main functions: 1) parse collections of datasets, map them into the fields of a cell-centric index, and send indexing requests to Elasticsearch; and, 2) given a user query, issue a series of queries to Elasticsearch and construct histograms (CFVs) for each field. The Query Processor translates our high-level query API into specific Elasticsearch queries, and assembles the results into CFVs.</p>
      </sec>
    </sec>
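The conditional frequency vector abstraction can be illustrated with a minimal in-memory sketch. This is our own Python rendering of the definition (the actual system computes CFVs via Elasticsearch aggregations, as described later); the cell and descriptor names are toy examples.

```python
from collections import Counter

def cfv(pairs, matching_items):
    """Conditional frequency vector: for the item/descriptor pairs whose
    item matches the query, count each descriptor and return the pairs
    sorted by descending frequency."""
    counts = Counter(d for i, d in pairs if i in matching_items)
    return counts.most_common()

# Toy pairs for the columnName field: cell c5 came from a column titled
# "Real Estate Price", so it pairs with each column-name token.
pairs = [("c5", "real"), ("c5", "estate"), ("c5", "price"),
         ("c6", "price"), ("c7", "city")]

# A query matching cells c5 and c6: "price" co-occurs with both.
print(cfv(pairs, {"c5", "c6"}))  # [('price', 2), ('real', 1), ('estate', 1)]
```

Sorting by descending frequency means the most informative refinement terms surface first, which is exactly what the histogram interface later displays.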
    <sec id="sec-2">
      <title>4.1. Index Definition</title>
      <p>In Elasticsearch, a mapping defines how a document will be indexed: what fields will be used and how they will be processed. In cell-centric indexing, the cell is the document, and our index must have fields that describe cells.</p>
    </sec>
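A cell-centric index definition of this kind can be sketched as the body of an Elasticsearch create-index request. This is a hedged illustration under our own naming: the custom "wordDelimiter" analyzer configuration shown here is our assumption about how such an analyzer could be composed from Elasticsearch's built-in `letter` tokenizer and `word_delimiter_graph` filter, not the paper's exact configuration.

```python
def cell_index_body():
    """Sketch of a cell-centric index body (settings + mappings) in the
    shape Elasticsearch's create-index API expects."""
    def text(analyzer):
        return {"type": "text", "analyzer": analyzer}

    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # Assumed composition: split at non-letters, and also
                    # at case transitions ("birthDate" -> "birth", "date").
                    "wordDelimiter": {
                        "type": "custom",
                        "tokenizer": "letter",
                        "filter": ["word_delimiter_graph", "lowercase"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "columnName": text("wordDelimiter"),
                "content": text("stop"),
                "contentNumeric": {"type": "double"},
                "rowContext": text("stop"),
                "title": text("stop"),
                "fullTitle": {"type": "keyword"},
                "tags": text("stop"),
                "notes": text("stop"),
                "organization": text("stop"),
                "setId": {"type": "keyword"},
            }
        },
    }

body = cell_index_body()
```

The keyword fields (fullTitle, setId) are deliberately left unanalyzed so that exact dataset identities survive intact.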
    <sec id="sec-3">
      <p>Our mapping is summarized in Table 1. In addition to the four fields mentioned in Section 3, we have fields for the fullTitle (used to identify which specific datasets match the query) and metadata such as tags, notes, organization, and setId. The setId allows us to distinguish between different datasets with the same title, and to get an accurate count of how many datasets match a query. Note that content is divided into two fields, content and contentNumeric, for reasons that will be described below. For each field, we give its type and, if applicable, the analyzer used to process text from the field.</p>
      <table-wrap id="table-1">
        <label>Table 1</label>
        <caption>
          <p>Index mapping: the type of each field and, if applicable, its analyzer.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Field</th><th>Type</th><th>Analyzer</th></tr>
          </thead>
          <tbody>
            <tr><td>columnName</td><td>text</td><td>wordDelimiter</td></tr>
            <tr><td>content</td><td>text</td><td>stop</td></tr>
            <tr><td>contentNumeric</td><td>double</td><td>N/A</td></tr>
            <tr><td>rowContext</td><td>text</td><td>stop</td></tr>
            <tr><td>title</td><td>text</td><td>stop</td></tr>
            <tr><td>fullTitle</td><td>keyword</td><td>N/A</td></tr>
            <tr><td>tags</td><td>text</td><td>stop</td></tr>
            <tr><td>notes</td><td>text</td><td>stop</td></tr>
            <tr><td>organization</td><td>text</td><td>stop</td></tr>
            <tr><td>setId</td><td>keyword</td><td>N/A</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>We use three types of fields: text, keyword, and double. Text fields are tokenized and processed by word analyzers, whereas keyword fields are indexed as is (without tokenization or further processing). Double fields are used to store 64-bit floating point numbers. Most of our fields are text fields, but contentNumeric is a double field, which allows it to store both integer and real numeric values, and both fullTitle and setId are keyword fields, since we want users to be able to view the complete name of the dataset in the result, and there is no need to parse setIds.</p>
      <p>All text fields require an analyzer, which determines how to tokenize the field and whether any additional processing is required. We use two built-in Elasticsearch analyzers: the stop analyzer divides text at all non-letter characters and removes 33 stop words (such as “a”, “the”, “to”, etc.). For most fields, we use the stop analyzer, but we use the wordDelimiter analyzer for the columnName field. In addition to dividing text at all non-letter characters, it also divides text at letter case transitions (e.g., “birthDate” is tokenized to “birth” and “date”). This analyzer does not remove stop words.</p>
      <sec id="sec-3-1">
        <title>4.2. Indexing a Dataset</title>
        <p>The system loads each dataset using the following process:</p>
        <p>1. Read the metadata, which can include title, tags, notes, and organization. If the original table is formatted as CSV, this data might be contained in a separate file in the same directory, or in a row of a repository index file. If the table is formatted using JSON, the metadata may be specified along with the content, and there may be many datasets described in a single file.</p>
        <p>2. Read the column headings ⟨h_1, h_2, ..., h_m⟩.</p>
        <p>3. For each row in the dataset:</p>
        <p>a) Read the row values ⟨v_i,1, v_i,2, ..., v_i,m⟩.</p>
        <p>b) Create rowContext by concatenating the values in the row. Note that, to avoid creating a different large context string for each value in the row, we create a single rowContext. This means that each value is also part of its own row context. This decision helps make the system more efficient. An additional efficiency consideration is that each value included in rowContext is truncated to its first 100 characters.</p>
        <p>c) Build an index request for each cell value v_i,j. If the content is numeric (integer or real), it is indexed in the contentNumeric field; otherwise it is indexed in the content field. The columnName field is indexed with the corresponding header h_j. The title is indexed twice: once as a tokenized field that can be used in queries, and again as a keyword field that preserves the order of the title and can be used to precisely identify the dataset the cell originated from. All other metadata fields are indexed in a straightforward way.</p>
      </sec>
    </sec>
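The loading process in Section 4.2 can be sketched as a generator that turns one table row into one document per cell. This is our own minimal Python rendering (function and variable names are ours); field names follow Table 1, and each rowContext value is truncated to 100 characters as described above.

```python
def cell_documents(title, headers, rows, metadata=None):
    """Yield one index document per cell of a table.

    A single rowContext string is built per row (so each value is also
    part of its own row context), with each value truncated to its
    first 100 characters.
    """
    metadata = metadata or {}
    for row in rows:
        row_context = " ".join(str(v)[:100] for v in row)
        for header, value in zip(headers, row):
            doc = {
                "title": title,        # tokenized, usable in queries
                "fullTitle": title,    # keyword, identifies the dataset
                "columnName": header,
                "rowContext": row_context,
                **metadata,            # tags, notes, organization, setId
            }
            # Numeric content goes to contentNumeric, all else to content.
            try:
                doc["contentNumeric"] = float(value)
            except (TypeError, ValueError):
                doc["content"] = str(value)
            yield doc

# A one-row toy table: "Kenya" is textual, 1 is numeric.
docs = list(cell_documents("2004 Olympic Medals",
                           ["Country", "Gold"],
                           [["Kenya", 1]]))
```

Each emitted dictionary corresponds to one index request against the mapping of Table 1.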
    <sec id="sec-4">
      <title>4.3. Querying the Index</title>
      <p>Our Query Processor takes a conjunctive, fielded query and returns a histogram for each response field. The response fields are fields that contain information that helps the user understand the characteristics of cells that match the query. Currently, the response fields are title, columnName, content, rowContext, and fullTitle. Given a query Q, the query process is:</p>
      <p>1. Issue query Q requesting term aggregations for title, columnName, content, rowContext, and fullTitle. Term aggregations are a feature of Elasticsearch that return a list of terms that appear in the selected documents, along with their frequencies, i.e., CFVs for Q.</p>
      <p>2. Calculate the min mn and max mx for matching contentNumeric data.</p>
      <p>3. Select a representative set S of matching numeric content by issuing a percentile query against the contentNumeric field that excludes the top and bottom 5 percent of the data.</p>
      <p>4. Calculate the mean μ and standard deviation σ of set S.</p>
      <p>5. If mn &lt; μ − 1.5σ and mx &gt; μ + 1.5σ (i.e., the numeric data is not particularly skewed), build an aggregation query for contentNumeric data using ranges calculated from μ and σ: the lowest range is [mn, μ − 1.5σ] and the highest is [μ + 1.5σ, mx], with 3 intermediate ranges of uniform size and the middle range [μ − 0.5σ, μ + 0.5σ]. If the data are skewed, the ranges are shifted appropriately.</p>
      <p>6. Issue an Elasticsearch histogram aggregation query with the calculated ranges. Treat each range as a content term, and insert these terms and their frequencies into the content CFV.</p>
      <p>7. Return CFVs for each response field.</p>
      <p>Much of the processing above allows the system to dynamically determine buckets for numeric content that provide a useful picture of its distribution. Unlike textual terms, numeric terms exhibit greater variability. Histograms built using distinct numeric strings are unlikely to have significant value. For example, “135”, “135.0” and “1.35E+2” are all equivalent, while many users might consider “135.0001” to be close enough. To address this, we create ranges over numeric values. Our approach computes the mean and standard deviation over the middle 90% of the data, thus removing the influence of outliers, and then specifies the buckets to have a width of one standard deviation, with one bucket centered over the mean. Once the histogram of numeric ranges is created, its data is merged with the content histogram to produce a single histogram that shows frequencies of textual terms and numeric ranges.</p>
    </sec>
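The bucket construction in steps 2–6 can be written out for the unskewed case (mn &lt; μ − 1.5σ and mx &gt; μ + 1.5σ): three middle buckets one standard deviation wide, centered on the mean, with the outer buckets absorbing the tails. A minimal sketch under our own naming:

```python
def numeric_ranges(mn, mx, mu, sigma):
    """Bucket boundaries for the contentNumeric histogram (unskewed
    case). Boundaries: mn, mu - 1.5*sigma, mu - 0.5*sigma,
    mu + 0.5*sigma, mu + 1.5*sigma, mx."""
    edges = [mn, mu - 1.5 * sigma, mu - 0.5 * sigma,
             mu + 0.5 * sigma, mu + 1.5 * sigma, mx]
    # Pair consecutive edges into (low, high) ranges.
    return list(zip(edges, edges[1:]))

# e.g. mean 10, standard deviation 2, observed min 0 and max 30:
print(numeric_ranges(0, 30, 10, 2))
# [(0, 7.0), (7.0, 9.0), (9.0, 11.0), (11.0, 13.0), (13.0, 30)]
```

Each resulting range would then be issued as one bucket of an Elasticsearch range aggregation and treated as a single content "term" in the merged histogram.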
    <sec id="sec-5">
      <p>An example of a typical use case is demonstrated using Figures 2–4. In this specific case, the user wants to find a dataset containing data on Kenya’s performance in the 2004 Athens Olympics. Initially, the user is presented with the graphs in Fig. 2. These histograms show the most frequent title and column terms in the collection of indexed datasets. However, the example histograms do not initially show anything regarding the Olympics. By using the “More Items” button at the bottom of the title histogram, the user can find the term Olympics and add it directly to their query. After this term is added, the screen changes to that shown in Fig. 3. The user can now look through all 4 histograms and decide which term best helps them get to their desired data. Once again using the “More Items” button, the term Athens can be found.</p>
      <sec id="sec-5-1">
        <title>5.1. Pre-query Histograms</title>
        <p>Before any search parameters are set, the user is shown two pre-query histograms that return up to the 50 most frequent title and column name tokens within the current repository (see Fig. 2). The column name and title histograms provide a good overview and are vital in allowing the user to explore the datasets without prior knowledge of their contents. The pre-query histograms are presented to the user when there are no active queries, such as when the page is initially loaded or when all queries have been deleted. Clicking on a histogram bar will automatically add the corresponding term to the query and generate the standard set of histograms.</p>
      </sec>
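A pre-query histogram of this kind corresponds to a terms aggregation with no query constraint. The request body below is our own sketch of such a call (names are ours, and we elide the detail that aggregating on an analyzed text field in Elasticsearch additionally requires fielddata or a keyword sub-field):

```python
def pre_query_histogram_request(field, size=50):
    """Sketch of an Elasticsearch request body for a pre-query
    histogram: the `size` most frequent tokens of `field` across all
    indexed cells, with no hits returned (aggregations only)."""
    return {
        "size": 0,                   # suppress individual hits
        "query": {"match_all": {}},  # no active query terms yet
        "aggs": {
            "histogram": {
                "terms": {"field": field, "size": size}
            }
        },
    }

req = pre_query_histogram_request("title")
```

The same shape, with `match_all` replaced by the user's conjunctive query, would produce the per-field result histograms of the next section.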
      <sec id="sec-5-2">
        <title>5.2. Results Histograms</title>
        <p>The standard screen displays the user’s current query and five histograms. Each histogram is associated with a field, and tokens are sorted in descending frequency of co-occurrence with the query. The length of a bar indicates how many cells match the query. As with the pre-query histograms, clicking on a bar adds the associated term to the query and generates a new set of result histograms. Below each histogram is an option to show more results in the histogram. Initially, each histogram presents the top 10 results; however, the top 25 results are pre-fetched, which allows the newly requested results to be automatically added to the histogram.</p>
        <p>Due to the connection between the count of matched cells and bar length, there is the possibility that the first bar will be significantly larger than all remaining bars, making them difficult to see or select. To combat this, we compare the counts of the two most frequent results. If the first result contains 10 times more hits than the second most frequent, we change the scale of the histograms to logarithmic, thus making it easier to visualize distinctions in skewed distributions.</p>
        <p>Figure 3 shows the response of our prototype interface to the query with title=“olympics”. It displays a CFV for each field as a histogram; the longer the red bar, the more frequently that term co-occurs with the query. As we can see, 318 datasets contain matches, and after “olympics,” the most common title word is “summer.” The most frequently occurring terms in the column names of matching cells are “RANK” and “attempts”. The content histogram combines terms with numeric ranges. In particular, the first, second, and fifth rows were all inserted by the numeric range processing (as described previously in Sect. 4.3). For this query, there are many cells with values between 0 and 4, and slightly fewer with values between 4 and 21. The next most common content values are the terms “olympics” and “summer.” Note that the figure does not show the histogram for full titles that corresponds to this query (but it is still part of the prototype interface). As discussed in the next paragraph, this histogram indicates how many matching cells are in each dataset.</p>
        <p>The user can refine their query and create new histograms by clicking on any terms in the result. For example, if the user clicks on “athens” in the content histogram (after scrolling down), the system will display a new set of histograms summarizing the datasets that have “athens” as a content field and “olympics” in the title; in other words, the query will be title=“olympics” and content=“athens”. For this refinement, we show the Full Title histogram (see Fig. 4). In this histogram, the bars represent the number of cells in a dataset that match the user’s query. The user can add a bar to the query to get specific information about the distribution of terms in the chosen dataset. Additionally, this enables the option to search for the dataset, which is accomplished using a Google query of the dataset’s full title.<sup>1</sup> The user can continue to explore the dataset collection by adding terms to and removing terms from the query.</p>
        <fig id="fig-4">
          <caption><p>Figure 4: Results of Full Title Histogram with query: title=olympics, content=athens.</p></caption>
        </fig>
      </sec>
    </sec>
    <sec id="sec-conclusion">
      <title>6. Conclusion</title>
      <p>We have proposed cell-centric indexing as an innovative approach to information retrieval over tabular datasets. Such indices support richer queries about tables that do not require the user to know the pre-existing structure of each table. They also provide the potential for new exploratory interfaces, and we describe one that gives users summaries of a dataset repository in terms of titles, content, and column names. The user can filter on any of these facets to generate more specific summaries. Future work will test the effectiveness of this novel approach in facilitating dataset searches, especially amongst non-expert users.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>This material is based upon work supported by the National Science Foundation under Grant No. III-1816325. Lixuan Qiu and Drake Johnson contributed to early drafts of this paper. We thank Alex Johnson, dePaul Miller, Keith Register, and Xuewei Wang for contributions to the system implementation.</p>
    </sec>
    <sec id="sec-6">
      <p><sup>1</sup>Many of our dataset collections do not have a URL recorded, which is why we do not simply link to the dataset as a result.</p>
    </sec>
    <sec id="sec-refs">
      <title>References</title>
      <p>[1] A. Ianina, L. Golitsyn, K. Vorontsov, Multi-objective topic modeling for exploratory search in tech news, in: A. Filchenkov, L. Pivovarova, J. Žižka (Eds.), Artificial Intelligence and Natural Language, Communications in Computer and Information Science, vol. 789, Springer, 2017, pp. 181–193.</p>
      <p>[2] H. Borchart, Effects of content preview on query refinement in dataset search, Senior Project Report, Cognitive Science Program, Lehigh University, Bethlehem, PA, 2021.</p>
      <p>[3] L. Miller, Facilitating dataset search of non-expert users through heuristic and systematic information processing, Honors Thesis, Cognitive Science Program, Lehigh University, Bethlehem, PA, 2020.</p>
      <p>[4] D. Johnson, K. Register, B. D. Davison, J. Heflin, An exploratory interface for dataset repositories using cell-centric indexing, in: Proceedings of the 2020 IEEE International Conference on Big Data (IEEE BigData 2020), 2020, pp. 5716–5718. Poster paper.</p>
      <p>[5] L. Qiu, H. Jia, B. D. Davison, J. Heflin, An architecture for cell-centric indexing of datasets, in: Proceedings of PROFILES’20: 7th International Workshop on Dataset PROFILing and Search, 2020, pp. 82–96. Held with ISWC 2020.</p>
      <p>[6] A. Chapman, E. Simperl, L. Koesten, G. Konstantinidis, L.-D. Ibáñez, E. Kacprzak, P. Groth, Dataset search: a survey, The VLDB Journal 29 (2020) 251–272.</p>
      <p>[7] N. Noy, M. Burgess, D. Brickley, Google dataset search: Building a search engine for datasets in an open Web ecosystem, in: Proceedings of The Web Conference, 2019, pp. 1365–1375.</p>
      <p>[8] M. Derthick, J. Kolojejchick, S. F. Roth, An interactive visual query environment for exploring data, in: Proceedings of the 10th Annual ACM Symposium on User Interface Software and Technology, UIST ’97, Association for Computing Machinery, New York, NY, USA, 1997, pp. 189–198. doi:10.1145/263407.263545.</p>
      <p>[9] S. Yogev, H. Roitman, D. Carmel, N. Zwerdling, Towards expressive exploratory search over entity-relationship data, in: Proceedings of the 21st International Conference on World Wide Web, WWW ’12 Companion, Association for Computing Machinery, New York, NY, USA, 2012, pp. 83–92. doi:10.1145/2187980.2187990.</p>
      <p>[10] E. Koh, A. Kerne, R. Hill, Creativity support: Information discovery and exploratory search, in: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’07, Association for Computing Machinery, New York, NY, USA, 2007, pp. 895–896. doi:10.1145/1277741.1277963.</p>
      <p>[11] A. Bozzon, M. Brambilla, S. Ceri, P. Fraternali, Liquid query: Multi-domain exploratory search on the web, in: Proceedings of the 19th International Conference on World Wide Web, WWW ’10, Association for Computing Machinery, New York, NY, USA, 2010, pp. 161–170. doi:10.1145/1772690.1772708.</p>
      <p>[12] J. Fan, D. A. Keim, Y. Gao, H. Luo, Z. Li, JustClick: Personalized image recommendation via exploratory search from large-scale Flickr images, IEEE Transactions on Circuits and Systems for Video Technology 19 (2008) 273–288.</p>
      <p>[13] M. Dunaiski, G. J. Greene, B. Fischer, Exploratory search of academic publication and citation data using interactive tag cloud visualizations, Scientometrics 110 (2017) 1539–1571.</p>
      <p>[14] X. Zhang, D. Song, S. Priya, J. Heflin, Infrastructure for efficient exploration of large scale linked data via contextual tag clouds, in: International Semantic Web Conference, Springer, 2013, pp. 687–702.</p>
      <p>[15] M. Singh, M. J. Cafarella, H. V. Jagadish, DBExplorer: Exploratory search in databases, in: E. Pitoura, S. Maabout, G. Koutrika, A. Marian, L. Tanca, I. Manolescu, K. Stefanidis (Eds.), Proceedings of the 19th International Conference on Extending Database Technology, EDBT 2016, Bordeaux, France, March 15-16, 2016, OpenProceedings.org, 2016, pp. 89–100. doi:10.5441/002/edbt.2016.11.</p>
      <p>[16] R. W. White, R. A. Roth, Exploratory Search: Beyond the Query-Response Paradigm, Synthesis Lectures on Information Concepts, Retrieval, and Services, Morgan &amp; Claypool Publishers, 2009. doi:10.2200/S00174ED1V01Y200901ICR003.</p>
      <p>[17] S. Ferré, A. Hermann, Semantic search: Reconciling expressive querying and exploratory search, in: International Semantic Web Conference, Springer, 2011, pp. 177–192.</p>
      <p>[18] J. Peltonen, J. Strahl, P. Floréen, Negative relevance feedback for exploratory search with visual interactive intent modeling, in: Proceedings of the 22nd International Conference on Intelligent User Interfaces, IUI ’17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 149–159. doi:10.1145/3025171.3025222.</p>
      <p>[19] T. Ruotsalo, J. Peltonen, M. J. A. Eugster, D. Głowacka, P. Floréen, P. Myllymäki, G. Jacucci, S. Kaski, Interactive intent modeling for exploratory search, ACM Trans. Inf. Syst. 36 (2018). doi:10.1145/3231593.</p>
      <p>[20] C. Gormley, Z. Tong, Elasticsearch: The Definitive Guide: A Distributed Real-Time Search and Analytics Engine, O’Reilly Media, Inc., 2015.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>