=Paper=
{{Paper
|id=Vol-2950/paper-09
|storemode=property
|title=Exploring Datasets via Cell-Centric Indexing
|pdfUrl=https://ceur-ws.org/Vol-2950/paper-09.pdf
|volume=Vol-2950
|authors=Jeff Heflin,Brian D. Davison,Haiyan Jia
|dblpUrl=https://dblp.org/rec/conf/desires/Heflin0J21
}}
==Exploring Datasets via Cell-Centric Indexing==
Jeff Heflin¹, Brian D. Davison¹ and Haiyan Jia²

¹ Computer Science & Engineering, Lehigh University, 113 Research Dr., Bethlehem, PA, 18015, USA
² Journalism and Communication, Lehigh University, 33 Coppee Dr., Bethlehem, PA, 18015, USA

DESIRES 2021 – 2nd International Conference on Design of Experimental Search & Information REtrieval Systems, September 15–18, 2021, Padua, Italy

heflin@cse.lehigh.edu (J. Heflin); davison@cse.lehigh.edu (B. D. Davison); haiyan.jia@lehigh.edu (H. Jia)
http://www.cse.lehigh.edu/~heflin/ (J. Heflin); http://www.cse.lehigh.edu/~brian/ (B. D. Davison); https://journalism.cas.lehigh.edu/content/haiyan-jia (H. Jia)
ORCID: 0000-0002-7290-1495 (J. Heflin); 0000-0002-9326-3648 (B. D. Davison); 0000-0002-8388-7860 (H. Jia)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract

We present a novel approach to dataset search and exploration. Cell-centric indexing is a unique indexing strategy that enables a powerful, new interface. The strategy treats individual cells of a table as the indexed unit, and combining this with a number of structure-specific fields enables queries that cannot be answered by a traditional indexing approach. Our interface provides users with an overview of a dataset repository, and allows them to efficiently use various facets to explore the collection and identify datasets that match their interests.

Keywords: cell-centric indexing, dataset search, exploratory interface

1. Introduction

The twenty-first century has experienced an information explosion; data is growing exponentially and users' information retrieval needs are becoming much more complicated [1]. Given people's increasing interest in datasets, there is a need for user-friendly search services that help data journalists, scientists, decision makers, and the general public locate datasets that can meet their data needs.

Even though users, under many circumstances, are not experts in the domain in which they search, they should be able to easily use such an application; the query process should be responsive and efficient. The result should provide a general picture of what the dataset is about, and offer enough information for the searcher to know how likely the dataset is to contain the data they are looking for.

Traditional database management systems group data by tables and then organize this data into rows and columns. When users are aware of the database schema, they can construct queries, but what if users are simply trying to find which tables in a large data lake are relevant to their needs? One approach is to simply index information about the table in various fields: e.g., title, description, columns, etc. While this approach may be sufficient for some queries, in many cases the user will not be able to determine whether the table is a perfect match until they have downloaded the (potentially large) table.

Studies have shown that understanding the content of a dataset, rather than simply its description and metadata, can be critical for users' evaluation of whether any of the search results sufficiently matches their need, especially for non-expert users. For instance, a recent user study [2] revealed that query refinement, as a result of unsatisfying search results, is negatively associated with user experience with dataset search tools. What reduces the need for query refinement is a preview of the dataset content, which helps users gauge the relevance of the datasets. Similarly, an experimental study that explored novel dataset search engine prototypes found that interfaces with a content preview feature were perceived as more usable. In particular, non-expert users reported greater benefits from the content preview, rating the interfaces higher in usefulness, ease of use, usability, and technology adoption intention than expert users did [3]. These findings indicate a strong need for understanding the actual content of datasets, even at the cell level.

To enable sufficient query refinement for schema-optional queries, we present the novel concept of cell-centric indexing. The key idea is that we use individual cells of a table as the fundamental unit and build inverted indices on these cells. These indices provide different fields that index both the content of the cell and its context. For our purposes, the context includes other cell values in the same row, the name of the column (if available), and metadata about the containing dataset. This approach allows us to refine our search by row descriptors, column descriptors, or both at the same time. In essence, we free the data from how it is structured: schema information, when available, is merely one of the many ways to locate the data of interest. Thus, we take the view that, fundamentally, users are searching for specific data (i.e., particular cells or collections thereof), and the tables are merely artifacts of how the data is stored.
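The key idea just described, indexing each cell together with its row values, column name, and dataset label, can be sketched in a few lines. This is an illustrative, hypothetical helper: the function name, the sample table, and its values are invented for the example and are not taken from the authors' system.

```python
# Sketch: turn one table into per-cell index entries, each carrying the
# context described in the text (row values, column name, dataset title).
# The helper and sample table are hypothetical.

def cell_documents(title, headers, rows):
    docs = []
    for row in rows:
        row_context = " ".join(str(v) for v in row)  # one shared context per row
        for header, value in zip(headers, row):
            docs.append({
                "content": value,
                "title": title,
                "columnName": header,
                "rowContext": row_context,
            })
    return docs

docs = cell_documents(
    "Olympic Medalists",
    ["Country", "Year", "Medals"],
    [["Kenya", 2004, 7], ["France", 2004, 33]],
)
# Six cells yield six entries; e.g. the "Kenya" cell keeps its whole row
# as context, so a search for rows mentioning 2004 can still find it.
```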
We recognize that this approach also has downsides. In particular, an index of cells (and their contexts) will incur substantial storage overhead in comparison to an index of dataset metadata. Moreover, if the desired search result is one or more datasets, at run-time there will be additional processing to assemble the cell-specific results to enable retrieval and ranking at that level of granularity. However, our cell-centric approach gives us additional flexibility, and we believe that good system design, appropriate data structures, and efficient algorithms can ameliorate these costs.

This paper incorporates material previously presented in a poster [4] and a workshop paper [5]. The contributions of this paper are:

• We propose cell-centric indexing as an innovative approach to an information retrieval system. A cell-centric index enables a user to find data without having to know the pre-existing structure of each table;
• We describe the mechanisms of one implementation of a cell-centric dataset search engine, including the structure of our server and its methods for data storage and querying; and,
• We describe a novel prototype interface that leverages cell-centric indexing in order to give users summaries of a dataset repository in terms of titles, content, and column names. The user can filter on any of these facets to generate more specific summaries.

The rest of the paper is organized as follows: we first discuss related work, briefly describe the idea of cell-centric indexing and its advantages and disadvantages, introduce the structure of our server and the methodology involved in querying, and finally describe a prototype interface.

2. Related Work

Scholars have investigated exploratory search to help searchers succeed in an unfamiliar area by proposing novel information retrieval algorithms and systems; some propose innovative user interfaces, while others try to predict the user's information need and use the prediction to better facilitate the subsequent interaction. Chapman et al. [6] have reviewed different approaches to dataset search. Google's dataset search [7] is an example of a traditional approach to indexing web datasets: the system crawls the Web and indexes datasets that have metadata expressed in the schema.org (or a related) format. The only required properties are name and description.

Derthick et al. [8] describe a visual query language that dynamically links queries and visualizations. The system helps a user locate information in a multi-object database, illustrates the complicated relationships between attributes of multiple objects, and assists the user in clearly expressing their information retrieval needs in their queries. Similarly, Yogev et al. [9] demonstrate an exploratory search approach for entity-relationship data, combining an expressive query language, exploratory search, and entity-relationship graph navigation. Their work enables people with little to no query language expertise to query rich entity-relationship data.

In the domain of web search, Koh et al. [10] devise a user interface that supports creativity in education and research. The model allows users to send their query to their desired commercial search engine or social platform in iterations. As the system goes through each iteration, it combines the text and image results into a composition space. Addressing a similar problem, Bozzon et al. [11] design an interactive user interface that employs exploratory search and Yahoo! Query Language (YQL) to empower users to iteratively investigate results across multiple sources.

A tag cloud is a common and useful visualization of data that represents relative importance or frequency via size. Some researchers have adapted this idea to visualize query results. Fan et al. [12] focus on designing an interactive user interface with image clouds. The interface enables users to comprehend their latent query intentions and direct the system to form personalized image recommendations. Dunaiski et al. [13] design and evaluate a search engine that incorporates exploratory search to ease researchers' scouting for academic publications and citation data. Its user interface unites concept lattices and tag clouds to present the query result in a readable composition that promotes further exploratory search. Zhang et al., on the other hand, focus their work on knowledge graph data [14]. They combine faceted browsing with contextual tag clouds to create a system that allows users to rapidly explore graphs with billions of edges by visualizing conditional dependencies between selected classes and other data. Although they don't use tag clouds, Singh et al. [15] also display conditional dependencies in their data outline tool. For a given pivot attribute and a set of automatically determined compare attributes, they show conditional values, grouped into clusters of interaction units.

Other scholars have investigated query languages and models. Ianina et al. [1] concentrate on developing an exploratory search system that lets the user issue long text queries while minimizing the risk of returning empty results, since the iterative "query–browse–refine" process [16] may be time-consuming and require expertise. Meanwhile, Ferré and Hermann [17] focus more on the query language, LISQL, and offer a search system that integrates LISQL and faceted search. The system helps users build complex queries and keeps them aware of their position in the data navigation process.

Yet another approach is to predict the user's search intent so that better search results can be presented. Peltonen et al. [18] utilize negative relevance feedback in an interactive intent model to direct the search. The model predicts the most relevant keywords, which are then arranged in a radar graph, with the center denoting the user, to represent the user's intent. Likewise, Ruotsalo et al. [19] propose a similar intent radar model that predicts a user's next query in an interactive loop. The model uses reinforcement learning to control the exploration and exploitation of the results.

3. Cell-Centric Indexing

We define a table as T = ⟨l, H, V⟩, where l is the label of the table, H = ⟨h₁, h₂, ..., hₙ⟩ is a list of the column headers, and V is an m×n matrix of the values contained in the table. V_{i,j} refers to the value in the i-th row and the j-th column, which has the heading h_j. We note that this model can easily be extended to include other metadata, as appropriate.

A naïve approach to indexing a collection of datasets would be to simply treat each table as a document, with separate fields for the label, the column headings, and (possibly) the values. When terms are used consistently and the user is familiar with the terminology, this may work well. However, this approach has several weaknesses:

• Any query on values has lost the context of which column the value appears in and what identifying information might be present elsewhere in the same row. For example, a table that contains capitals like (Paris, France) and (Austin, Texas) is unlikely to be relevant to a query about "Paris Texas" but would otherwise match.
• It is difficult to determine which new terms can be used to refine the query. Users would need to download some of the datasets and choose distinctive terms from the most relevant ones.
• A user's constraint could be represented in different tables in very different ways. If the user is looking for "California Housing Prices", there may be a table with some variant of that name, there may be a "Real Estate Prices" table with rows specific to California, or there may be a "Housing Prices" table that has a column for each state, including California. A user should be able to explore the collection to see how the data is organized and what terminology is used.

We have proposed cell-centric indexing as a novel way to address the problems above. Rather than treating the table as the indexed object, each datum (cell in the table) is an indexed object. In its simplest form, we propose four fields: content, the value of the cell; title, the label of the dataset the cell appears in; columnName, the header of the column the cell appears in; and rowContext, the values of all cells in the same row as the indexed cell. Formally, a cell value V_{i,j} from table T = ⟨l, H, V⟩ can be indexed with: content = V_{i,j}, title = l, columnName = h_j, and rowContext = ⋃_{k=1..n} V_{i,k}. This index allows users to find all cells that have a column header token in common regardless of dataset, or all cells that appear in the same row as some identifying token, or to look for the occurrence of specific values in specific columns.

However, in this form, users still need to know which keywords to use and which fields to use them in. A cell-centric index alone is not helpful to a user who is not already familiar with the collection of datasets. In order to support the user in exploring the data, we propose the abstraction of conditional frequency vectors (CFVs). Let I be a set of items, D be a set of descriptors (e.g., tags that describe the items), and F ⊆ I × D be a set of item–descriptor pairs ⟨x_i, d_i⟩. Let Q be a query, where Q(F) ⊆ F represents the pairs for only those items that match Q. Then a CFV for Q and F is a set of descriptor–frequency pairs where the frequency is the number of times the corresponding descriptor occurs within Q(F): {⟨d, f⟩ | f = #{⟨x, d⟩ | ⟨x, d⟩ ∈ Q(F)}}. For cell-centric indexing, the items I are the set of all cells regardless of source dataset, and F_i pairs cells with terms from the i-th field. For example, if a cell c₅ is in a column titled "Real Estate Price", then F_columnName includes the pairs ⟨c₅, real⟩, ⟨c₅, estate⟩, and ⟨c₅, price⟩. Typically, we sort the CFV in order of descending frequency.

4. System Architecture

The architecture of the system is depicted in Figure 1. At the core of our system is an Elasticsearch server. Elasticsearch [20] is a scalable, distributed search engine that also supports complex analytics. Our system has two main functions: 1) parse collections of datasets, map them into the fields of a cell-centric index, and send indexing requests to Elasticsearch; and 2) given a user query, issue a series of queries to Elasticsearch and construct histograms (CFVs) for each field.
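The CFV definition from Section 3 is small enough to state directly in code. The following sketch computes descriptor–frequency pairs for a query; the helper name and the sample field data are hypothetical, not the authors' implementation.

```python
from collections import Counter

# Sketch: compute a conditional frequency vector (CFV) for a query over
# item-descriptor pairs F, following the definition in Section 3.
# Helper name and sample data are hypothetical.

def cfv(pairs, matching_items):
    """pairs: iterable of (item, descriptor); matching_items: items matched by Q."""
    counts = Counter(d for x, d in pairs if x in matching_items)
    # Sort descriptors by descending frequency, as the paper does.
    return sorted(counts.items(), key=lambda df: -df[1])

# columnName-field pairs for three cells; suppose the query Q matches c1 and c2.
F_columnName = [("c1", "real"), ("c1", "estate"), ("c1", "price"),
                ("c2", "price"), ("c3", "rank")]
print(cfv(F_columnName, {"c1", "c2"}))  # [('price', 2), ('real', 1), ('estate', 1)]
```

The real system obtains these counts from Elasticsearch term aggregations rather than by scanning pairs in memory, but the resulting structure is the same.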
The Query Processor translates our high-level query API into specific Elasticsearch queries, and assembles the results into CFVs.

Figure 1: High-level system architecture

4.1. Index Definition

In Elasticsearch, a mapping defines how a document will be indexed: what fields will be used and how they will be processed. In cell-centric indexing the cell is the document, and our index must have fields that describe cells. Our mapping is summarized in Table 1. In addition to the four fields mentioned in Section 3, we have fields for the fullTitle (used to identify which specific datasets match the query) and metadata such as tags, notes, organization, and setId. The setId allows us to distinguish between different datasets with the same title, and to get an accurate count of how many datasets match a query. Note that content is divided into two fields, content and contentNumeric, for reasons described below. For each field, Table 1 gives its type and, if applicable, the analyzer used to process text from the field.

We use three types of fields: text, keyword, and double. Text fields are tokenized and processed by word analyzers, whereas keyword fields are indexed as is (without tokenization or further processing). Double fields store 64-bit floating point numbers. Most of our fields are text fields, but contentNumeric is a double field, which allows it to store both integer and real numeric values. Both fullTitle and setId are keyword fields, since we want users to be able to view the complete name of the dataset in the result, and there is no need to parse setIds.

All text fields require an analyzer, which determines how to tokenize the field and whether any additional processing is required. We use two built-in Elasticsearch analyzers. The stop analyzer divides text at all non-letter characters and removes 33 stop words (such as "a", "the", "to", etc.). For most fields, we use the stop analyzer, but we use the wordDelimiter analyzer for the columnName field. In addition to dividing text at all non-letter characters, it also divides text at letter case transitions (e.g., "birthDate" is tokenized to "birth" and "date"). This analyzer does not remove stop words.

Table 1: Elasticsearch mappings used to implement cell-centric indexing

Field          | Type    | Analyzer
columnName     | text    | wordDelimiter
content        | text    | stop
contentNumeric | double  | N/A
rowContext     | text    | stop
title          | text    | stop
fullTitle      | keyword | N/A
tags           | text    | stop
notes          | text    | stop
organization   | text    | stop
setId          | keyword | N/A

4.2. Indexing a Dataset

The system loads each dataset using the following process:

1. Read the metadata, which can include title, tags, notes, and organization. If the original table is formatted as CSV, this data might be contained in a separate file in the same directory, or in a row of a repository index file. If the table is formatted using JSON, the metadata may be specified along with the content, and many datasets may be described in a single file.
2. Read the column headings ⟨h₁, h₂, ..., hₙ⟩.
3. For each row in the dataset:
   a) Read the row values ⟨v₁, v₂, ..., vₙ⟩.
   b) Create rowContext by concatenating the values in the row. Note that, to avoid creating a different large context string for each value in the row, we create a single rowContext; this means that each value is also part of its own row context. This decision helps make the system more efficient. An additional efficiency consideration is that each value included in rowContext is truncated to its first 100 characters.
   c) Build an index request for each cell value v_i. If the content is numeric (integer or real), it is indexed in the contentNumeric field; otherwise it is indexed in the content field. The columnName field is indexed with the corresponding header h_i. The title is indexed twice: once as a tokenized field that can be used in queries, and again as a keyword field that preserves the order of the title and can be used to precisely identify the dataset the cell originated from. All other metadata fields are indexed in a straightforward way.
   d) For efficiency, index requests are batched and sent to the server in bulk. Synchronization is disabled in Elasticsearch during bulk loading to avoid excessive delays.

4.3. Querying the Index

Our Query Processor takes a conjunctive, fielded query and returns a histogram for each response field. The response fields contain information that helps the user understand the characteristics of cells that match the query. Currently, the response fields are title, columnName, content, rowContext, and fullTitle. Given a query q, the query process is:

1. Issue query q requesting term aggregations for title, columnName, content, rowContext, and fullTitle. Term aggregations are a feature of Elasticsearch that returns a list of terms appearing in the selected documents, along with their frequencies, i.e., CFVs for q.
2. Calculate the minimum a and maximum b of the matching contentNumeric data.
3. Select a representative set N of matching numeric content by issuing a percentile query against the contentNumeric field that excludes the top and bottom 5 percent of the data.
4. Calculate the mean μ and standard deviation σ of the set N.
5. If a < μ − 1.5σ and b > μ + 1.5σ (i.e., the numeric data is not particularly skewed), build an aggregation query for contentNumeric data using ranges calculated from μ and σ: the lowest range is [a, μ − 1.5σ] and the highest is [μ + 1.5σ, b], with three intermediate ranges of uniform size, the middle one being [μ − 0.5σ, μ + 0.5σ]. If the data are skewed, the ranges are shifted appropriately.
6. Issue an Elasticsearch histogram aggregation query with the calculated ranges. Treat each range as a content term, and insert these terms and their frequencies into the content CFV.
7. Return CFVs for each response field.

Much of the processing above allows the system to dynamically determine buckets for numeric content that provide a useful picture of its distribution. Unlike textual terms, numeric terms exhibit great variability, and histograms built from distinct numeric strings are unlikely to have significant value. For example, "135", "135.0", and "1.35E+2" are all equivalent, while many users might consider "135.0001" to be close enough. To address this, we create ranges over numeric values. Our approach computes the mean and standard deviation over the middle 90% of the data, thus removing the influence of outliers, and then specifies buckets with a width of one standard deviation, with one bucket centered over the mean. Once the histogram of numeric ranges is created, its data is merged with the content histogram to produce a single histogram that shows frequencies of textual terms and of numbers within ranges that depend on both the dataset and the query.

5. Prototype User Interface

Figure 2: Initial pre-query histograms
Figure 3: Search results with query: title=olympics

An example of a typical use case is demonstrated in Figures 2–4. In this case, the user wants to find a dataset containing data on Kenya's performance in the 2004 Athens Olympics. Initially, the user is presented with the graphs in Fig. 2. These histograms show the most frequent title and column terms in the collection of indexed datasets. However, the example histograms do not initially show anything regarding the Olympics. By using the "More Items" button at the bottom of the title histogram, the user can find the term Olympics and add it directly to their query. After this term is added, the screen changes to that shown in Fig. 3. The user can now look through all four histograms and decide which term best helps them get to their desired data. Once again using the "More Items" button, the term Athens can be found in the content histogram. Once this is added, the user might turn to the full title histogram shown in Fig. 4. There the user can find a dataset titled "Kenya at the Olympics Medalists". To gain access to the dataset, the user must add the full title to their query.
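This click-to-refine loop can be mimicked with a small in-memory sketch: each click adds a (field, term) pair to a conjunctive query, and the per-field histograms (CFVs) are recomputed over the matching cells. Everything below, the helper name and the sample cells, is hypothetical and stands in for the Elasticsearch-backed implementation.

```python
from collections import Counter

# Sketch: conjunctive fielded refinement over in-memory cell entries,
# a stand-in for the Elasticsearch queries; helper and data are hypothetical.

def refine(cells, query):
    """Return per-field histograms for cells matching every (field, term) pair."""
    matches = [c for c in cells
               if all(term in c[field].lower().split() for field, term in query)]
    fields = ["title", "columnName", "content"]
    return {f: Counter(t for c in matches for t in c[f].lower().split())
            for f in fields}

cells = [
    {"title": "2004 summer olympics", "columnName": "country", "content": "kenya"},
    {"title": "2004 summer olympics", "columnName": "rank", "content": "7"},
    {"title": "world capitals", "columnName": "country", "content": "france"},
]
h = refine(cells, [("title", "olympics")])                          # two cells match
h2 = refine(cells, [("title", "olympics"), ("content", "kenya")])   # narrowed to one
```

Each click narrows the match set and shrinks the histograms, which is exactly the behavior the interface exposes through its clickable bars.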
Once a full title is in the query, a button appears that performs a Google query of the full title. Since this specific dataset is from WikiTables, the Google query will provide a link to the Wikipedia page containing the table. We now discuss specific interface components in more detail.

5.1. Pre-query Histograms

Before any search parameters are set, the user is shown two pre-query histograms that present up to the 50 most frequent title and column name tokens within the current repository (see Fig. 2). The column name and title histograms provide a good overview and are vital in allowing the user to explore the datasets without prior knowledge of their contents. The pre-query histograms are presented whenever there are no active queries, such as when the page is initially loaded or when all query terms have been deleted. Clicking on a histogram bar automatically adds the corresponding term to the query and generates the standard set of histograms.

5.2. Results Histograms

The standard screen displays the user's current query and five histograms. Each histogram is associated with a field, and tokens are sorted in descending frequency of co-occurrence with the query. The length of a bar indicates how many cells match the query. As with the pre-query histograms, clicking on a bar adds the associated term to the query and generates a new set of result histograms. Below each histogram is an option to show more results: initially, each histogram presents the top 10 results; however, the top 25 results are pre-fetched, which allows newly requested results to be added to the histogram automatically.

Because bar length reflects the count of matched cells, the first bar may be significantly larger than all remaining bars, making them difficult to see or select. To combat this, we compare the counts of the two most frequent results: if the first result contains 10 times more hits than the second, we change the scale of the histogram to logarithmic, making it easier to visualize distinctions in skewed distributions.

Figure 3 shows the response of our prototype interface to the query title="olympics". It displays a CFV for each field as a histogram; the longer the red bar, the more frequently that term co-occurs with the query. As we can see, 318 datasets contain matches, and after "olympics," the most common title word is "summer." The most frequently occurring terms in the column names of matching cells are "RANK" and "attempts". The content histogram combines terms with numeric ranges. In particular, the first, second, and fifth rows were all inserted by the numeric range processing (as described in Sect. 4.3). For this query, there are many cells with values between 0 and 4, and slightly fewer with values between 4 and 21. The next most common content values are the terms "olympics" and "summer." Note that the figure does not show the histogram for full titles that corresponds to this query (though it is still part of the prototype interface). As discussed in the next paragraph, this histogram indicates how many matching cells are in each dataset.

The user can refine their query and create new histograms by clicking on any term in the results. For example, if the user clicks on "athens" in the content histogram (after scrolling down), the system will display a new set of histograms summarizing the datasets that have "athens" in the content field and "olympics" in the title; in other words, the query will be title="olympics" and content="athens". For this refinement, we show the Full Title histogram (see Fig. 4). In this histogram, the bars represent the number of cells in a dataset that match the user's query. The user can add a bar to the query to get specific information about the distribution of terms in the chosen dataset. Additionally, this enables the option to search for the dataset, which is accomplished using a Google query of the dataset's full title.¹ The user can continue to explore the dataset collection by adding terms to and removing terms from the query.

Figure 4: Results of the Full Title histogram with query: title=olympics, content=athens

6. Conclusion

We have proposed cell-centric indexing as an innovative approach to information retrieval over tabular datasets. Such indices support richer queries about tables that do not require the user to know the pre-existing structure of each table. They also provide the potential for new exploratory interfaces, and we describe one that gives users summaries of a dataset repository in terms of titles, content, and column names. The user can filter on any of these facets to generate more specific summaries. Future work will test the effectiveness of this novel approach in facilitating dataset search, especially amongst non-expert users.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. III-1816325. Lixuan Qiu and Drake Johnson contributed to early drafts of this paper. We thank Alex Johnson, dePaul Miller, Keith Register, and Xuewei Wang for contributions to the system implementation.

¹ Many of our dataset collections do not have a URL recorded, which is why we do not simply link to the dataset as a result.

References

[1] A. Ianina, L. Golitsyn, K. Vorontsov, Multi-objective topic modeling for exploratory search in tech news, in: A. Filchenkov, L. Pivovarova, J. Žižka (Eds.), Artificial Intelligence and Natural Language, Springer, 2017, pp. 181–193. Communications in Computer and Information Science, vol. 789.
[2] H. Borchart, Effects of content preview on query refinement in dataset search, Senior Project Report, Cognitive Science Program, Lehigh University, Bethlehem, PA, 2021.
[3] L. Miller, Facilitating dataset search of non-expert users through heuristic and systematic information processing, Honors Thesis, Cognitive Science Program, Lehigh University, Bethlehem, PA, 2020.
[4] D. Johnson, K. Register, B. D. Davison, J. Heflin, An exploratory interface for dataset repositories using cell-centric indexing, in: Proceedings of the 2020 IEEE International Conference on Big Data (IEEE BigData 2020), 2020, pp. 5716–5718. Poster paper.
[5] L. Qiu, H. Jia, B. D. Davison, J. Heflin, An architecture for cell-centric indexing of datasets, in: Proceedings of PROFILES'20: 7th International Workshop on Dataset PROFILing and Search, 2020, pp. 82–96. Held with ISWC 2020.
[6] A. Chapman, E. Simperl, L. Koesten, G. Konstantinidis, L.-D. Ibáñez, E. Kacprzak, P. Groth, Dataset search: a survey, The VLDB Journal 29 (2020) 251–272.
[7] N. Noy, M. Burgess, D. Brickley, Google Dataset Search: Building a search engine for datasets in an open Web ecosystem, in: Proceedings of The Web Conference, 2019, pp. 1365–1375.
[8] M. Derthick, J. Kolojejchick, S. F. Roth, An interactive visual query environment for exploring data, in: Proceedings of the 10th Annual ACM Symposium on User Interface Software and Technology, UIST '97, Association for Computing Machinery, New York, NY, USA, 1997, pp. 189–198. doi:10.1145/263407.263545.
[9] S. Yogev, H. Roitman, D. Carmel, N. Zwerdling, Towards expressive exploratory search over entity-relationship data, in: Proceedings of the 21st International Conference on World Wide Web, WWW '12 Companion, Association for Computing Machinery, New York, NY, USA, 2012, pp. 83–92. doi:10.1145/2187980.2187990.
[10] E. Koh, A. Kerne, R. Hill, Creativity support: Information discovery and exploratory search, in: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '07, Association for Computing Machinery, New York, NY, USA, 2007, pp. 895–896. doi:10.1145/1277741.1277963.
[11] A. Bozzon, M. Brambilla, S. Ceri, P. Fraternali, Liquid query: Multi-domain exploratory search on the web, in: Proceedings of the 19th International Conference on World Wide Web, WWW '10, Association for Computing Machinery, New York, NY, USA, 2010, pp. 161–170. doi:10.1145/1772690.1772708.
[12] J. Fan, D. A. Keim, Y. Gao, H. Luo, Z. Li, JustClick: Personalized image recommendation via exploratory search from large-scale Flickr images, IEEE Transactions on Circuits and Systems for Video Technology 19 (2008) 273–288.
[13] M. Dunaiski, G. J. Greene, B. Fischer, Exploratory search of academic publication and citation data using interactive tag cloud visualizations, Scientometrics 110 (2017) 1539–1571.
[14] X. Zhang, D. Song, S. Priya, J. Heflin, Infrastructure for efficient exploration of large scale linked data via contextual tag clouds, in: International Semantic Web Conference, Springer, 2013, pp. 687–702.
[15] M. Singh, M. J. Cafarella, H. V. Jagadish, DBExplorer: Exploratory search in databases, in: E. Pitoura, S. Maabout, G. Koutrika, A. Marian, L. Tanca, I. Manolescu, K. Stefanidis (Eds.), Proceedings of the 19th International Conference on Extending Database Technology, EDBT 2016, Bordeaux, France, March 15–16, 2016, OpenProceedings.org, 2016, pp. 89–100. doi:10.5441/002/edbt.2016.11.
[16] R. W. White, R. A. Roth, Exploratory Search: Beyond the Query-Response Paradigm, Synthesis Lectures on Information Concepts, Retrieval, and Services, Morgan & Claypool Publishers, 2009. doi:10.2200/S00174ED1V01Y200901ICR003.
[17] S. Ferré, A. Hermann, Semantic search: Reconciling expressive querying and exploratory search, in: International Semantic Web Conference, Springer, 2011, pp. 177–192.
[18] J. Peltonen, J. Strahl, P. Floréen, Negative relevance feedback for exploratory search with visual interactive intent modeling, in: Proceedings of the 22nd International Conference on Intelligent User Interfaces, IUI '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 149–159. doi:10.1145/3025171.3025222.
[19] T. Ruotsalo, J. Peltonen, M. J. A. Eugster, D. Głowacka, P. Floréen, P. Myllymäki, G. Jacucci, S. Kaski, Interactive intent modeling for exploratory search, ACM Trans. Inf. Syst. 36 (2018). doi:10.1145/3231593.
[20] C. Gormley, Z. Tong, Elasticsearch: The Definitive Guide, O'Reilly Media, Inc., 2015.