1. Introduction

with Preferences

Hamed Mirzaei

mirzaei@cs.ualberta.ca 0 1 2 3 4

Davood Rafiei

drafiei@ualberta.ca 0 1 2 3 4 0 Bases (VLDBW'23) - TaDA'23: Tabular Data Analysis Workshop 1 Figure 1: Task 1: Average F1 2 Table Union Search, Preference Queries, Information Retrieval , Table Augmentation, Table Unionability, Column Unionability 3 University of Alberta , 2-32 Athabasca Hall, Edmonton, Alberta , Canada 4 Workshop Proce dings

We study the problem of Table Union Search (TUS) in the presence of preferences. The notion of unionability, as studied in the literature, is too coarse to be efective in down-stream tasks. We introduce preferences for table unionability, as a way to reduce the search space and focus on rows and columns that are important for the follow-up operations. We show how these preferences can be eficiently implemented and how they can improve the performance of some down-stream tasks. Joint Workshops at 49th International Conference on Very Large Data ∗Corresponding author. †These authors contributed equally.

Our work falls under table-based approaches

1. Introduction

The vast corpus of relational tables on the web is a valuable resource for various applications such as table augmentation, knowledge base population, and question answering. Table Union Search (TUS) is an operation that aims to find relational tables on the web, a.k.a. webtables, that can be unioned with a query table. Two tables are considered unionable if their column values are drawn from the same domains. However, in the context of the web, the domains of columns are not known or fixed and existing approaches often rely on measures such as value overlap to find columns with the same domains, a.k.a unionable columns. But using this generic notion of unionability in downstream tasks can be challenging.

Consider a query table as shown in Table 1, and suppose we want to expand vertically or horizontally by adding more rows or columns. TUS returns the same list of webtables regardless of the follow-up operations. Also TUS fails to identify and leverage complex relationships between tables and columns, often returning near-duplicate webtables.

To address these issues, we introduce preferences to TUS, allowing for more eficient and efective retrieval of unionable webtables based on user-defined criteria. Preferences help in tailoring the results to diferent follow-up operations and mitigating the limitations of TUS. By incorporating these preferences, we aim to improve the usefulness and relevance of the TUS output for various applications. The contributions of this research are as follows: (D. Rafiei)

Country( 1)

Canada

USA China Iran

City( 2) Edmonton New York Shanghai Tehran

Population( 3) • We introduce four major preferences to the TUS operation: skyline, novelty, diversity and dependent set. • We introduce two benchmark datasets for unionable webtables, constructed from Web Data Commons 2015 [1] and WikiTables datasets [2]. • We propose eficient approaches for evaluating each preference.

2. Related Works

Tabular data may be searched using table-based queries [3, 4, 5, 6] and keyword-based approaches [7, 8, 9, 10]. Table-based approaches assume that the search query is a table while keyword-based assume it is a set of focusing on enriching the results of TUS.

Preferences have long been studied in databases (e.g., see the general survey [11], the foundational work [12] and preferences in SQL [13]) but we are not aware of them being applied to TUS. Khatiwada et al. [14] investigate the semantics and relationships of columns to identify unionable webtables more accurately. This may seem similar to our work on dependent set preference (to be discussed next), but there are some subtle diferences. of value overlap, and other metrics such as semantic (, ) (, ) refer to such pair as a unionability pair, shown as ↔ .

Definition 3.1 (Alignment).

two tables ( 1, 2, … , ) and (

An alignment between

1, 2, … , ) is a bijective mapping between all columns of and and is shown as

using the correspondThe goodness score of align- dominates a data point represented by vector

Definition 3.5 (TUScore).

The table unionability score of webtable with respect to query table is the maximum of alignments in (, ) search as the task that returns the top- most unionable candidate webtables for a query table (Definition 3.6).

Definition 3.6 (Top-k TUS). Given a query table

1, 2, … , ) with degree , Top-k TUS is the task of ifnding the top- candidate webtables in the corpus with the highest unionability score to .

To find the top-k TUS candidates, the “Weak And” algorithm (WAND) is utilized [15]. WAND is a safe-ranking document-at-a-time technique that prunes many candidate webtables early in the process without fully examining them. The WAND algorithm works over inverted indexes and achieves its eficiency by using a threshold score to limit the number of documents that need to be examined, drastically reducing the search time. To adapt WAND to this context, we consider candidate webtables as documents and each of their columns as a term. The weight of each column would be its Jaccard score with the query column under consideration. We refer to this implementation as TUS. We also add some filters such as candidate webtables should cover a key of query table to TUS and refer to it as TUS+. ((, )) = where |(, )| |(, )| ( ((, ))) (2) is

for the probability distribution corresponding to the size of alignment (, )

4. Preferences

4.1. Skyline Skyline preference has been extensively studied in the literature [16, 17]. It is commonly defined over a corpus of data points, each represented as a numeric vector, where one looks for those data points which are not dominated by others. A data point represented by vector and shown as ≻ if it has as good as or better values over all dimensions and a better value over at least one dimension [17]. Our approach is to represent each alignment (, )

as a vector of size equal to the number of query columns with each element showing the unionability score of a pair of columns. Following this idea, each candidate webtable with multiple alignments can be represented as a set of numeric vectors. An observation is of align- that for one candidate webtable, some of the alignments are a subset of the others which we prune them early in the process. Finally, by utilizing existing algorithms on skyline such as SaLSa [18] or BBS [19], we can return the finding webtables with the most diverse values over a candidate webtables with the best alignment over each subset of query columns. Hence, we can use the same subset of query columns (Definition

4.1).

Definition 4.1 (Top-k Skyline). With each alignment

between a candidate webtable and a query table represented as a numeric vector (as discussed), the Top-k Skyline of is the top- candidate webtables with the most number of skyline vectors. 4.2. Diversity Diversity is a well-studied preference in the literature with the purpose of resolving ambiguity by returning diverse results [20, 21, 22]. The proposed approaches usually involve a scoring function [22] that balances the diversity of the results while promoting their relevance to the search query.

Definition 4.2 (Top-k Diversity).

The Top-k Diversity of query table over subset of query columns is the list of webtables from dataset

which maximizes the following objective function and are sorted based on their TUScore. (, , ) = ⊆ ,||= argmax ( − 1).(1 − ). Rel(, , ) + .

Dif (, ), where the parameters and ( − 1) in this formulation control the scores’ contribution and to bring them to the same scale respectively, || ∑ =1 (, , ) = ||

( , ), is defined as the harmonic mean of the and (, ) query table. diferences of tables in from each other and from the

As finding such a list is a computationally hard task [22], we utilize a greedy approach based on Yu et al.’s

algorithm [23] which starts with a list of webtables and iterates over the rest of webtables to see if swapping outsiders with insiders improves the diversity score. We modified the

algorithm to choose the initial list more carefully and iterate over webtables in a specific (4) (5) order. 4.3. Novelty Novelty is a well-known preference which has been studied in the literature with the purpose of avoiding redundancy in the returned result [24, 25, 26, 22]. In the context of our work, we want to avoid redundant webtables and return those ofering as much unique new values as possible over a subset of query columns. It can be considered as a special case of diversity preference which aims at framework discussed for the diversity preference with some modifications in the proposed scores. The objective function for novelty consists of two scores, and The first score, , is the same as the one we defined for concatenate candidate columns in the same order we did for ′ to get to a new candidate column ′. The last step is to compute the table unionability score over the new set of candidate and query columns and to return the top-k ones to the user.

5. Evaluations

To assess the efectiveness of the proposed preferences, we evaluate their performance in two down-stream tasks: i) expanding query table by adding new rows and ii) extending it by adding new columns. As a measure of performance, we utilize Average F1-Measure@k, calculated by averaging the f1-measure@k scores over query tables. 5.1. Datasets

Existing datasets [6] only provide a limited selection of unionable candidates, and they fail to adequately showcase the impact of preferences. We introduce two datasets for our evaluation: (1) Web Data Commons 2015 (WDC) dataset [1], and (2) WikiTables dataset [2]. We built our Figure 2: Task 2: Average F1 Measure@k for query extension dataset of query and candidate webtables from these base over both datasets. tables by randomly selecting rows and random projection of columns. Applying these operations on a base table is expected to generate new tables that are unionable with Novelty returns webtables with more new rows for the the base table by the definition of unionability [ 27]. query table. For > 5 , dependent sets also outperform WDC 2015. We selected 50 webtable with at least 9 TUS and TUS+ in terms of performance. Interestingly, columns and 900 records from the WDC 2015 [1] dataset. some of the webtables have both new rows and new We checked them manually and pruned those with less columns for the query table, which explains why diverthan 5 text columns or with non-English content and sity and dependent sets excel in this task. ended up with 14 webtables as our base tables. We sampled 101 tables from each base table, as discussed above, 5.3. Task 2: Query Table Extension and designated one table as our query table and the remaining 100 tables as candidate webtables. This gave us TUS may include webtables unsuitable for extending the a total of 14 query tables and 1400 candidate webtables. query table with new columns. When all columns of a WikiTables. We did a similar process as the one for the webtable is unionabile with at least one query column, WDC dataset. We first selected top 100 webtables with at no new columns may be added to the query table. Figure least 9 columns and 300 rows from the WikiTables dataset 2 shows that diversity and novelty outperform TUS and [2]. Then, we pruned those with non-English content. In TUS+. Diversity focuses on returning webtables with addition, since we wanted to guarantee diferent levels more new columns. Dependent sets also outperform of unionability among query and candidate columns, we TUS and TUS+ for > 5 . pruned those tables with many columns that had only ‘yes/no’ values in their content. Finally we ended up 6. Conclusions with 12 base tables which we used to generate query and candidate webtables from. We ended up with 12 query tables, one per each base table, and 1200 candidate webtables, 100 candidates per each base table. 5.2. Task 1: Query Table Expansion TUS struggles in expanding the query table with additional rows. Figure 1 shows that novelty and diversity outperform other approaches such as TUS and TUS+.

Existing approaches for finding unionable tables primarily focus on eficiently providing an approximate ranked list of candidate tables to the user, which may limit access to suitable tables for follow-up operations. In this paper, we explore the power of preferences, showing that users can retrieve tables that are not only unionable with the query table but are also suitable for various follow-up operations. A possible future direction is exploring more eficient approaches to support preferences with TUS. retrieval process, in: Proceedings of the twelfth international conference on Information and knowl[1] O. Lehmberg, D. Ritze, R. Meusel, C. Bizer, A large edge management, 2003, pp. 426–434. public corpus of web tables containing time and [16] C. Kalyvas, T. Tzouramanis, A survey of skyline context metadata, in: Proceedings of the 25th In- query processing, arXiv preprint arXiv:1704.01788 ternational Conference Companion on World Wide (2017).

Web, 2016, pp. 75–76. [17] J.-H. Choi, F. Hao, A. Nasridinov, Hi-sky: Hash [2] C. S. Bhagavatula, T. Noraset, D. Downey, Methods index-based skyline query processing, Applied Scifor exploring and mining tables on wikipedia, in: ences 10 (2020) 1708.

Proceedings of the ACM SIGKDD workshop on [18] I. Bartolini, P. Ciaccia, M. Patella, Salsa: Computinteractive data exploration and analytics, 2013, pp. ing the skyline without scanning the whole sky, in: 18–26. Proceedings of the 15th ACM international confer[3] S. Zhang, K. Balog, Recommending related tables, ence on Information and knowledge management, arXiv preprint arXiv:1907.03595 (2019). 2006, pp. 405–414. [4] Y. Zhang, Z. G. Ives, Finding related tables in data [19] D. Papadias, Y. Tao, G. Fu, B. Seeger, Progressive lakes for interactive data science, in: Proceedings of skyline computation in database systems, ACM the 2020 ACM SIGMOD International Conference Transactions on Database Systems (TODS) 30 (2005) on Management of Data, 2020, pp. 1951–1966. 41–82. [5] T. Cong, H. Jagadish, Pylon: Table union search [20] D. Rafiei, K. Bharat, A. Shukla, Diversifying web through contrastive representation learning, arXiv search results, in: Proceedings of the 19th interpreprint arXiv:2301.04901 (2023). national conference on World wide web, 2010, pp. [6] F. Nargesian, E. Zhu, K. Q. Pu, R. J. Miller, Table 781–790.

union search on open data, Proceedings of the [21] M. R. Vieira, H. L. Razente, M. C. Barioni, M. HadVLDB Endowment 11 (2018) 813–825. jieleftheriou, D. Srivastava, C. Traina, V. J. Tsotras, [7] R. Pimplikar, S. Sarawagi, Answering table queries On query result diversification, in: 2011 IEEE on the web using column keywords, arXiv preprint 27th International Conference on Data Engineering, arXiv:1207.0132 (2012). IEEE, 2011, pp. 1163–1174. [8] L. Deng, Table2Vec: Neural word and entity embed- [22] M. R. Vieira, H. L. Razente, M. C. Barioni, M. Haddings for table population and retrieval, Master’s jieleftheriou, D. Srivastava, C. Traina Jr, V. J. Tsothesis, University of Stavanger, Norway, 2018. tras, Divdb: A system for diversifying query re[9] Z. Chen, M. Trabelsi, J. Heflin, Y. Xu, B. D. Davison, sults, Proceedings of the VLDB Endowment 4 (2011) Table search using a deep contextualized language 1395–1398. model, in: Proceedings of the 43rd International [23] C. Yu, L. Lakshmanan, S. Amer-Yahia, It takes vaACM SIGIR Conference on Research and Develop- riety to make a world: diversification in recomment in Information Retrieval, 2020, pp. 589–598. mender systems, in: Proceedings of the 12th in[10] A. Bogatu, A. A. Fernandes, N. W. Paton, N. Kon- ternational conference on extending database techstantinou, Dataset discovery in data lakes, in: 2020 nology: Advances in database technology, 2009, pp. IEEE 36th International Conference on Data Engi- 368–378.

neering (ICDE), IEEE, 2020, pp. 709–720. [24] C. L. Clarke, M. Kolla, G. V. Cormack, O. Vechto[11] K. Stefanidis, G. Koutrika, E. Pitoura, A survey mova, A. Ashkan, S. Büttcher, I. MacKinnon, Novon representation, composition and application of elty and diversity in information retrieval evalpreferences in database systems, ACM Transactions uation, in: Proceedings of the 31st annual inon Database Systems (TODS) 36 (2011) 1–45. ternational ACM SIGIR conference on Research [12] P. Ciaccia, D. Martinenghi, R. Torlone, Foundations and development in information retrieval, 2008, pp. of context-aware preference propagation, Journal 659–666.

of the ACM (JACM) 67 (2020) 1–43. [25] P. Castells, N. Hurley, S. Vargas, Novelty and diver[13] W. Kießling, M. Endres, F. Wenzel, The preference sity in recommender systems, in: Recommender sql system-an overview, IEEE Data Engineering systems handbook, Springer, 2022, pp. 603–646.

Bulletin 34 (2011) 12–19. [26] T. Ghosal, T. Saikh, T. Biswas, A. Ekbal, P. Bhat[14] A. Khatiwada, G. Fan, R. Shraga, Z. Chen, W. Gat- tacharyya, Novelty detection: A perspective from terbauer, R. J. Miller, M. Riedewald, Santos: natural language processing, Computational LinRelationship-based semantic table union search, guistics 48 (2022) 77–117.

arXiv preprint arXiv:2209.13589 (2022). [27] P. Venetis, A. Y. Halevy, J. Madhavan, M. Pasca, [15] A. Z. Broder, D. Carmel, M. Herscovici, A. Sofer, W. Shen, F. Wu, G. Miao, Recovering semantics of J. Zien, Eficient query evaluation using a two-level tables on the web (2011).