<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>with Preferences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hamed Mirzaei</string-name>
          <email>mirzaei@cs.ualberta.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davood Rafiei</string-name>
          <email>drafiei@ualberta.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bases (VLDBW'23) - TaDA'23: Tabular Data Analysis Workshop</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Figure 1: Task 1: Average F1</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Table Union Search, Preference Queries, Information Retrieval</institution>
          ,
          <addr-line>Table Augmentation, Table Unionability, Column Unionability</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Alberta</institution>
          ,
          <addr-line>2-32 Athabasca Hall, Edmonton, Alberta</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We study the problem of Table Union Search (TUS) in the presence of preferences. The notion of unionability, as studied in the literature, is too coarse to be efective in down-stream tasks. We introduce preferences for table unionability, as a way to reduce the search space and focus on rows and columns that are important for the follow-up operations. We show how these preferences can be eficiently implemented and how they can improve the performance of some down-stream tasks. Joint Workshops at 49th International Conference on Very Large Data ∗Corresponding author. †These authors contributed equally.</p>
      </abstract>
      <kwd-group>
        <kwd />
        <kwd>Our work falls under table-based approaches</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The vast corpus of relational tables on the web is a
valuable resource for various applications such as table
augmentation, knowledge base population, and question
answering. Table Union Search (TUS) is an operation that
aims to find relational tables on the web, a.k.a.
webtables, that can be unioned with a query table. Two tables
are considered unionable if their column values are drawn
from the same domains. However, in the context of the
web, the domains of columns are not known or fixed
and existing approaches often rely on measures such as
value overlap to find columns with the same domains,
a.k.a unionable columns. But using this generic notion
of unionability in downstream tasks can be challenging.</p>
      <p>Consider a query table  as shown in Table 1, and
suppose we want to expand  vertically or horizontally
by adding more rows or columns. TUS returns the same
list of webtables regardless of the follow-up operations.
Also TUS fails to identify and leverage complex
relationships between tables and columns, often returning
near-duplicate webtables.</p>
      <p>To address these issues, we introduce preferences to
TUS, allowing for more eficient and efective retrieval of
unionable webtables based on user-defined criteria.
Preferences help in tailoring the results to diferent follow-up
operations and mitigating the limitations of TUS. By
incorporating these preferences, we aim to improve the
usefulness and relevance of the TUS output for various
applications. The contributions of this research are as
follows:
(D. Rafiei)</p>
      <p>Country( 1)</p>
      <p>Canada</p>
      <p>USA
China
Iran</p>
      <p>City( 2)
Edmonton
New York
Shanghai
Tehran</p>
      <p>Population( 3)
• We introduce four major preferences to the TUS
operation: skyline, novelty, diversity and
dependent set.
• We introduce two benchmark datasets for
unionable webtables, constructed from Web Data
Commons 2015 [1] and WikiTables datasets [2].
• We propose eficient approaches for evaluating
each preference.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related</title>
    </sec>
    <sec id="sec-3">
      <title>Works</title>
      <p>Tabular data may be searched using table-based
queries [3, 4, 5, 6] and keyword-based approaches [7,
8, 9, 10]. Table-based approaches assume that the search
query is a table while keyword-based assume it is a set of
focusing on enriching the results of TUS.</p>
      <p>Preferences have long been studied in databases (e.g.,
see the general survey [11], the foundational work [12]
and preferences in SQL [13]) but we are not aware of them
being applied to TUS. Khatiwada et al. [14] investigate
the semantics and relationships of columns to identify
unionable webtables more accurately. This may seem
similar to our work on dependent set preference (to be
discussed next), but there are some subtle diferences.
of value overlap, and other metrics such as semantic
(, )
(, )
refer to such pair as a unionability pair, shown as   ↔   .</p>
      <sec id="sec-3-1">
        <title>Definition 3.1 (Alignment).</title>
        <p>two tables (
1,  2, … ,   ) and (</p>
        <sec id="sec-3-1-1">
          <title>An alignment between</title>
          <p>1,  2, … ,   ) is a bijective
mapping between all columns of  and  and is shown as</p>
          <p>using the
correspondThe goodness score of align-   dominates a data point   represented by vector</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Definition 3.5 (TUScore).</title>
        <p>The table unionability score
of webtable  with respect to query table  is the maximum
 
of alignments in (, )
search as the task that returns the top- most unionable
candidate webtables for a query table  (Definition
3.6).</p>
      </sec>
      <sec id="sec-3-3">
        <title>Definition 3.6 (Top-k TUS).</title>
        <sec id="sec-3-3-1">
          <title>Given a query table</title>
          <p>1,  2, … ,   ) with degree  , Top-k TUS is the task of
ifnding the top-  candidate webtables in the corpus with
the highest unionability score to  .</p>
          <p>To find the top-k TUS candidates, the “Weak And”
algorithm (WAND) is utilized [15]. WAND is a safe-ranking
document-at-a-time technique that prunes many
candidate webtables early in the process without fully
examining them. The WAND algorithm works over inverted
indexes and achieves its eficiency by using a threshold
score to limit the number of documents that need to be
examined, drastically reducing the search time. To adapt
WAND to this context, we consider candidate webtables
as documents and each of their columns as a term. The
weight of each column would be its Jaccard score with
the query column under consideration. We refer to this
implementation as TUS. We also add some filters such as
candidate webtables should cover a key of query table to
TUS and refer to it as TUS+.
 ((, )) = 
where 
|(, )|
|(, )|
(  ((, )))
(2)
is</p>
          <p>for the probability distribution
corresponding to the size of alignment (, )</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Preferences</title>
      <p>4.1. Skyline
Skyline preference has been extensively studied in the
literature [16, 17]. It is commonly defined over a
corpus of data points, each represented as a numeric vector,
where one looks for those data points which are not
dominated by others. A data point   represented by vector
and shown as   ≻   if it has as good as or better values
over all dimensions and a better value over at least one
dimension [17]. Our approach is to represent each
alignment (, )</p>
      <p>as a vector   of size equal to the number of
query columns with each element showing the
unionability score of a pair of columns. Following this idea, each
candidate webtable with multiple alignments can be
represented as a set of numeric vectors. An observation is
of align- that for one candidate webtable, some of the alignments
are a subset of the others which we prune them early in
the process. Finally, by utilizing existing algorithms on
skyline such as SaLSa [18] or BBS [19], we can return the finding webtables with the most diverse values over a
candidate webtables with the best alignment over each
subset of query columns. Hence, we can use the same
subset of query columns (Definition</p>
      <p>4.1).</p>
      <sec id="sec-4-1">
        <title>Definition 4.1 (Top-k Skyline).</title>
        <sec id="sec-4-1-1">
          <title>With each alignment</title>
          <p>between a candidate webtable  and a query table 
represented as a numeric vector (as discussed), the Top-k Skyline
of  is the top- candidate webtables with the most number
of skyline vectors.
4.2. Diversity
Diversity is a well-studied preference in the literature
with the purpose of resolving ambiguity by returning
diverse results [20, 21, 22]. The proposed approaches
usually involve a scoring function [22] that balances the
diversity of the results while promoting their relevance
to the search query.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Definition 4.2 (Top-k Diversity).</title>
        <p>The Top-k Diversity
of query table  over subset  of query columns is the list 
of webtables from dataset</p>
        <p>which maximizes the following
objective function and are sorted based on their TUScore.
   (, , ) =
⊆ ,||=
argmax ( − 1).(1 − ). Rel(, , ) + .</p>
        <p>Dif (, ),
where the parameters  and ( − 1) in this formulation
control the scores’ contribution and to bring them to the
same scale respectively,
||
∑    
=1
(, , ) =
||</p>
        <p>(  , ),
is defined as the harmonic mean of the
and  (, )
query table.
diferences of tables in  from each other and from the</p>
        <p>As finding such a list  is a computationally hard task
[22], we utilize a greedy approach based on Yu et al.’s</p>
        <p>algorithm [23] which starts with a list of 
webtables and iterates over the rest of webtables to see if
swapping outsiders with insiders improves the diversity score.
We modified the</p>
        <p>algorithm to choose the initial list
more carefully and iterate over webtables in a specific
(4)
(5)
order.
4.3. Novelty
Novelty is a well-known preference which has been
studied in the literature with the purpose of avoiding
redundancy in the returned result [24, 25, 26, 22]. In the context
of our work, we want to avoid redundant webtables and
return those ofering as much unique new values as
possible over a subset of query columns. It can be considered
as a special case of diversity preference which aims at
framework discussed for the diversity preference with
some modifications in the proposed scores. The objective
function for novelty consists of two scores, 
and  
The first score,  , is the same as the one we defined for
concatenate candidate columns in the same order we did
for  ′ to get to a new candidate column  ′. The last step
is to compute the table unionability score over the new
set of candidate and query columns and to return the
top-k ones to the user.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluations</title>
      <p>To assess the efectiveness of the proposed preferences,
we evaluate their performance in two down-stream tasks:
i) expanding query table by adding new rows and ii)
extending it by adding new columns. As a measure of
performance, we utilize Average F1-Measure@k,
calculated by averaging the f1-measure@k scores over query
tables.
5.1. Datasets</p>
      <p>Existing datasets [6] only provide a limited selection of
unionable candidates, and they fail to adequately
showcase the impact of preferences. We introduce two datasets
for our evaluation: (1) Web Data Commons 2015 (WDC)
dataset [1], and (2) WikiTables dataset [2]. We built our Figure 2: Task 2: Average F1 Measure@k for query extension
dataset of query and candidate webtables from these base over both datasets.
tables by randomly selecting rows and random projection
of columns. Applying these operations on a base table is
expected to generate new tables that are unionable with Novelty returns webtables with more new rows for the
the base table by the definition of unionability [ 27]. query table. For  &gt; 5 , dependent sets also outperform
WDC 2015. We selected 50 webtable with at least 9 TUS and TUS+ in terms of performance. Interestingly,
columns and 900 records from the WDC 2015 [1] dataset. some of the webtables have both new rows and new
We checked them manually and pruned those with less columns for the query table, which explains why
diverthan 5 text columns or with non-English content and sity and dependent sets excel in this task.
ended up with 14 webtables as our base tables. We
sampled 101 tables from each base table, as discussed above, 5.3. Task 2: Query Table Extension
and designated one table as our query table and the
remaining 100 tables as candidate webtables. This gave us TUS may include webtables unsuitable for extending the
a total of 14 query tables and 1400 candidate webtables. query table with new columns. When all columns of a
WikiTables. We did a similar process as the one for the webtable is unionabile with at least one query column,
WDC dataset. We first selected top 100 webtables with at no new columns may be added to the query table. Figure
least 9 columns and 300 rows from the WikiTables dataset 2 shows that diversity and novelty outperform TUS and
[2]. Then, we pruned those with non-English content. In TUS+. Diversity focuses on returning webtables with
addition, since we wanted to guarantee diferent levels more new columns. Dependent sets also outperform
of unionability among query and candidate columns, we TUS and TUS+ for  &gt; 5 .
pruned those tables with many columns that had only
‘yes/no’ values in their content. Finally we ended up 6. Conclusions
with 12 base tables which we used to generate query
and candidate webtables from. We ended up with 12
query tables, one per each base table, and 1200 candidate
webtables, 100 candidates per each base table.
5.2. Task 1: Query Table Expansion
TUS struggles in expanding the query table with
additional rows. Figure 1 shows that novelty and diversity
outperform other approaches such as TUS and TUS+.</p>
      <p>Existing approaches for finding unionable tables
primarily focus on eficiently providing an approximate ranked
list of candidate tables to the user, which may limit access
to suitable tables for follow-up operations. In this paper,
we explore the power of preferences, showing that users
can retrieve tables that are not only unionable with the
query table but are also suitable for various follow-up
operations. A possible future direction is exploring more
eficient approaches to support preferences with TUS.
retrieval process, in: Proceedings of the twelfth
international conference on Information and
knowl[1] O. Lehmberg, D. Ritze, R. Meusel, C. Bizer, A large edge management, 2003, pp. 426–434.
public corpus of web tables containing time and [16] C. Kalyvas, T. Tzouramanis, A survey of skyline
context metadata, in: Proceedings of the 25th In- query processing, arXiv preprint arXiv:1704.01788
ternational Conference Companion on World Wide (2017).</p>
      <p>Web, 2016, pp. 75–76. [17] J.-H. Choi, F. Hao, A. Nasridinov, Hi-sky: Hash
[2] C. S. Bhagavatula, T. Noraset, D. Downey, Methods index-based skyline query processing, Applied
Scifor exploring and mining tables on wikipedia, in: ences 10 (2020) 1708.</p>
      <p>Proceedings of the ACM SIGKDD workshop on [18] I. Bartolini, P. Ciaccia, M. Patella, Salsa:
Computinteractive data exploration and analytics, 2013, pp. ing the skyline without scanning the whole sky, in:
18–26. Proceedings of the 15th ACM international
confer[3] S. Zhang, K. Balog, Recommending related tables, ence on Information and knowledge management,
arXiv preprint arXiv:1907.03595 (2019). 2006, pp. 405–414.
[4] Y. Zhang, Z. G. Ives, Finding related tables in data [19] D. Papadias, Y. Tao, G. Fu, B. Seeger, Progressive
lakes for interactive data science, in: Proceedings of skyline computation in database systems, ACM
the 2020 ACM SIGMOD International Conference Transactions on Database Systems (TODS) 30 (2005)
on Management of Data, 2020, pp. 1951–1966. 41–82.
[5] T. Cong, H. Jagadish, Pylon: Table union search [20] D. Rafiei, K. Bharat, A. Shukla, Diversifying web
through contrastive representation learning, arXiv search results, in: Proceedings of the 19th
interpreprint arXiv:2301.04901 (2023). national conference on World wide web, 2010, pp.
[6] F. Nargesian, E. Zhu, K. Q. Pu, R. J. Miller, Table 781–790.</p>
      <p>union search on open data, Proceedings of the [21] M. R. Vieira, H. L. Razente, M. C. Barioni, M.
HadVLDB Endowment 11 (2018) 813–825. jieleftheriou, D. Srivastava, C. Traina, V. J. Tsotras,
[7] R. Pimplikar, S. Sarawagi, Answering table queries On query result diversification, in: 2011 IEEE
on the web using column keywords, arXiv preprint 27th International Conference on Data Engineering,
arXiv:1207.0132 (2012). IEEE, 2011, pp. 1163–1174.
[8] L. Deng, Table2Vec: Neural word and entity embed- [22] M. R. Vieira, H. L. Razente, M. C. Barioni, M.
Haddings for table population and retrieval, Master’s jieleftheriou, D. Srivastava, C. Traina Jr, V. J.
Tsothesis, University of Stavanger, Norway, 2018. tras, Divdb: A system for diversifying query
re[9] Z. Chen, M. Trabelsi, J. Heflin, Y. Xu, B. D. Davison, sults, Proceedings of the VLDB Endowment 4 (2011)
Table search using a deep contextualized language 1395–1398.
model, in: Proceedings of the 43rd International [23] C. Yu, L. Lakshmanan, S. Amer-Yahia, It takes
vaACM SIGIR Conference on Research and Develop- riety to make a world: diversification in
recomment in Information Retrieval, 2020, pp. 589–598. mender systems, in: Proceedings of the 12th
in[10] A. Bogatu, A. A. Fernandes, N. W. Paton, N. Kon- ternational conference on extending database
techstantinou, Dataset discovery in data lakes, in: 2020 nology: Advances in database technology, 2009, pp.
IEEE 36th International Conference on Data Engi- 368–378.</p>
      <p>neering (ICDE), IEEE, 2020, pp. 709–720. [24] C. L. Clarke, M. Kolla, G. V. Cormack, O.
Vechto[11] K. Stefanidis, G. Koutrika, E. Pitoura, A survey mova, A. Ashkan, S. Büttcher, I. MacKinnon,
Novon representation, composition and application of elty and diversity in information retrieval
evalpreferences in database systems, ACM Transactions uation, in: Proceedings of the 31st annual
inon Database Systems (TODS) 36 (2011) 1–45. ternational ACM SIGIR conference on Research
[12] P. Ciaccia, D. Martinenghi, R. Torlone, Foundations and development in information retrieval, 2008, pp.
of context-aware preference propagation, Journal 659–666.</p>
      <p>of the ACM (JACM) 67 (2020) 1–43. [25] P. Castells, N. Hurley, S. Vargas, Novelty and
diver[13] W. Kießling, M. Endres, F. Wenzel, The preference sity in recommender systems, in: Recommender
sql system-an overview, IEEE Data Engineering systems handbook, Springer, 2022, pp. 603–646.</p>
      <p>Bulletin 34 (2011) 12–19. [26] T. Ghosal, T. Saikh, T. Biswas, A. Ekbal, P.
Bhat[14] A. Khatiwada, G. Fan, R. Shraga, Z. Chen, W. Gat- tacharyya, Novelty detection: A perspective from
terbauer, R. J. Miller, M. Riedewald, Santos: natural language processing, Computational
LinRelationship-based semantic table union search, guistics 48 (2022) 77–117.</p>
      <p>arXiv preprint arXiv:2209.13589 (2022). [27] P. Venetis, A. Y. Halevy, J. Madhavan, M. Pasca,
[15] A. Z. Broder, D. Carmel, M. Herscovici, A. Sofer, W. Shen, F. Wu, G. Miao, Recovering semantics of
J. Zien, Eficient query evaluation using a two-level tables on the web (2011).</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>