

An Efficient Diversity-Aware Method for the Empty-Answer Problem

Yuto Ikeda

ikeda.yuto@ist.osaka-u.ac.jp

Chuan Xiao

chuanx@ist.osaka-u.ac.jp

Makoto Onizuka

onizuka@ist.osaka-u.ac.jp

Osaka University, 1-5 Yamadaoka, Suita, Osaka 565-0871, Japan

This study tackles the empty-answer problem in database queries, where no results meet all user-specified conditions, though some may satisfy individual ones. We explore query relaxation, which removes certain conditions to obtain results, and a record search method that uses user preferences to evaluate records. In particular, we emphasize the importance of diversity in the results to better match user preferences, which has been ignored in existing approaches. To address this, we introduce the use of Maximal Marginal Relevance (MMR), a ranking function balancing query relevance and record diversity, for query relaxation, proposing a method that searches for diverse record sets while maintaining many conditions. Experiments with real-world datasets demonstrated that the proposed method significantly increases search speed (up to 300 times faster) while maintaining high MMR scores, indicating an effective balance between efficiency and result diversity.

Keywords: empty-answer problem, query relaxation, maximal marginal relevance

1. Introduction

In various applications, setting desired conditions and searching for data is a fundamental operation. Ideally, searches should yield a small number of records that match the specified conditions. However, the number of records retrieved can vary significantly depending on the user's conditions, resulting in either too many records (the many-answer problem) or no records at all (the empty-answer problem) [1]. Whereas the many-answer problem can be addressed by presenting the top-$k$ results (e.g., with a LIMIT clause), the empty-answer problem is more challenging. The causes of the empty-answer problem can be categorized into two scenarios: (1) records satisfy each condition individually but not collectively due to multiple conditions, and (2) the conditions are invalid, such as searching for records that do not meet pre-set constraints in the data. In this paper, we target the empty-answer problem and address the first scenario, where potentially all records in the database fall within the search scope.

To solve the empty-answer problem, existing methods are broadly classified into two approaches. The first is the query relaxation method [2, 3, 4], where conjunctive conditions set by the user are reduced to solve the empty-answer problem. The second is the ranking method [5], which translates user preferences into a function to evaluate records. It effectively reflects user preferences, especially when users can design the ranking function.

In the record set recommended to the user, the relevance to the query and the diversity within the recommended set are both important, and diversity is key to capturing user preferences [6]. While various approaches have been proposed, none have focused on the diversity of the record set after solving the empty-answer problem in the database field. Meanwhile, Maximal Marginal Relevance (MMR) [7], a ranking function balancing query relevance and record diversity, has been proposed and widely used for information retrieval.

In this study, we aim to solve the empty-answer problem by considering diversity and utilizing MMR as the ranking function. We formalize MMR for relational database applications and propose a method for quickly finding a record set that maximizes MMR. We also devise a series of relaxed query search and record search techniques tailored to this objective. They broaden results by removing some conditions and evaluate records based on user-defined functions, respectively. In addition, we utilize cardinality estimation techniques to further optimize the search process. Experiments with real-world datasets show that our method significantly increases search speed (up to more than 300 times faster) while maintaining high MMR scores and outperforming baseline approaches.

DOLAP 2024: 26th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data

https://sites.google.com/site/chuanxiao1983 (C. Xiao); http://

ORCID: 0000-0001-5559-8300 (M. Onizuka)

2. Preliminaries

We denote the query provided by the user as $Q$, and assume that it is a conjunction of $n$ conditions, with each condition represented as $c_i$ for $0 \le i \le n-1$. Then, the query can be expressed as:

$$Q = \bigwedge_{i=0}^{n-1} c_i. \tag{1}$$

Next, we define the set of conditions of the query as $C$, and the power set of $C$ as $P(C)$. A relaxed query $Q'$, derived from $Q$, is formulated using a proper subset $S \in P(C) \setminus \{C\}$ as follows:

$$Q' = \bigwedge_{c \in S} c.$$

Since the number of non-empty elements in $P(C)$ is $2^n - 1$, there are $2^n - 2$ candidates for $Q'$, excluding $Q$ itself. We denote the total number of records in the database as $N$, the total number of columns in a record as $m$, and the number of records the user wishes to obtain as a query result as $k$.
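To make the size of the candidate space concrete, the following minimal Python sketch enumerates the relaxed queries of a small query. It assumes each condition is represented as a plain string; the function name `relaxed_queries` and the three example conditions are hypothetical and do not come from the paper.

```python
from itertools import combinations


def relaxed_queries(conditions):
    """Yield every non-empty proper subset of the query's conditions."""
    n = len(conditions)
    # Sizes 1 .. n-1 skip both the empty set and the original query itself.
    for size in range(1, n):
        for subset in combinations(conditions, size):
            yield subset


# Hypothetical conditions, used only for illustration.
conds = ["price < 10000", "year > 2015", "color = 'red'"]
candidates = list(relaxed_queries(conds))
print(len(candidates))  # 2**3 - 2 candidates
```

Because the count doubles with every added condition, exhaustive enumeration is only practical for very small queries, which is what motivates a greedy search.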

3. Proposed Method

3.1. Base Algorithm

MMR. MMR is a combined score comprising (1) the relevance of the recommended record set to the user-specified query, and (2) the diversity within the recommended record set. When the query is $Q$ and the recommended record set is defined as $R'$, MMR is defined as follows [7]:

$$\mathrm{MMR}(Q, R') = \mathrm{Rel}(Q, R') + \lambda\, \mathrm{Div}(R'), \tag{2}$$

where $\lambda$ is a parameter that adjusts the balance between relevance and diversity.

We define relevance and diversity as follows:

$$\mathrm{Rel}(Q, R') = \frac{1}{|R'|} \sum_{r \in R'} \frac{1}{n} \sum_{i=0}^{n-1} f(c_i, r), \tag{3}$$

$$\mathrm{Div}(R') = \min_{r, r' \in R',\, r \neq r'} \mathrm{dist}(r, r') \quad \text{if } |R'| \neq 1, \tag{4}$$

where $f(c_i, r)$ returns 1 if the record $r$ satisfies the query condition $c_i$, and 0 otherwise, $n$ is the number of conditions in $Q$, and $|R'|$ denotes the number of records in $R'$. The minimum in Equation 4 applies when $|R'| \neq 1$; a set containing a single record is handled as a separate case. Intuitively, Equation 3 represents the average ratio of the number of matching conditions to the total number of conditions for each record in $R'$. Additionally, the $\mathrm{dist}$ function in Equation 4 defines the distance between records, and any distance function can be applied. In this study, the Manhattan distance is used for numerical data, and the Hamming distance for categorical or binary data.
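As an illustration of the relevance and diversity scores described above, the following Python sketch computes both for a tiny record set. It is a sketch under stated assumptions, not the authors' implementation: records are tuples, conditions are Python predicates, the combined score is taken to be relevance plus `lam` times diversity, and a single-record set is given diversity 0.0. All function names are hypothetical.

```python
def hamming(a, b):
    """Hamming distance for categorical or binary records."""
    return sum(x != y for x, y in zip(a, b))


def manhattan(a, b):
    """Manhattan distance for numerical records."""
    return sum(abs(x - y) for x, y in zip(a, b))


def relevance(records, conditions):
    """Average fraction of conditions satisfied by each record."""
    n = len(conditions)
    per_record = [sum(1 for cond in conditions if cond(r)) / n for r in records]
    return sum(per_record) / len(records)


def diversity(records, dist=hamming):
    """Minimum pairwise distance; 0.0 for fewer than two records (assumption)."""
    if len(records) < 2:
        return 0.0
    return min(dist(a, b) for i, a in enumerate(records) for b in records[i + 1:])


def mmr(records, conditions, lam=0.5, dist=hamming):
    """Assumed weighting: relevance + lam * diversity."""
    return relevance(records, conditions) + lam * diversity(records, dist)


# Two boolean records and two conditions (column 0 is set, column 2 is set).
records = [(1, 0, 1), (0, 1, 1)]
conditions = [lambda r: r[0] == 1, lambda r: r[2] == 1]
score = mmr(records, conditions, lam=0.5)
```

In this example the first record satisfies both conditions and the second satisfies one, so the relevance is 0.75; the two records differ in two columns, so the diversity is 2.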

Problem Definition.

The problem is defined as follows: given a query $Q$ that yields an empty answer for a single-table dataset $D$, the objective is to identify a subset of records $R' \subseteq D$ consisting of $k$ records, where $k$ is the number designated by the user, that maximizes $\mathrm{MMR}(Q, R')$.

MMR is a metric designed for evaluating a set of records.

To identify the recommended record set that maximizes MMR, it is necessary to calculate and compare the distances between all pairs of the records. This exhaustive process becomes impractical, particularly for large inputs. As a result, methods have been proposed to search for the recommended record set in a greedy manner [8] for the sake of efficiency.

To identify a recommended record set that maximizes the MMR score, it is necessary to calculate the distance between every pair of records (as per Equation 4), a task that is computationally intensive. Consequently, existing methods [8, 9], including the state-of-the-art method [9], employ heuristic approaches to find approximate answers.
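The greedy idea mentioned above can be sketched as follows. This is a generic greedy heuristic written for illustration, not the specific algorithm of [8] or [9]: it starts from the most relevant record and repeatedly adds the record that gives the best score for the enlarged set, caching each record's distance to its nearest selected record. The weighting (relevance plus `lam` times diversity), the tuple representation of records, and all names are assumptions.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def record_relevance(record, conditions):
    return sum(1 for cond in conditions if cond(record)) / len(conditions)


def greedy_mmr(records, conditions, k, lam=0.5, dist=hamming):
    """Greedily pick up to k records, one per round."""
    if not records or k <= 0:
        return []
    rels = [record_relevance(r, conditions) for r in records]
    first = max(range(len(records)), key=rels.__getitem__)
    chosen = [first]
    # min_d[i]: distance from record i to its nearest already-chosen record.
    min_d = [dist(r, records[first]) for r in records]
    set_div = float("inf")  # minimum pairwise distance inside the chosen set
    rel_sum = rels[first]
    while len(chosen) < min(k, len(records)):
        best, best_score = None, float("-inf")
        for i in range(len(records)):
            if i in chosen:
                continue
            # Score of the chosen set if record i were added to it.
            score = (rel_sum + rels[i]) / (len(chosen) + 1) + lam * min(set_div, min_d[i])
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
        rel_sum += rels[best]
        set_div = min(set_div, min_d[best])
        min_d = [min(d, dist(r, records[best])) for d, r in zip(min_d, records)]
    return [records[i] for i in chosen]


records = [(1, 1, 0), (1, 1, 0), (1, 0, 1), (0, 0, 1)]
conditions = [lambda r: r[0] == 1, lambda r: r[1] == 1]
picked = greedy_mmr(records, conditions, k=2, lam=0.2)
```

With `lam=0.2` the second pick is `(1, 0, 1)` rather than the exact duplicate of the first record, because the duplicate adds no diversity. Each round scans every record once, so this sketch still touches the whole dataset.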

Our approach also utilizes a heuristic method to efficiently find approximate solutions. Considering that a query can be viewed as an abstraction or specification of its resulting records, we introduce the concept of query-level MMR as a preprocessing step for record-level MMR. We first generate multiple relaxed queries from the original query, aiming to maximize query-level MMR (as shown in the top right corner of the figure). Subsequently, we acquire the results of these relaxed queries (depicted at the bottom right corner) and select a recommended record set from these results that maximizes the record-level MMR.

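We define query-level MMR by substituting a recommended record set ($R'$) with a relaxed query set ($\mathcal{Q}'$) in Equation 2 as follows:

$$\mathrm{QMMR}(Q, \mathcal{Q}') = \mathrm{QRel}(Q, \mathcal{Q}') + \lambda\, \mathrm{QDiv}(\mathcal{Q}'). \tag{5}$$

Additionally, we introduce metrics for query relevance and diversity. Query relevance calculates the similarity between the original query and a relaxed query set, while query diversity measures the diversity within the relaxed query set. These metrics are defined as follows:

$$\mathrm{QRel}(Q, \mathcal{Q}') = \frac{1}{|\mathcal{Q}'|} \sum_{Q' \in \mathcal{Q}'} \frac{|Q'|}{|Q|}, \tag{6}$$

$$\mathrm{QDiv}(\mathcal{Q}') = \min_{Q', Q'' \in \mathcal{Q}',\, Q' \neq Q''} \left(1 - \frac{|Q' \cap Q''|}{|Q' \cup Q''|}\right), \tag{7}$$

where $|Q|$ denotes the number of conditions in a query $Q$.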
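The query-level metrics can be illustrated with a short Python sketch. It assumes that each query is a `frozenset` of condition names, that query relevance is the average fraction of original conditions kept, that query diversity is the minimum pairwise Jaccard distance between relaxed queries, and that the two are combined as relevance plus `lam` times diversity. These readings and all names are assumptions made for the example.

```python
def query_relevance(original, relaxed_set):
    """Average fraction of the original conditions kept by each relaxed query."""
    return sum(len(q) / len(original) for q in relaxed_set) / len(relaxed_set)


def jaccard_distance(a, b):
    """1 minus the overlap ratio of two condition sets."""
    return 1 - len(a & b) / len(a | b)


def query_diversity(relaxed_set):
    """Minimum pairwise Jaccard distance; 0.0 for fewer than two queries (assumption)."""
    if len(relaxed_set) < 2:
        return 0.0
    return min(
        jaccard_distance(a, b)
        for i, a in enumerate(relaxed_set)
        for b in relaxed_set[i + 1:]
    )


def query_mmr(original, relaxed_set, lam=0.5):
    """Assumed weighting: relevance + lam * diversity."""
    return query_relevance(original, relaxed_set) + lam * query_diversity(relaxed_set)


original = frozenset({"c1", "c2", "c3"})
relaxed = [frozenset({"c1", "c2"}), frozenset({"c2", "c3"})]
q_score = query_mmr(original, relaxed, lam=0.5)
```

Here each relaxed query keeps two of three conditions, and the two queries share one of three distinct conditions, so both the relevance and the Jaccard distance equal 2/3.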

Our method comprises two stages: ( 1 ) searching for relaxed queries that maximize query-level MMR and obtaining their results; ( 2 ) selecting a recommended record set that maximizes record-level MMR from these results.

By relaxing the original query provided by the user, records that satisfy the relaxed query are considered as candidates for the recommended record set. These candidate records should ideally satisfy more conditions from the user's query while contributing to the diversity of the recommended record set. To achieve this, relaxed queries are selected to maximize query-level MMR.

As discussed in Section 2, there are $2^n - 2$ potential candidates for relaxed queries, where $n$ is the number of conditions in the original query. Directly searching through all these candidates is impractical. In our proposed method, we record pairs of conditions used in previous relaxed queries and determine conditions greedily one by one for new relaxed queries. This approach ensures diversity in the records derived from relaxed queries by varying the conditions within each group, and it can be achieved with a time complexity of $O(n^2)$.

For specific processing, initially, from the conditions in the original query, we select one condition that has been used the least in previous relaxed query groups for the new relaxed query. For subsequent conditions, when the set of already determined conditions is $C'$, the condition $c$ that minimizes $\sum_{c' \in C'} M(c, c')$ is chosen. Here, $M$ is a function that records the frequency of condition pairs appearing in relaxed queries from past iterations. For instance, if a user's query contains conditions $c_1, c_2, c_3$, and the first recommended record search included $c_1, c_2$ in its relaxed query, then $M(c_1, c_2) = 1$ and $M(c_1, c_3) = 0$.

In case of multiple conditions minimizing the function $M$, priority is given to the condition least used in prior relaxed query groups. Finally, records satisfying all query conditions just before the relaxed query yields no results are considered as candidates. After determining the relaxed query, all condition pairs in this query are recorded.

When searching for a relaxed query, records are progressively narrowed down with each determined condition. Starting from the second condition, evaluation is conducted only on the record set that meets all previously established conditions. This strategy significantly reduces the number of records evaluated for each condition.
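The greedy construction of one relaxed query can be sketched in Python as follows. This is a simplified reading of the steps above, not the authors' code: conditions are a dictionary of hypothetical predicates, pair frequencies are kept in a `Counter` keyed by `frozenset` pairs, the condition name serves as a final deterministic tie-break (an assumption), and the search stops at the first condition that would leave no records.

```python
from collections import Counter
from itertools import combinations


def build_relaxed_query(conditions, records, used, pairs):
    """Pick conditions one by one; return (chosen condition names, candidate records)."""
    chosen, current = [], list(records)
    remaining = set(conditions)
    while remaining:
        def priority(c):
            # Co-occurrence with already chosen conditions, then past usage, then name.
            co = sum(pairs[frozenset((c, d))] for d in chosen)
            return (co, used[c], c)

        nxt = min(remaining, key=priority)
        remaining.discard(nxt)
        # Evaluate only the records that met all previously chosen conditions.
        narrowed = [r for r in current if conditions[nxt](r)]
        if not narrowed:
            break  # keep the records found just before the result became empty
        chosen.append(nxt)
        current = narrowed
    used.update(chosen)
    pairs.update(frozenset(p) for p in combinations(chosen, 2))
    return chosen, current


conds = {
    "c1": lambda r: r[0] == 1,
    "c2": lambda r: r[1] == 1,
    "c3": lambda r: r[2] == 1,
}
records = [(1, 1, 0), (1, 0, 1), (0, 1, 1)]  # no record satisfies all three
used, pairs = Counter(), Counter()
q1, cand1 = build_relaxed_query(conds, records, used, pairs)
after_first = (pairs[frozenset(("c1", "c2"))], pairs[frozenset(("c1", "c3"))])
q2, cand2 = build_relaxed_query(conds, records, used, pairs)
```

In this run the first relaxed query keeps `c1` and `c2`, matching the example in the text where the pair count is 1 for $(c_1, c_2)$ and 0 for $(c_1, c_3)$. The second query then starts from the unused `c3` and returns a different candidate record.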

After identifying candidates for the recommended record set, we search for the record that maximizes the MMR from that set. In doing so, we maintain the minimum distance between the records selected in the recommended record set to reduce computational costs. However, unlike existing methods, our proposed method does not conduct a full search of records, rendering the avoidance of duplicate calculations for additional record candidates infeasible.

3.2. Complexity Analysis

The exact time complexity of the proposed method is influenced by the proportion of records satisfying each condition in the query and the dependency relationship between the sets of records satisfying multiple conditions, which makes it challenging to calculate precisely. We therefore consider the time complexity under the general assumption that there is no dependency relationship between the values of each column.

If the cardinality of condition $c_i$, expressed as the fraction of records that satisfy it, is $s_i$ ($0 \le s_i \le 1$), the time complexity of the proposed method is expressed as a sum of three terms incurred in each of the $k$ searches for a recommended record. The first term, $O(n^2)$ for a query with $n$ conditions, relates to determining conditions for the relaxed query $Q'$ and recording the conditions used in $Q'$, and the second term pertains to the complexity of calculating MMR for the record set obtained from $Q'$ and determining the recommended record set. The number of additional records is not factored in here. The third term, $N \sum_{i=0}^{n-1} \prod_{j=0}^{i} s_j$ for a database of $N$ records, relates to the complexity of searching $Q'$. The sum inside represents the number of records evaluated for each condition.

Considering these parameters, typically $n$ and $k$ are up to 100 or smaller, while $N$ is often larger. Therefore, when $n, k \ll N$, the time complexity approximates to:

$$O\left(kN \sum_{i=0}^{n-1} \left(\prod_{j=0}^{i} s_j\right)\right). \tag{8}$$

For estimating complexity, consider a scenario where the selection rate is identical for all conditions. If $s_i = s$ for all $i$ in Equation 8, we have

$$O\left(kN \sum_{i=0}^{n-1} \left(\prod_{j=0}^{i} s\right)\right) \le O\left(kN \sum_{i=0}^{\infty} s^i\right) = O\left(\frac{kN}{1 - s}\right). \tag{9}$$
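The geometric bound can be checked numerically with a short Python sketch. It assumes one reading of the sum in Equation 8, in which the term for each condition is the number of records that survive it, and it counts a single relaxed-query search under the independence assumption. The function name and the selection rate of 0.5 are hypothetical; the record count matches the size of the Cars dataset used in the experiments.

```python
def evaluated_records(num_records, selectivities):
    """Sum of the surviving record counts after each successive condition."""
    remaining = float(num_records)
    total = 0.0
    for s in selectivities:
        remaining *= s  # records surviving this condition
        total += remaining
    return total


N, s, n = 128_443, 0.5, 10
exact = evaluated_records(N, [s] * n)
bound = N / (1 - s)  # closed form of the infinite geometric series
```

For a selection rate of 0.5, the sum stays below twice the number of records no matter how many conditions the query has, which is why the number of conditions drops out of the bound.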

3.3. Further Optimizations

We revisit the proposed method. When determining the conditions of a query, the method does not explicitly address scenarios involving multiple relevant conditions. The primary objectives for obtaining a relaxed query, as mentioned earlier, are to maintain high diversity among queries and to retain as many conditions from the user's query as possible. In scenarios where conditions have the same priority in the co-occurrence matrix, selecting any of these conditions would similarly uphold the diversity among queries. Consequently, when prioritizing these conditions, the focus should be on obtaining a relaxed query that preserves more conditions from the user's query. For optimization, we propose to select, from conditions with equal priority in the co-occurrence matrix, those likely to yield more records after evaluation.

Methods for exploring the cardinality of conditions include: (1) direct evaluation of the condition to calculate the exact cardinality, and (2) utilizing cardinality estimation for an approximate value. Each of these methods presents its own advantages and disadvantages. Direct evaluation provides precise cardinality but may lead to longer execution times, especially when evaluating a few conditions over a large number of records. In contrast, cardinality estimation offers more consistent execution times regardless

Dataset. We follow the existing research on query relaxation [4] and use the Cars dataset [13], which was released by Mottin et al. After removing duplicate records, this dataset comprises 128,443 records with 31 columns, all containing boolean values. We employed 167 queries, which are the queries used in existing research [4] and consist of 4 to 10 conditions each.

Competitors. In addition to the method proposed in Section 3 (proposed base method) and the enhanced approach described in Section 3.3 (proposed optimized method), we included two comparison methods in our experiments: a greedy method targeting the entire dataset (greedy) and a random selection method (random). The threshold for switching from cardinality estimation to direct execution in the proposed optimized method is set at 10.

Environment. All experiments were performed on a macOS Ventura 13.2.1 machine equipped with an Apple M2 CPU and 24 GB of main memory. For the implementation of all algorithms in the experiments, Python 3.8 was used. We employed PostgreSQL for data storage, ensuring a unified experimental environment between the proposed base method and the exhaustive search method. The reported execution times exclude the dataset loading times. To implement the cardinality estimation in the proposed optimized method, DeepDB [10] was utilized. The model used in this experiment was prepared beforehand.

5. Conclusion

In this study, we formulated an evaluation metric that considers both diversity and relevance in the field of databases. We proposed a method that solves the empty-answer problem by finding a record set to present more quickly than existing methods. We further improved the method by employing cardinality estimation. In addition, we conducted an empirical evaluation on a dataset and queries used in previous research, confirming that our method achieves significant speed improvements while maintaining accuracy.

Acknowledgements

This work is supported by JSPS Kakenhi 22H03903, 23H03406, 23K17456, and CREST JPMJCR22M2.

[1] Basu Roy, Wang, G. Das, U. Nambiar, M. Mohania, Minimum-effort driven dynamic faceted search in structured databases, in: CIKM, 2008, pp. 13-22.

[2] Koudas, Li, A. K. Tung, Vernica, Relaxing join and selection queries, in: VLDB, 2006, pp. 199-210.

[3] Mishra, Koudas, Interactive query refinement, in: EDBT, 2009, pp. 862-873.

[4] Mottin, Marascu, S. B. Roy, G. Das, Palpanas, Velegrakis, A holistic and principled approach for the empty-answer problem, The VLDB Journal 25 (2016) 597-622.

[5] Agrawal, Chaudhuri, G. Das, A. Gionis, Automated ranking of database query results, CIDR (2003).

[6] Kaminskas, Bridge, Diversity, serendipity, novelty, and coverage: a survey and empirical analysis of beyond-accuracy objectives in recommender systems, ACM Transactions on Interactive Intelligent Systems 7 (2016) 1-42.

[7] Carbonell, J. Goldstein, The use of MMR, diversity-based reranking for reordering documents and producing summaries, in: SIGIR, 1998, pp. 335-336.

[8] Catallo, E. Ciceri, Fraternali, Martinenghi, Tagliasacchi, Top-k diversity queries over bounded regions, ACM Transactions on Database Systems 38 (2013) 1-44.

[9] Hirata, Amagata, Fujita, T. Hara, Solving diversity-aware maximum inner product search efficiently and effectively, in: CIKM, 2022, pp. 198-207.

[10] Hilprecht, Schmidt, Kulessa, Molina, Kersting, Binnig, DeepDB: Learn from data, not from queries!, Proceedings of the VLDB Endowment 13 (2020) 992-1005.

[11] Yang, Liang, Kamsetty, Wu, Duan, Chen, Abbeel, J. M. Hellerstein, Krishnan, I. Stoica, Deep unsupervised cardinality estimation, arXiv preprint arXiv:1905.04278 (2019).

[12] Ito, Sasaki, Xiao, Onizuka, Scardina: Scalable join cardinality estimation by multiple density estimators, arXiv preprint arXiv:2303.18042 (2023).

[13] Mottin, S. B. Roy, Marascu, G. Das, Palpanas, Velegrakis, A holistic and principled approach for the empty-answer problem, https://helios2.mi.parisdescartes.fr/~themisp/queryrelaxation, 2023.