<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Dense Retrieval of Open Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qiaosheng Chen</string-name>
          <email>qschen@smail.nju.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>State Key Laboratory for Novel Software Technology, Nanjing University</institution>
          ,
          <addr-line>Nanjing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The rapid growth of open data has intensified the need for effective dataset search capabilities. This research proposal focuses on enhancing dataset search through content-based dense retrieval, addressing the limitations of current metadata-dependent systems. The research aims to tackle the challenges of dataset size, heterogeneity, and the creation of a comprehensive test collection for evaluation. The proposed research methods include data summarization techniques for large datasets and a unified representation of heterogeneous data, both inspired by research related to the Semantic Web. Additionally, the research will explore a coarse-to-fine tuning strategy for dense retrieval models, leveraging data augmentation through distant supervision and self-training. The evaluation plan involves constructing a content-based test collection and comparing retrieval performance between metadata-only and content-enhanced approaches. The expected outcome is the development of effective content-based dataset search solutions, ultimately improving data findability.</p>
      </abstract>
      <kwd-group>
        <kwd>Dataset Search</kwd>
        <kwd>Dense Retrieval</kwd>
        <kwd>Open Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        The growing availability and importance of open data have increased interest in, and reliance on,
dataset search within the field of information retrieval [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, existing approaches and systems, exemplified by Google
Dataset Search [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], rely predominantly on metadata (descriptive
text about a dataset, such as its title and description), which often suffers from low quality and limited
availability. As a result, metadata-based approaches fall short in accurately capturing the
relevance of datasets. Bridging the gap between users’ real data needs and the quality of
dataset metadata requires a shift towards content-based approaches that can effectively
harness the richness of dataset content [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
      <p>
        On the other hand, dense retrieval models, which have become mainstream in the field
of document retrieval [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], have not yet been fully explored in the field of dataset search. In
particular, how to apply dense retrieval models to content-based dataset search problems still
faces many challenges. First, the large size of dataset content poses computational challenges,
especially when it exceeds the processing capacity of standard dense retrieval models which are
mainly based on pre-trained language models (PLMs). Additionally, the heterogeneity of dataset
content, spanning various data formats and domains [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], further complicates the development
of unified content-based search solutions.
      </p>
      <p>Proceedings of the Doctoral Consortium at ISWC 2024, co-located with the 23rd International Semantic Web Conference
(ISWC 2024)
https://cqsss.github.io (Q. Chen)</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>The proposed research aims to contribute to the development of robust and effective
content-based solutions for dataset search, ultimately improving the findability and reusability of
open datasets. Users across various domains, including researchers, data scientists, policymakers,
and businesses, will benefit from content-based dataset search, while researchers
in fields such as information retrieval, natural language processing (NLP), and machine learning
are particularly invested in its advancement. Industries that rely heavily on data-driven
decision-making, such as healthcare, finance, agriculture, and environmental science, also have a stake
in its development. Beyond information retrieval, this research involves
technologies relevant to the Semantic Web and Knowledge Graphs (KGs). RDF datasets represent
a significant part of open data. Moreover, employing ontologies or KGs as a framework can
aid in analyzing the content of open datasets and processing heterogeneous data from a unified
perspective. The advancement of dataset search also stands to catalyze the realization of findable,
accessible, interoperable, and reusable (FAIR) open data within the Semantic Web community.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <sec id="sec-3-1">
        <title>2.1. Dataset Search</title>
        <p>In this section, we review recent advancements in dataset search and dense retrieval, highlighting
limitations of current dataset search methods and examining strengths of dense retrieval
techniques.</p>
        <p>
          Dataset search has garnered increasing attention with the proliferation of diverse and
voluminous datasets, prompting the development of search approaches and systems [
          <xref ref-type="bibr" rid="ref1 ref7">1, 7</xref>
          ]. Notably,
Google Dataset Search [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] has paved the way as a pioneering dataset search engine, enabling
keyword retrieval over published metadata of Web datasets. However, its reliance on metadata
limits its effectiveness in supporting queries oriented towards dataset content. Moreover,
existing dataset retrieval test collections [
          <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
          ] primarily depend on metadata annotations during
construction, resulting in a lack of evaluation benchmarks for content-based dataset search.
        </p>
        <p>
          Recent studies have highlighted the importance of integrating considerations for dataset
content to enhance search effectiveness. Ota et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] utilized value co-occurrence information
within tabular datasets to infer attribute domains, while Chen et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] proposed a BERT-based
ranking model for table retrieval, focusing on selecting the most salient table items as
representatives of the entire dataset. StruBERT [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] introduced a structure-aware BERT model to capture
both structural and textual information of tabular datasets. Moreover, existing tabular dataset
or RDF dataset search systems such as Auctus [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], LODAtlas [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], and CKGSE [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] leverage
dataset content to augment retrieval capabilities and enhance user search experiences. However,
these efforts primarily focus on single-format data, such as tabular or RDF data, overlooking the
challenges posed by multi-format datasets.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Dense Retrieval</title>
        <p>
          Recent advancements in dense retrieval have been significantly influenced by the incorporation
of PLMs, which have demonstrated remarkable capabilities in capturing semantic nuances
within text [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. This approach, often referred to as dense retrieval, leverages the dense vector
representations (embeddings) of text to facilitate semantic matching between queries and
documents. Notably, Karpukhin et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] presented dense passage retrieval (DPR) for open-domain
question answering, highlighting the effectiveness of PLMs in this context. Their work
has been seminal in shaping subsequent research. The concept of using multiple representations
for improved text encoding has been explored by Humeau et al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] through their poly-encoder
architectures, which allow for richer semantic interactions between queries and texts. The
challenge of training efficient and robust dense retrievers has been addressed by various works.
For instance, Gao and Callan [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] introduced Condenser, a pre-training architecture specifically
designed to improve dense retrieval. Nogueira et al. [20] demonstrated the effectiveness of
multi-stage document ranking using BERT, showcasing how PLMs can be effectively integrated into
the reranking stage. Furthermore, ColBERT [21] has provided insights into efficient and effective
passage search through contextualized late interaction over BERT. Most current dense
retrieval methods focus on retrieving text documents or passages, whereas the structured content
of datasets requires new dense model structures or retrieval strategies. The large data size and
complex heterogeneity also make it difficult to directly treat dataset content as plain text.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Problem Statement</title>
      <p>In this section, we discuss the typical composition of datasets, outline the problem of
content-based dataset search, and formulate hypotheses and research questions derived from our
investigation.</p>
      <p>To clarify the distinction between dataset search and general document search, we first
introduce the composition of a dataset, which consists of the following two parts:
        <list list-type="bullet">
          <list-item>
            <p>Metadata: This part includes descriptive fields provided by the dataset publisher, such as
title, description, publishing organization, and other useful information about the dataset.</p>
          </list-item>
          <list-item>
            <p>Data Files: A dataset consists of various data files, potentially in different formats. This
research only considers textual data files, including unstructured TXT, PDF, and DOC
files, as well as structured files such as graph data (RDF, OWL), tabular data (CSV, XLS),
and key-value pair data (JSON, XML). Images (JPEG, PNG), videos (AVI, MPG), audio
(WAV, MP3), and other non-textual formats are excluded from the scope of this research.</p>
          </list-item>
        </list>
      </p>
      <p>
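        The research focuses on ad hoc dataset retrieval [
        <xref ref-type="bibr" rid="ref3 ref8">3, 8</xref>
        ], the foundational form of dataset
search. This process involves retrieving, from a collection <inline-formula><tex-math>$D$</tex-math></inline-formula> of datasets, a ranked list of
datasets <inline-formula><tex-math>$\langle d_1, d_2, \ldots \rangle$</tex-math></inline-formula> that are most relevant to a keyword query <inline-formula><tex-math>$q$</tex-math></inline-formula>. The relevance assessment
between query <inline-formula><tex-math>$q$</tex-math></inline-formula> and each dataset <inline-formula><tex-math>$d \in D$</tex-math></inline-formula> is conducted independently of other datasets <inline-formula><tex-math>$d' \in D$</tex-math></inline-formula>,
where <inline-formula><tex-math>$d \neq d'$</tex-math></inline-formula>. The primary objective is to compute the relevance score of each dataset <inline-formula><tex-math>$d \in D$</tex-math></inline-formula>
to a given keyword query <inline-formula><tex-math>$q$</tex-math></inline-formula>. The prevalent dense retrieval paradigm typically employs a PLM
as an encoder <inline-formula><tex-math>$E(\cdot)$</tex-math></inline-formula> to encode a dataset <inline-formula><tex-math>$d$</tex-math></inline-formula> and a query <inline-formula><tex-math>$q$</tex-math></inline-formula> into vectors <inline-formula><tex-math>$\mathbf{v}_d$</tex-math></inline-formula> and <inline-formula><tex-math>$\mathbf{v}_q$</tex-math></inline-formula>, respectively.
Subsequently, it computes the similarity score between these vectors to gauge the relevance
of <inline-formula><tex-math>$d$</tex-math></inline-formula> to <inline-formula><tex-math>$q$</tex-math></inline-formula>.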
      </p>
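      <p>The encode-then-compare paradigm described above can be illustrated with a minimal sketch. The encoder here is a hypothetical stand-in (a hashed bag-of-words vector) rather than a PLM, and the function names, the embedding size, and the choice of cosine similarity are assumptions made for illustration only.</p>
      <preformat><![CDATA[
```python
import hashlib
import math
from typing import Dict, List

DIM = 64  # size of the toy embedding space (an arbitrary, assumed value)


def encode(text: str) -> List[float]:
    """Toy stand-in for the encoder: hashed bag-of-words, L2-normalised.

    A real system would call a PLM here; this function is hypothetical.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def score(v_q: List[float], v_d: List[float]) -> float:
    """Dot product of normalised vectors, i.e. cosine similarity."""
    return sum(a * b for a, b in zip(v_q, v_d))


def rank(query: str, collection: Dict[str, str]) -> List[str]:
    """Score every dataset independently and return ids by descending score."""
    v_q = encode(query)
    scored = {d_id: score(v_q, encode(text)) for d_id, text in collection.items()}
    return sorted(scored, key=lambda d_id: scored[d_id], reverse=True)
```
]]></preformat>
      <p>Because each dataset is scored independently of the others, dataset vectors can be computed once offline and only the query needs to be encoded at search time.</p>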
      <p>
        According to studies on metadata quality [22, 23], the metadata of open datasets on the Web
often lacks guaranteed quality and is underutilized by both publishers and users. Meanwhile,
dense retrieval models based on PLMs have exhibited increasingly powerful text understanding
capabilities with advancements in NLP [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Hence, the application of dense retrieval models in
dataset search becomes imperative. Based on these findings, we propose the following main
hypothesis and research question:
      </p>
      <p>Hypothesis. Dataset metadata quality often varies and may not fully describe the content.
Users frequently seek information from the actual data files, and content-focused queries may
not align well with the available metadata.</p>
      <p>RQ0. To what extent can content-based dense dataset retrieval methods outperform
traditional metadata-centered approaches?</p>
      <p>Building upon the hypothesis and RQ0, this research investigates the application of dense
models to content-based dataset retrieval. Nonetheless, representing and indexing complex
dataset content with PLM-based dense models presents substantial challenges. To address these
challenges, we decompose RQ0 into the following four specific research questions:</p>
      <p>RQ1. How to overcome the challenge presented by the extensive size of dataset content,
especially when it exceeds the processing capacity of dense retrieval models?</p>
      <p>RQ2. How to address the heterogeneity of dataset content, encompassing variations in
formats?</p>
      <p>RQ3. How to develop a dataset retrieval test collection that considers the content of
datasets, rather than relying solely on metadata for annotation?</p>
      <p>RQ4. How to enhance the size and quality of existing public dataset retrieval test collections,
particularly in terms of providing sufficient training data for dense retrieval models?</p>
    </sec>
    <sec id="sec-5">
      <title>4. Research Methods</title>
      <p>
        In this section, we provide a detailed and systematic research methodology that outlines how
we address each research question (RQ1-RQ4) and validate our hypotheses [
        <xref ref-type="bibr" rid="ref3">3, 24, 25, 26</xref>
        ]. The
methodology is structured to ensure a comprehensive and coherent approach to solving the
challenges of content-based dense retrieval of open datasets.
      </p>
      <p>RQ1. To overcome the challenge posed by the large size of dataset content that exceeds
the input capacity of PLMs, this research proposed an approach involving the extraction of
a data summary for each dataset. Starting with RDF datasets, we introduced a technique to
handle large RDF datasets by extracting a compact, representative subset of RDF triples [25].
This subset was selected to preserve the semantic integrity of the dataset and was used to
create a document representation that fits within the token limit of dense ranking models.
We employed two of the existing static RDF dataset summarization methods, IlluSnip [27, 28]
selecting top-ranked RDF triples covering the most frequent classes, properties, and entities,
and PCSG [29] extracting a connected subgraph from an RDF graph covering as many data
patterns as possible. Furthermore, we proposed a dynamic data summary extraction method
for dataset search, selecting compact data snippets of appropriate size that are relevant to
the user query [26]. By integrating these methods, one can create a compact, semantically
representative, and query-biased data summary of the original dataset. This enables the use of
PLMs for tasks such as dense dataset retrieval, where the models can process the summarized
data to understand and rank datasets based on their relevance to user queries without being
hindered by size limitations.</p>
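      <p>To make the idea of a size-bounded static summary concrete, the following sketch greedily selects triples that cover the most frequent classes and properties until a budget is reached. It is a simplified illustration of coverage-based selection, not the IlluSnip or PCSG algorithm; the <monospace>rdf:type</monospace> convention, the scoring rule, and all function names are assumptions.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "rdf:type"


def terms_of(triple: Triple) -> Set[str]:
    """Schema-level terms a triple covers: its property, plus its class if typed."""
    _, p, o = triple
    return {p, o} if p == RDF_TYPE else {p}


def summarize(triples: List[Triple], budget: int) -> List[Triple]:
    """Greedily pick up to `budget` triples covering frequent, still-uncovered terms."""
    freq = Counter(t for triple in triples for t in terms_of(triple))
    covered: Set[str] = set()
    remaining = list(triples)
    summary: List[Triple] = []
    while remaining and len(summary) < budget:
        # Gain of a triple = total frequency of the terms it would newly cover.
        best = max(remaining, key=lambda tr: sum(freq[t] for t in terms_of(tr) - covered))
        summary.append(best)
        covered |= terms_of(best)
        remaining.remove(best)
    return summary
```
]]></preformat>
      <p>In practice the budget would be derived from the token limit of the ranking model, and the selected triples would be verbalized into the document representation that the model reads.</p>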
      <p>RQ2. We address the challenge of heterogeneity in dataset content by transforming data
from various formats into a unified representation. The method establishes mapping rules for
structured data, such as graph data, tabular data, and key-value pair data. These rules convert
the heterogeneous data into unified data chunks. Each data chunk is modeled as a set of data
triples, which consist of a subject, a predicate, and an object. This triple-structured format allows
for uniform processing of all datasets, regardless of their original format. Converting different
data formats into unified data chunks creates a consistent input for dense ranking models. This
approach allows for the exploitation of heterogeneous data in dataset ranking, overcoming the
limitations imposed by the diverse formats of open data. The summarized data chunks can then
be used to rank datasets based on their relevance to a given query, thus enhancing the search
accuracy and making the process more efficient. To ensure that the structural information is not
lost during the conversion process, the mapping rules we employed preserve the hierarchical and
relational aspects of the original data. For graph data, we maintain the relationships between
nodes and edges by representing them as triples. For tabular data, we preserve the row-column
structure by mapping rows to subjects and columns to predicates. For key-value pair data, we
maintain the key-value relationships by representing keys as predicates and values as objects.
This approach ensures that the structural integrity of the original data is preserved, which is
beneficial for accurate retrieval. Additionally, we conduct experiments to evaluate the impact
of content on retrieval performance, providing insights into the importance of preserving this
information during the conversion process.</p>
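      <p>The mapping rules for tabular and key-value data can be sketched as follows. The subject naming scheme (row identifiers and slash-separated paths) and the <monospace>item</monospace> predicate for list elements are illustrative assumptions, not the rules used in the cited work.</p>
      <preformat><![CDATA[
```python
import csv
import io
from typing import Any, List, Tuple

Triple = Tuple[str, str, str]


def table_to_triples(csv_text: str, name: str) -> List[Triple]:
    """Each row becomes a subject, each column header a predicate, each cell an object."""
    reader = csv.DictReader(io.StringIO(csv_text))
    triples: List[Triple] = []
    for i, row in enumerate(reader):
        subject = f"{name}/row{i}"  # hypothetical row identifier
        for column, cell in row.items():
            if cell:  # skip empty cells
                triples.append((subject, column, cell))
    return triples


def json_to_triples(value: Any, subject: str) -> List[Triple]:
    """Keys become predicates and values objects; nesting yields child subjects."""
    triples: List[Triple] = []
    if isinstance(value, dict):
        for key, child in value.items():
            if isinstance(child, (dict, list)):
                child_subject = f"{subject}/{key}"
                triples.append((subject, key, child_subject))
                triples.extend(json_to_triples(child, child_subject))
            else:
                triples.append((subject, key, str(child)))
    elif isinstance(value, list):
        for i, item in enumerate(value):
            if isinstance(item, (dict, list)):
                child_subject = f"{subject}/{i}"
                triples.append((subject, "item", child_subject))
                triples.extend(json_to_triples(item, child_subject))
            else:
                triples.append((subject, "item", str(item)))
    return triples
```
]]></preformat>
      <p>Both functions return the same triple type, so downstream summarization and ranking code does not need to know which format a chunk came from. The second function expects an already parsed value, such as the result of <monospace>json.loads</monospace>.</p>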
      <p>
        RQ3. We released a content-based RDF dataset retrieval test collection ACORDAR [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and
subsequently enhanced it to build ACORDAR 2.0 [24]. Constructing this content-based dataset
retrieval test collection began with the collection of RDF datasets from various open data portals,
ensuring a diverse and representative sample. Keyword queries were then formulated, either by
analyzing user needs or through crowd-sourcing, resulting in a set of search terms that reflected
actual information demands. To accommodate the complexity and size of datasets, a dashboard
was developed to assist annotators in browsing and understanding the content of datasets.
This tool was crucial for creating content-oriented queries and making informed relevance
judgments. Annotators used the dashboard to analyze datasets and generate queries that capture
the dataset’s essence. These queries were then used to pool potentially relevant datasets, which
were subsequently annotated for relevance. The pooling process was done using both sparse and
dense retrieval models to ensure a broad coverage of potential matches. Relevance judgments
were made on a graded scale, with annotators assessing the degree to which each dataset met
the query’s requirements. To ensure quality, annotations involved multiple annotators and a
validation process. ACORDAR 2.0 was further enriched by transforming keyword queries into
question-style queries using a large language model (LLM), which increased the diversity of the
queries and simulated more natural information-seeking behavior. Our test collection provides
a benchmark for evaluating content-based dataset retrieval systems.
      </p>
      <p>RQ4. To address the scarcity of large labeled datasets needed for training dense
retrieval models, we proposed a coarse-to-fine tuning strategy [25]. This strategy involved an
initial coarse-tuning phase with weak supervision obtained from a large set of automatically
generated queries and relevance labels. It incorporated two data augmentation methods: distant
supervision and self-training. In the distant supervision method, the title of each dataset served
as a query, and the metadata document was assumed to be relevant to this query, thereby
generating numerous labeled examples. Meanwhile, the self-training method employed
dataset-to-query generators trained on labeled data to generate queries from unlabeled data, further
expanding the training data for dense models.</p>
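      <p>The distant supervision step can be sketched as below. Two details are assumptions added for illustration and are not stated in the text: the title is excluded from the metadata document so that the positive pair is not a trivial match, and one randomly chosen other dataset is used as a negative example for each title.</p>
      <preformat><![CDATA[
```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, str, int]  # (query, document, label)


def metadata_document(dataset: Dict[str, str]) -> str:
    """Concatenate the non-empty metadata fields other than the title."""
    return " ".join(v for k, v in dataset.items() if k != "title" and v)


def distant_supervision(datasets: List[Dict[str, str]], seed: int = 0) -> List[Example]:
    """Use each title as a query: own metadata is positive, another dataset's is negative."""
    rng = random.Random(seed)
    examples: List[Example] = []
    for i, ds in enumerate(datasets):
        title = ds.get("title", "")
        if not title:
            continue  # no title, so no weak query can be formed
        examples.append((title, metadata_document(ds), 1))
        others = [d for j, d in enumerate(datasets) if j != i]
        if others:
            # Assumed negative sampling; the source does not describe negatives.
            examples.append((title, metadata_document(rng.choice(others)), 0))
    return examples
```
]]></preformat>
      <p>The resulting weakly labeled pairs would feed the coarse-tuning phase; queries produced by a dataset-to-query generator in the self-training step could be added to the same list in the same format.</p>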
      <p>This systematic methodology ensures that each research question is addressed with a clear
and structured approach, leading to the validation of our hypotheses and the development of
effective content-based dataset search solutions.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Evaluation</title>
      <p>
        The evaluation plan for this research involves constructing content-based dataset retrieval
test collections [
        <xref ref-type="bibr" rid="ref3">3, 24</xref>
        ] following the methodology outlined in Section 4. Dataset retrieval
and reranking experiments will be conducted on these test collections, as well as on existing
public dataset retrieval test collections [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Performance will be assessed using commonly used
information retrieval metrics such as Recall, Normalized Discounted Cumulative Gain (NDCG),
and Mean Average Precision (MAP). The primary objectives of these experiments are as follows:
        <list list-type="order">
          <list-item>
            <p>To compare the retrieval performance using solely metadata against retrieval using
metadata combined with content.</p>
          </list-item>
          <list-item>
            <p>To assess the performance disparity between dense retrieval models and traditional sparse
retrieval models in the dataset search scenario.</p>
          </list-item>
          <list-item>
            <p>To analyze the impact of various data summarization methods for representing data
content in dataset retrieval.</p>
          </list-item>
          <list-item>
            <p>To investigate the effectiveness of different query types and characteristics in both
metadata-based and content-based retrieval methods.</p>
          </list-item>
        </list>
      </p>
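      <p>For reference, the metrics named above can be computed per query as in the following sketch. It assumes graded relevance judgments stored as a mapping from dataset identifier to grade, treats any grade above zero as relevant for Recall and Average Precision, and uses the linear-gain form of DCG; some evaluations use an exponential gain instead.</p>
      <preformat><![CDATA[
```python
import math
from typing import Dict, List


def recall_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """Fraction of relevant datasets that appear in the top k."""
    relevant = {d for d, g in qrels.items() if g > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranking[:k])) / len(relevant)


def average_precision(ranking: List[str], qrels: Dict[str, int]) -> float:
    """Mean of precision values at the ranks where relevant datasets occur."""
    relevant = {d for d, g in qrels.items() if g > 0}
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for position, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)


def dcg(gains: List[int]) -> float:
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """DCG of the top k, normalised by the DCG of an ideal ordering."""
    gains = [qrels.get(d, 0) for d in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0
```
]]></preformat>
      <p>MAP is the mean of <monospace>average_precision</monospace> over all queries in a test collection; the other two metrics are likewise averaged over queries.</p>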
      <p>Comparison of Metadata-Only vs. Content-Enhanced Retrieval. We will conduct
experiments to compare the retrieval performance of systems that use only metadata against
those that combine metadata with content. This analysis will assess the extent to which content-based
retrieval improves search accuracy and relevance. We will examine performance metrics
across various dataset types and query scenarios to identify specific cases where content-based
retrieval provides significant advantages.</p>
      <p>Performance Disparity Between Dense and Sparse Retrieval Models. We will compare
the performance of dense retrieval models, which use PLMs, with traditional sparse retrieval
models like BM25, which rely on term frequency-based scoring. This evaluation will highlight
the strengths and limitations of dense retrieval models in dataset search. By analyzing their
performance across diverse query types and datasets, we aim to identify scenarios where dense
models excel, particularly in capturing semantic relevance, versus scenarios where sparse
models may be more effective.</p>
      <p>Impact of Data Summarization Methods. The role of data summarization methods in
improving retrieval performance will be analyzed by testing both static techniques, such as
IlluSnip and PCSG, and dynamic methods, which generate query-biased summaries. We will
evaluate how these summarization approaches influence the relevance and efficiency of dataset
retrieval. Additionally, we will explore the trade-offs between summarization quality and
computational cost, providing insights into balancing performance with resource demands.</p>
      <p>Query Type and Characteristic Analysis. A detailed examination of different query types
and their characteristics will be conducted to understand their effectiveness in metadata-based
and content-based retrieval methods. We hypothesize that specific queries requiring detailed
content comprehension or precision may benefit more from content-based retrieval. On the
other hand, more general queries or those that can be effectively addressed with metadata alone
may exhibit similar performance across both approaches. This analysis will help refine retrieval
strategies based on query requirements.</p>
      <p>In addition, given that the eventual deployment of this work is envisioned in real-world
dataset search applications, it is imperative to evaluate the time efficiency and additional
space requirements of modules such as data summarization and dense retrieval models when
processing real-world open datasets.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion and Future Work</title>
      <p>The research proposal on content-based dense retrieval of open datasets is crucial in navigating
the vast landscape of available data resources. By shifting the focus from metadata to the actual
content of datasets, we can enhance search accuracy, ultimately facilitating more informed
decision-making in data discovery. The long-term value of this research lies in its potential to
streamline access to diverse datasets, empowering researchers, businesses, and policymakers
with valuable insights.</p>
      <sec id="sec-7-1">
        <title>6.1. Limitations and Challenges</title>
        <p>Content-based dataset retrieval systems face several limitations and challenges. Time efficiency
is a critical issue, as dense retrieval models and summarization techniques require significant
computational resources, particularly when processing large datasets. Storage requirements are
another concern, as pre-trained language models and their embeddings demand substantial space,
making deployment difficult in resource-constrained environments. The heterogeneity and
complexity of dataset formats further complicate retrieval, as it is challenging to develop unified
solutions that preserve both structural and semantic information. Evaluation is also problematic,
as constructing comprehensive and realistic test collections that reflect real-world scenarios is
complex yet crucial for assessing system performance. Finally, query understanding remains a
persistent challenge, particularly for complex queries that require detailed comprehension of
dataset content to map them effectively to relevant datasets.</p>
      </sec>
      <sec id="sec-7-2">
        <title>6.2. Future Research Directions</title>
        <p>Future research will focus on several directions to overcome these challenges and enhance
dataset retrieval systems. Integrating LLMs into dataset search pipelines offers the potential to
improve both accuracy and efficiency, with planned evaluations to quantify their impact on
performance metrics. Explainable data summarization techniques will be explored to provide
transparent insights into the generation of data summaries and the rationale behind dataset
rankings, fostering trust and usability. Methods for content pattern analysis will be developed
to identify and utilize patterns within dataset content, improving retrieval accuracy. Expanding
the scope to multi-modal retrieval will address the need to handle diverse data types, including
images, videos, and audio, efficiently and at scale. Additionally, real-world deployment of these
systems will be prioritized to evaluate scalability and gather user feedback, guiding further
refinements and optimizations.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The author would like to express his thanks to his supervisor Prof. Gong Cheng for providing
helpful suggestions and comments. This work was supported by the NSFC (62072224).</p>
      <p>[20] R. F. Nogueira, K. Cho, Passage re-ranking with BERT, CoRR abs/1901.04085 (2019). URL: http://arxiv.org/abs/1901.04085. arXiv:1901.04085.</p>
      <p>[21] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, in: Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020, ACM, 2020, pp. 39–48. URL: https://doi.org/10.1145/3397271.3401075. doi:10.1145/3397271.3401075.</p>
      <p>[22] A. Quarati, Open government data: Usage trends and metadata quality, J. Inf. Sci. 49 (2023) 887–910. URL: https://doi.org/10.1177/01655515211027775. doi:10.1177/01655515211027775.</p>
      <p>[23] M. A. Musen, M. J. O’Connor, E. Schultes, M. M. Romero, J. Hardi, J. Graybeal, Modeling community standards for metadata as templates makes data FAIR, CoRR abs/2208.02836 (2022). URL: https://doi.org/10.48550/arXiv.2208.02836. doi:10.48550/ARXIV.2208.02836. arXiv:2208.02836.</p>
      <p>[24] Q. Chen, W. Luo, Z. Huang, T. Lin, X. Wang, A. Soylu, B. Ell, B. Zhou, E. Kharlamov, G. Cheng, ACORDAR 2.0: A test collection for ad hoc dataset retrieval with densely pooled datasets and question-style queries, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024, Washington DC, USA, July 14-18, 2024, ACM, 2024, pp. 303–312. URL: https://doi.org/10.1145/3626772.3657866. doi:10.1145/3626772.3657866.</p>
      <p>[25] Q. Chen, Z. Huang, Z. Zhang, W. Luo, T. Lin, Q. Shi, G. Cheng, Dense re-ranking with weak supervision for RDF dataset search, in: The Semantic Web - ISWC 2023 - 22nd International Semantic Web Conference, Athens, Greece, November 6-10, 2023, Proceedings, Part I, volume 14265 of Lecture Notes in Computer Science, Springer, 2023, pp. 23–40. URL: https://doi.org/10.1007/978-3-031-47240-4_2. doi:10.1007/978-3-031-47240-4_2.</p>
      <p>[26] Q. Chen, J. Chen, X. Zhou, G. Cheng, Enhancing dataset search with compact data snippets, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024, Washington DC, USA, July 14-18, 2024, ACM, 2024, pp. 1093–1103. URL: https://doi.org/10.1145/3626772.3657837. doi:10.1145/3626772.3657837.</p>
      <p>[27] G. Cheng, C. Jin, W. Ding, D. Xu, Y. Qu, Generating illustrative snippets for open data on the web, in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM 2017, Cambridge, United Kingdom, February 6-10, 2017, ACM, 2017, pp. 151–159. URL: https://doi.org/10.1145/3018661.3018670. doi:10.1145/3018661.3018670.</p>
      <p>[28] D. Liu, G. Cheng, Q. Liu, Y. Qu, Fast and practical snippet generation for RDF datasets, ACM Trans. Web 13 (2019) 19:1–19:38. URL: https://doi.org/10.1145/3365575. doi:10.1145/3365575.</p>
      <p>[29] X. Wang, G. Cheng, T. Lin, J. Xu, J. Z. Pan, E. Kharlamov, Y. Qu, PCSG: pattern-coverage snippet generation for RDF datasets, in: The Semantic Web - ISWC 2021 - 20th International Semantic Web Conference, ISWC 2021, Virtual Event, October 24-28, 2021, Proceedings, volume 12922 of Lecture Notes in Computer Science, Springer, 2021, pp. 3–20. URL: https://doi.org/10.1007/978-3-030-88361-4_1. doi:10.1007/978-3-030-88361-4_1.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chapman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Simperl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Koesten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Konstantinidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ibáñez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kacprzak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <article-title>Dataset search: a survey</article-title>
          ,
          <source>VLDB J</source>
          .
          <volume>29</volume>
          (
          <year>2020</year>
          )
          <fpage>251</fpage>
          -
          <lpage>272</lpage>
          . URL: https://doi.org/10.1007/s00778-019-00564-x. doi:10.1007/S00778-019-00564-X.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Brickley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Burgess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Noy</surname>
          </string-name>
          ,
          <article-title>Google dataset search: Building a search engine for datasets in an open web ecosystem</article-title>
          ,
          <source>in: The World Wide Web Conference, WWW</source>
          <year>2019</year>
          , San Francisco, CA, USA, May 13-17,
          <year>2019</year>
          , ACM,
          <year>2019</year>
          , pp.
          <fpage>1365</fpage>
          -
          <lpage>1375</lpage>
          . URL: https://doi.org/10.1145/3308558.3313685. doi:10.1145/3308558.3313685.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soylu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kharlamov</surname>
          </string-name>
          ,
          <article-title>ACORDAR: A test collection for ad hoc content-based (RDF) dataset retrieval</article-title>
          ,
          <source>in: SIGIR '22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Madrid, Spain, July 11-15,
          <year>2022</year>
          , ACM,
          <year>2022</year>
          , pp.
          <fpage>2981</fpage>
          -
          <lpage>2991</lpage>
          . URL: https://doi.org/10.1145/3477495.3531729. doi:10.1145/3477495.3531729.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
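          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kharlamov</surname>
          </string-name>
          ,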
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <article-title>Towards more usable dataset search: From query characterization to snippet generation</article-title>
          ,
          <source>in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019</source>
          , Beijing, China, November 3-7,
          <year>2019</year>
          , ACM,
          <year>2019</year>
          , pp.
          <fpage>2445</fpage>
          -
          <lpage>2448</lpage>
          . URL: https://doi.org/10.1145/3357384.3358096. doi:10.1145/3357384.3358096.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>Dense text retrieval based on pretrained language models: A survey</article-title>
          ,
          <source>CoRR</source>
          abs/2211.14876 (
          <year>2022</year>
          ). URL: https://doi.org/10.48550/arXiv.2211.14876. doi:10.48550/ARXIV.2211.14876. arXiv:2211.14876.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>O.</given-names>
            <surname>Benjelloun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Noy</surname>
          </string-name>
          ,
          <article-title>Google dataset search by the numbers</article-title>
          ,
          <source>in: The Semantic Web - ISWC 2020 - 19th International Semantic Web Conference</source>
          , Athens, Greece, November 2-6,
          <year>2020</year>
          , Proceedings, Part II, volume
          <volume>12507</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2020</year>
          , pp.
          <fpage>667</fpage>
          -
          <lpage>682</lpage>
          . URL: https://doi.org/10.1007/978-3-030-62466-8_41. doi:10.1007/978-3-030-62466-8_41.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N. W.</given-names>
            <surname>Paton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Dataset discovery and exploration: A survey</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>56</volume>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.1145/3626521. doi:10.1145/3626521.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Kato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ohshima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. O.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>A test collection for ad-hoc dataset retrieval</article-title>
          ,
          <source>in: SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Virtual Event, Canada, July 11-15,
          <year>2021</year>
          , ACM,
          <year>2021</year>
          , pp.
          <fpage>2450</fpage>
          -
          <lpage>2456</lpage>
          . URL: https://doi.org/10.1145/3404835.3463261. doi:10.1145/3404835.3463261.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Gururaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pournejati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Alter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. R.</given-names>
            <surname>Hersh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ohno-Machado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>A publicly available benchmark for biomedical dataset retrieval: the reference standard for the 2016 bioCADDIE dataset retrieval challenge</article-title>
          ,
          <source>Database J. Biol. Databases Curation</source>
          <volume>2017</volume>
          (
          <year>2017</year>
          )
          <elocation-id>bax061</elocation-id>
          . URL: https://doi.org/10.1093/database/bax061. doi:10.1093/DATABASE/BAX061.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Löffler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schuldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>König-Ries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bruelheide</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Klan</surname>
          </string-name>
          ,
          <article-title>A test collection for dataset retrieval in biodiversity research</article-title>
          ,
          <source>Res. Ideas Outcomes</source>
          <volume>7</volume>
          (
          <year>2021</year>
          )
          <elocation-id>e67887</elocation-id>
          . doi:10.3897/rio.7.e67887.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Data-driven domain discovery for structured datasets</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>13</volume>
          (
          <year>2020</year>
          )
          <fpage>953</fpage>
          -
          <lpage>965</lpage>
          . URL: http://www.vldb.org/pvldb/vol13/p953-ota.pdf. doi:10.14778/3384345.3384346.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Trabelsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Heflin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <article-title>Table search using a deep contextualized language model</article-title>
          ,
          <source>in: SIGIR</source>
          <year>2020</year>
          , ACM,
          <year>2020</year>
          , pp.
          <fpage>589</fpage>
          -
          <lpage>598</lpage>
          . URL: https://doi.org/10.1145/3397271.3401044. doi:10.1145/3397271.3401044.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Trabelsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Heflin</surname>
          </string-name>
          ,
          <article-title>StruBERT: Structure-aware BERT for table search and matching</article-title>
          ,
          <source>in: WWW</source>
          <year>2022</year>
          , ACM,
          <year>2022</year>
          , pp.
          <fpage>442</fpage>
          -
          <lpage>451</lpage>
          . URL: https://doi.org/10.1145/3485447.3511972. doi:10.1145/3485447.3511972.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Castelo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rampin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S. R.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bessa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chirigati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          ,
          <article-title>Auctus: A dataset search engine for data discovery and augmentation</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>2791</fpage>
          -
          <lpage>2794</lpage>
          . URL: http://www.vldb.org/pvldb/vol14/p2791-castelo.pdf. doi:10.14778/3476311.3476346.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>E.</given-names>
            <surname>Pietriga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gözükan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Appert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Destandau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cebiric</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Goasdoué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Manolescu</surname>
          </string-name>
          ,
          <article-title>Browsing linked data catalogs with LODAtlas</article-title>
          ,
          <source>in: ISWC</source>
          <year>2018</year>
          , volume
          <volume>11137</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2018</year>
          , pp.
          <fpage>137</fpage>
          -
          <lpage>153</lpage>
          . URL: https://doi.org/10.1007/978-3-030-00668-6_9. doi:10.1007/978-3-030-00668-6_9.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Luo</surname>
          </string-name>
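          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <article-title>CKGSE: A prototype search engine for Chinese knowledge graphs</article-title>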
          ,
          <source>Data Intell</source>
          .
          <volume>4</volume>
          (
          <year>2022</year>
          )
          <fpage>41</fpage>
          -
          <lpage>65</lpage>
          . URL: https://doi.org/10.1162/dint_a_00118. doi:10.1162/DINT_A_00118.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Oguz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S. H.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <article-title>Dense passage retrieval for open-domain question answering</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP</source>
          <year>2020</year>
          , Online, November 16-20,
          <year>2020</year>
          , Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>6769</fpage>
          -
          <lpage>6781</lpage>
          . URL: https://doi.org/10.18653/v1/2020.emnlp-main.550. doi:10.18653/V1/2020.EMNLP-MAIN.550.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Humeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <article-title>Poly-encoders: Architectures and pretraining strategies for fast and accurate multi-sentence scoring</article-title>
          ,
          <source>in: 8th International Conference on Learning Representations, ICLR</source>
          <year>2020</year>
          , Addis Ababa, Ethiopia, April 26-30,
          <year>2020</year>
          , OpenReview.net,
          <year>2020</year>
          . URL: https://openreview.net/forum?id=SkxgnnNFvH.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>Condenser: a pre-training architecture for dense retrieval</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP</source>
          <year>2021</year>
          , Virtual Event / Punta Cana, Dominican Republic, 7-11 November,
          <year>2021</year>
          , Association for Computational Linguistics,
          <year>2021</year>
          , pp.
          <fpage>981</fpage>
          -
          <lpage>993</lpage>
          . URL: https://doi.org/10.18653/v1/2021.emnlp-main.75. doi:10.18653/V1/2021.EMNLP-MAIN.75.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>