<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Content-Based Dense Retrieval of Open Datasets</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Qiaosheng</forename><surname>Chen</surname></persName>
							<email>qschen@smail.nju.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="laboratory">State Key Laboratory for Novel Software Technology</orgName>
								<orgName type="institution">Nanjing University</orgName>
								<address>
									<settlement>Nanjing</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Content-Based Dense Retrieval of Open Datasets</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">195A498DD7A3E1C4B913EE7D257BC575</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:57+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Dataset Search</term>
					<term>Dense Retrieval</term>
					<term>Open Data</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The rapid growth of open data has intensified the need for effective dataset search capabilities. This research proposal focuses on enhancing dataset search through content-based dense retrieval, addressing the limitations of current metadata-dependent systems. This research aims to tackle the challenges of dataset size, heterogeneity, and the creation of a comprehensive test collection for evaluation. The proposed research methods include data summarization techniques for large datasets and a unified representation of heterogeneous data, which are inspired by research related to the Semantic Web. Additionally, the research will explore a coarse-to-fine tuning strategy for dense retrieval models, leveraging data augmentation through distant supervision and self-training. The evaluation plan involves constructing a content-based test collection and comparing retrieval performance between metadata-only and content-enhanced approaches. The expected outcome is the development of effective content-based dataset search solutions, ultimately improving data findability.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The availability and significance of open data have led to a surge in interest and reliance on dataset search within the field of information retrieval <ref type="bibr" target="#b0">[1]</ref>. However, existing approaches and systems, represented by Google Dataset Search <ref type="bibr" target="#b1">[2]</ref>, predominantly rely on metadata (descriptive text for a dataset, such as its title and description), which often suffers from low quality and limited availability. These metadata-based approaches therefore fall short in accurately capturing the relevance of datasets. Closing the gap between users' real data needs and the quality of dataset metadata necessitates a shift towards content-based approaches that can effectively harness the richness of dataset content <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>.</p><p>On the other hand, dense retrieval models, which have become mainstream in the field of document retrieval <ref type="bibr" target="#b4">[5]</ref>, have not yet been fully explored in the field of dataset search. In particular, applying dense retrieval models to content-based dataset search still faces many challenges. First, the large size of dataset content poses computational challenges, especially when it exceeds the processing capacity of standard dense retrieval models, which are mainly based on pre-trained language models (PLMs). Additionally, the heterogeneity of dataset content, spanning various data formats and domains <ref type="bibr" target="#b5">[6]</ref>, further complicates the development of unified content-based search solutions.</p><p>The proposed research aims to contribute towards the development of robust and effective content-based solutions for dataset search, ultimately improving the findability and reusability of open datasets. 
Users across various domains, including researchers, data scientists, policymakers, and businesses, will benefit from content-based dataset search, while professional researchers in fields such as information retrieval, natural language processing (NLP), and machine learning are particularly invested in its advancement. Industries relying heavily on data-driven decision-making, such as healthcare, finance, agriculture, and environmental science, also have a stake in its development. Beyond the domain of information retrieval, this research involves technologies relevant to the Semantic Web and Knowledge Graph (KG). RDF datasets represent a significant part of open data. Moreover, employing ontologies or KGs as a framework can aid in analyzing the content of open datasets and processing heterogeneous data from a unified perspective. The advancement of dataset search also stands to catalyze the realization of findable, accessible, interoperable, and reusable (FAIR) open data within the Semantic Web community.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>In this section, we review recent advancements in dataset search and dense retrieval, highlighting limitations of current dataset search methods and examining strengths of dense retrieval techniques.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Dataset Search</head><p>Dataset search has garnered increasing attention with the proliferation of diverse and voluminous datasets, prompting the development of search approaches and systems <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b6">7]</ref>. Notably, Google Dataset Search <ref type="bibr" target="#b1">[2]</ref> has paved the way as a pioneering dataset search engine, enabling keyword retrieval over published metadata of Web datasets. However, its reliance on metadata limits its effectiveness in supporting queries oriented towards dataset content. Moreover, existing dataset retrieval test collections <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10]</ref> primarily depend on metadata annotations during construction, resulting in a lack of evaluation benchmarks for content-based dataset search.</p><p>Recent studies have highlighted the importance of integrating considerations for dataset content to enhance search effectiveness. Ota et al. <ref type="bibr" target="#b10">[11]</ref> utilized value co-occurrence information within tabular datasets to infer attribute domains, while Chen et al. <ref type="bibr" target="#b11">[12]</ref> proposed a BERT-based ranking model for table retrieval, focusing on selecting the most salient table items as representatives of the entire dataset. StruBERT <ref type="bibr" target="#b12">[13]</ref> introduced a structure-aware BERT model to capture both structural and textual information of tabular datasets. Moreover, existing tabular dataset or RDF dataset search systems such as Auctus <ref type="bibr" target="#b13">[14]</ref>, LODAtlas <ref type="bibr" target="#b14">[15]</ref>, and CKGSE <ref type="bibr" target="#b15">[16]</ref> leverage dataset content to augment retrieval capabilities and enhance user search experiences. 
However, these efforts primarily focus on single-format data, such as tabular or RDF data, overlooking the challenges posed by multi-format datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Dense Retrieval</head><p>Recent advancements in text retrieval have been significantly influenced by the incorporation of PLMs, which have demonstrated remarkable capabilities in capturing semantic nuances within text <ref type="bibr" target="#b4">[5]</ref>. This approach, often referred to as dense retrieval, leverages the dense vector representations (embeddings) of text to facilitate semantic matching between queries and documents. Notably, Karpukhin et al. <ref type="bibr" target="#b16">[17]</ref> presented dense passage retrieval (DPR) for open-domain question answering, highlighting the effectiveness of PLMs in this context. Their work has been seminal in shaping subsequent research. The concept of using multiple representations for improved text encoding has been explored by Humeau et al. <ref type="bibr" target="#b17">[18]</ref> through their poly-encoder architectures, which allow for richer semantic interactions between queries and texts. The challenge of training efficient and robust dense retrievers has been addressed by various works. For instance, Gao and Callan <ref type="bibr" target="#b18">[19]</ref> introduced Condenser, a pre-training architecture specifically designed to improve dense retrieval. Nogueira et al. <ref type="bibr" target="#b19">[20]</ref> demonstrated the effectiveness of multi-stage document ranking using BERT, showcasing how PLMs can be effectively integrated into the reranking stage. Furthermore, ColBERT <ref type="bibr" target="#b20">[21]</ref> has provided insights into efficient and effective passage search through contextualized late interaction over BERT. Most current dense retrieval methods focus on the retrieval of text documents or passages, whereas the structured content of datasets requires new dense model structures or retrieval strategies. 
The large data size and complex heterogeneity also make it difficult to directly treat the dataset content as plain text.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Problem Statement</head><p>In this section, we discuss the typical composition of datasets, outline the problem of content-based dataset search, and formulate hypotheses and research questions derived from our investigation.</p><p>To clarify the distinction between dataset search and general document search, we first introduce the composition of a dataset, which consists of the following two parts:</p><p>• Metadata: This part includes descriptive fields provided by the dataset publisher, such as title, description, publishing organization, and other useful information about the dataset. • Data Files: A dataset consists of various data files, potentially in different formats. This research only considers textual data files, including unstructured TXT, PDF, and DOC files, as well as structured files such as graph data (RDF, OWL), tabular data (CSV, XLS), and key-value pair data (JSON, XML). Images (JPEG, PNG), videos (AVI, MPG), audio (WAV, MP3), and other non-textual formats are excluded from the scope of this research.</p><p>The research focuses on ad hoc dataset retrieval <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b2">3]</ref>, the foundational form of dataset search. This process involves retrieving, from a collection 𝐷 of datasets, a ranked list of datasets ⟨𝑑₁, 𝑑₂, …⟩ that are most relevant to a keyword query 𝑞. The relevance assessment between query 𝑞 and each dataset 𝑑 ∈ 𝐷 is conducted independently of other datasets 𝑑′ ∈ 𝐷, where 𝑑 ≠ 𝑑′. The primary objective is to compute the relevance score of each dataset 𝑑 ∈ 𝐷 to a given keyword query 𝑞. The prevalent dense retrieval paradigm typically employs a PLM as an encoder 𝐸(⋅) to encode a dataset 𝑑 and a query 𝑞 into vectors v_𝑑 and v_𝑞, respectively. 
Subsequently, it computes the similarity score between these vectors to gauge the relevance of 𝑑 to 𝑞.</p><p>According to studies on metadata quality <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b22">23]</ref>, the metadata of open datasets on the Web often lacks guaranteed quality and is underutilized by both publishers and users. Meanwhile, dense retrieval models based on PLMs have exhibited increasingly powerful text understanding capabilities with advancements in NLP <ref type="bibr" target="#b4">[5]</ref>. Hence, the application of dense retrieval models in dataset search becomes imperative. Based on these findings, we propose the following main hypothesis and research question:</p><p>Hypothesis. Dataset metadata quality often varies and may not fully describe the content. Users frequently seek information from the actual data files, and content-focused queries may not align well with the available metadata.</p></div>
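<div xmlns="http://www.tei-c.org/ns/1.0"><p>The scoring scheme described in this section can be illustrated with a minimal sketch. It is not the proposed model: the function encode below is a toy hashed bag-of-words stand-in for the PLM encoder 𝐸(⋅), and the names DIM, encode, score, and rank, as well as the two example datasets, are assumptions made only for illustration. The sketch shows how each dataset is scored against the query independently of the others and how the ranked list is obtained by sorting on that score.</p>

```python
import math
import re
from collections import Counter

DIM = 64  # hypothetical embedding size for this toy encoder


def encode(text, dim=DIM):
    """Toy stand-in for the encoder E(.): hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token, count in Counter(re.findall(r"[a-z0-9]+", text.lower())).items():
        # Deterministic bucket: sum of character codes modulo dim.
        vec[sum(ord(c) for c in token) % dim] += count
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def score(v_q, v_d):
    """Dot product of unit vectors, i.e. cosine similarity."""
    return sum(a * b for a, b in zip(v_q, v_d))


def rank(query, datasets):
    """Score each dataset independently of the others, then sort by score."""
    v_q = encode(query)
    scored = [(name, score(v_q, encode(text))) for name, text in datasets.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical dataset texts (e.g. metadata plus a content summary).
datasets = {
    "air-quality": "city air quality pm25 ozone daily measurements",
    "bus-routes": "bus routes stops timetable city transport",
}
ranking = rank("air quality ozone", datasets)
print(ranking[0][0])
```

<p>In a real system the toy encoder would be replaced by a fine-tuned PLM, and the dataset vectors would typically be pre-computed and indexed rather than encoded at query time.</p></div>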
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RQ0.</head><p>To what extent can content-based dense dataset retrieval methods outperform traditional metadata-centered approaches?</p><p>Building upon the hypothesis and RQ0, this research investigates the application of dense models to content-based dataset retrieval. Nonetheless, representing and indexing complex dataset content with PLM-based dense models presents substantial challenges. To address these challenges, we decompose RQ0 into the following four specific research questions:</p><p>RQ1. How to overcome the challenge presented by the extensive size of dataset content, especially when it exceeds the processing capacity of dense retrieval models?</p><p>RQ2. How to address the heterogeneity of dataset content, encompassing variations in formats?</p><p>RQ3. How to develop a dataset retrieval test collection that considers the content of datasets, rather than one whose annotations rely solely on metadata?</p><p>RQ4. How to enhance the size and quality of existing public dataset retrieval test collections, particularly in terms of providing sufficient training data for dense retrieval models?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Research Methods</head><p>In this section, we provide a detailed and systematic research methodology that outlines how we address each research question (RQ1-RQ4) and validate our hypotheses <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b23">24,</ref><ref type="bibr" target="#b24">25,</ref><ref type="bibr" target="#b25">26]</ref>. The methodology is structured to ensure a comprehensive and coherent approach to solving the challenges of content-based dense retrieval of open datasets.</p><p>RQ1. To overcome the challenge posed by the large size of dataset content that exceeds the input capacity of PLMs, this research proposed extracting a data summary for each dataset. Starting with RDF datasets, we introduced a technique to handle large RDF datasets by extracting a compact, representative subset of RDF triples <ref type="bibr" target="#b24">[25]</ref>. This subset was selected to preserve the semantic integrity of the dataset and was used to create a document representation that fits within the token limit of dense ranking models. We employed two existing static RDF dataset summarization methods: IlluSnip <ref type="bibr" target="#b26">[27,</ref><ref type="bibr" target="#b27">28]</ref>, which selects top-ranked RDF triples covering the most frequent classes, properties, and entities, and PCSG <ref type="bibr" target="#b28">[29]</ref>, which extracts a connected subgraph from an RDF graph covering as many data patterns as possible. Furthermore, we proposed a dynamic data summary extraction method for dataset search, selecting compact data snippets of appropriate size that are relevant to the user query <ref type="bibr" target="#b25">[26]</ref>. By integrating these methods, one can create a compact, semantically representative, and query-biased data summary of the original dataset. 
This enables the use of PLMs for tasks such as dense dataset retrieval, where the models can process the summarized data to understand and rank datasets based on their relevance to user queries without being hindered by size limitations.</p><p>RQ2. We address the challenge of heterogeneity in dataset content by transforming data from various formats into a unified representation. The method establishes mapping rules for structured data, such as graph data, tabular data, and key-value pair data. These rules convert the heterogeneous data into unified data chunks. Each data chunk is modeled as a set of data triples, which consist of a subject, a predicate, and an object. This triple-structured format allows for uniform processing of all datasets, regardless of their original format. Converting different data formats into unified data chunks creates a consistent input for dense ranking models. This approach allows for the exploitation of heterogeneous data in dataset ranking, overcoming the limitations imposed by the diverse formats of open data. The summarized data chunks can then be used to rank datasets based on their relevance to a given query, thus enhancing the search accuracy and making the process more efficient. To ensure that the structural information is not lost during the conversion process, the mapping rules we employed preserve the hierarchical and relational aspects of the original data. For graph data, we maintain the relationships between nodes and edges by representing them as triples. For tabular data, we preserve the row-column structure by mapping rows to subjects and columns to predicates. For key-value pair data, we maintain the key-value relationships by representing keys as predicates and values as objects. This approach ensures that the structural integrity of the original data is preserved, which is beneficial for accurate retrieval. 
Additionally, we conduct experiments to evaluate the impact of content on retrieval performance, providing insights into the importance of preserving this information during the conversion process.</p><p>RQ3. We released a content-based RDF dataset retrieval test collection ACORDAR <ref type="bibr" target="#b2">[3]</ref>, and subsequently enhanced it to build ACORDAR 2.0 <ref type="bibr" target="#b23">[24]</ref>. Constructing this content-based dataset retrieval test collection began with the collection of RDF datasets from various open data portals, ensuring a diverse and representative sample. Keyword queries were then formulated, either by analyzing user needs or through crowd-sourcing, resulting in a set of search terms that reflected actual information demands. To accommodate the complexity and size of datasets, a dashboard was developed to assist annotators in browsing and understanding the content of datasets. This tool was crucial for creating content-oriented queries and making informed relevance judgments. Annotators used the dashboard to analyze datasets and generate queries that capture the dataset's essence. These queries were then used to pool potentially relevant datasets, which were subsequently annotated for relevance. The pooling process was done using both sparse and dense retrieval models to ensure a broad coverage of potential matches. Relevance judgments were made on a graded scale, with annotators assessing the degree to which each dataset met the query's requirements. To ensure quality, annotations involved multiple annotators and a validation process. ACORDAR 2.0 was further enriched by transforming keyword queries into question-style queries using a large language model (LLM), which increased the diversity of the queries and simulated more natural information-seeking behavior. Our test collection provides a benchmark for evaluating content-based dataset retrieval systems.</p><p>RQ4. 
To address the scarcity of large labeled datasets needed for training dense retrieval models, we proposed a coarse-to-fine tuning strategy <ref type="bibr" target="#b24">[25]</ref>. This strategy involved an initial coarse-tuning phase with weak supervision obtained from a large set of automatically generated queries and relevance labels. It incorporated two data augmentation methods: distant supervision and self-training. In the distant supervision method, the title of each dataset served as a query, and the metadata document was assumed to be relevant to this query, thereby generating numerous labeled examples. Meanwhile, the self-training method employed dataset-to-query generators trained on labeled data to generate queries from unlabeled data, further expanding the training data for dense models.</p><p>This systematic methodology ensures that each research question is addressed with a clear and structured approach, leading to the validation of our hypotheses and the development of effective content-based dataset search solutions.</p></div>
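<div xmlns="http://www.tei-c.org/ns/1.0"><p>The mapping rules described under RQ2 can be sketched as follows. This is an illustrative simplification rather than the actual rule set: the function names table_to_triples, json_to_triples, and linearize, the subject identifiers such as table/row0, and the predicate "item" used for list elements are all assumptions introduced here. The sketch maps table rows to subjects and columns to predicates, maps keys to predicates and values to objects, and then flattens the resulting triples into a single text string.</p>

```python
import csv
import io
import json


def table_to_triples(csv_text, table_id="table"):
    """Rows become subjects, column headers predicates, cell values objects."""
    triples = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        subject = f"{table_id}/row{i}"
        for column, value in row.items():
            if value:  # skip empty or missing cells
                triples.append((subject, column, value))
    return triples


def json_to_triples(obj, subject="root"):
    """Keys become predicates and values objects; obj must be a dict or list.

    Nested containers get a new subject so that the hierarchy is kept.
    """
    if isinstance(obj, dict):
        pairs = [(key, value, f"{subject}/{key}") for key, value in obj.items()]
    else:
        # List elements share the hypothetical predicate "item".
        pairs = [("item", value, f"{subject}/{i}") for i, value in enumerate(obj)]
    triples = []
    for predicate, value, child in pairs:
        if isinstance(value, (dict, list)):
            triples.append((subject, predicate, child))
            triples.extend(json_to_triples(value, child))
        else:
            triples.append((subject, predicate, str(value)))
    return triples


def linearize(triples):
    """Flatten a chunk of triples into one string a text encoder could read."""
    return " ; ".join(" ".join(triple) for triple in triples)


csv_text = "station,pm25\nA1,12\nB2,\n"
table_chunk = table_to_triples(csv_text)

record = json.loads('{"name": "air", "tags": ["open", "city"], "owner": {"org": "NJU"}}')
json_chunk = json_to_triples(record)

print(linearize(table_chunk))
print(linearize(json_chunk))
```

<p>RDF data already consists of triples, so under this scheme it would pass through unchanged, while tabular and key-value data converge on the same chunk format.</p></div>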
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Evaluation</head><p>The evaluation plan for this research involves constructing content-based dataset retrieval test collections <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b23">24]</ref> following the methodology outlined in Section 4. Dataset retrieval and reranking experiments will be conducted on these test collections, as well as on existing public dataset retrieval test collections <ref type="bibr" target="#b7">[8]</ref>. Performance will be assessed using commonly used information retrieval metrics such as Recall, Normalized Discounted Cumulative Gain (NDCG), and Mean Average Precision (MAP). The primary objectives of these experiments are as follows:</p><p>1. To compare the retrieval performance using solely metadata against retrieval using metadata combined with content. 2. To assess the performance disparity between dense retrieval models and traditional sparse retrieval models in the dataset search scenario. 3. To analyze the impact of various data summarization methods for representing data content in dataset retrieval. 4. To investigate the effectiveness of different query types and characteristics in both metadata-based and content-based retrieval methods.</p><p>Comparison of Metadata-Only vs. Content-Enhanced Retrieval. We will conduct experiments to compare the retrieval performance of systems that use only metadata against those that combine metadata with content. This analysis will assess the extent to which content-based retrieval improves search accuracy and relevance. We will examine performance metrics across various dataset types and query scenarios to identify specific cases where content-based retrieval provides significant advantages.</p><p>Performance Disparity Between Dense and Sparse Retrieval Models. 
We will compare the performance of dense retrieval models, which use PLMs, with traditional sparse retrieval models like BM25, which rely on term frequency-based scoring. This evaluation will highlight the strengths and limitations of dense retrieval models in dataset search. By analyzing their performance across diverse query types and datasets, we aim to identify scenarios where dense models excel, particularly in capturing semantic relevance, versus scenarios where sparse models may be more effective.</p><p>Impact of Data Summarization Methods. The role of data summarization methods in improving retrieval performance will be analyzed by testing both static techniques, such as IlluSnip and PCSG, and dynamic methods, which generate query-biased summaries. We will evaluate how these summarization approaches influence the relevance and efficiency of dataset retrieval. Additionally, we will explore the trade-offs between summarization quality and computational cost, providing insights into balancing performance with resource demands.</p><p>Query Type and Characteristic Analysis. A detailed examination of different query types and their characteristics will be conducted to understand their effectiveness in metadata-based and content-based retrieval methods. We hypothesize that specific queries requiring detailed content comprehension or precision may benefit more from content-based retrieval. On the other hand, more general queries or those that can be effectively addressed with metadata alone may exhibit similar performance across both approaches. This analysis will help refine retrieval strategies based on query requirements.</p><p>In addition, given that the eventual deployment of this work is envisioned in real-world dataset search applications, it is imperative to evaluate the time efficiency and additional space requirements of modules such as data summarization and dense retrieval models when processing real-world open datasets.</p></div>
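<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of how two of the metrics named in this section could be computed for a single query, the following sketch implements NDCG@k and Recall@k over graded relevance judgments. It assumes the linear-gain form of DCG with a log2(rank + 1) discount, which is one common variant; the judgments and the two rankings (metadata_only and with_content) are invented toy data, not experimental results.</p>

```python
import math


def dcg(gains):
    """Discounted cumulative gain with the log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids, judgments, k):
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    gains = [judgments.get(d, 0) for d in ranked_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg else 0.0


def recall_at_k(ranked_ids, judgments, k):
    """Fraction of relevant datasets (grade above zero) found in the top k."""
    relevant = {d for d, grade in judgments.items() if grade}
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked_ids[:k])) / len(relevant)


# Toy graded judgments for one query (2 = highly relevant, 0 = irrelevant).
judgments = {"d1": 2, "d2": 1, "d3": 0}
metadata_only = ["d3", "d2", "d1"]  # hypothetical ranking from metadata alone
with_content = ["d1", "d2", "d3"]   # hypothetical ranking with content added

for name, run in [("metadata-only", metadata_only), ("with-content", with_content)]:
    print(name, round(ndcg_at_k(run, judgments, 3), 4), recall_at_k(run, judgments, 1))
```

<p>In practice these per-query values would be averaged over all queries in a test collection, and an established evaluation tool would normally be preferred over hand-written metric code.</p></div>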
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion and Future Work</head><p>The research proposal on content-based dense retrieval of open datasets is crucial in navigating the vast landscape of available data resources. By shifting the focus from metadata to the actual content of datasets, we can enhance search accuracy, ultimately facilitating more informed decision-making in data discovery. The long-term value of this research lies in its potential to streamline access to diverse datasets, empowering researchers, businesses, and policymakers with valuable insights.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Limitations and Challenges</head><p>Content-based dataset retrieval systems face several limitations and challenges. Time efficiency is a critical issue, as dense retrieval models and summarization techniques require significant computational resources, particularly when processing large datasets. Storage requirements are another concern, as pre-trained language models and their embeddings demand substantial space, making deployment difficult in resource-constrained environments. The heterogeneity and complexity of dataset formats further complicate retrieval, as it is challenging to develop unified solutions that preserve both structural and semantic information. Evaluation is also problematic, as constructing comprehensive and realistic test collections that reflect real-world scenarios is complex yet crucial for assessing system performance. Finally, query understanding remains a persistent challenge, particularly for complex queries that require detailed comprehension of dataset content to map them effectively to relevant datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Future Research Directions</head><p>Future research will focus on several directions to overcome these challenges and enhance dataset retrieval systems. Integrating LLMs into dataset search pipelines offers the potential to improve both accuracy and efficiency, with planned evaluations to quantify their impact on performance metrics. Explainable data summarization techniques will be explored to provide transparent insights into the generation of data summaries and the rationale behind dataset rankings, fostering trust and usability. Methods for content pattern analysis will be developed to identify and utilize patterns within dataset content, improving retrieval accuracy. Expanding the scope to multi-modal retrieval will address the need to handle diverse data types, including images, videos, and audio, efficiently and at scale. Additionally, real-world deployment of these systems will be prioritized to evaluate scalability and gather user feedback, guiding further refinements and optimizations.</p></div>		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The author would like to express his thanks to his supervisor Prof. Gong Cheng for providing helpful suggestions and comments. This work was supported by the NSFC (62072224).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Dataset search: a survey</title>
		<author>
			<persName><forename type="first">A</forename><surname>Chapman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Simperl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Koesten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Konstantinidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ibáñez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kacprzak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Groth</surname></persName>
		</author>
		<idno type="DOI">10.1007/S00778-019-00564-X</idno>
		<ptr target="https://doi.org/10.1007/s00778-019-00564-x" />
	</analytic>
	<monogr>
		<title level="j">VLDB J</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="251" to="272" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Google dataset search: Building a search engine for datasets in an open web ecosystem</title>
		<author>
			<persName><forename type="first">D</forename><surname>Brickley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Burgess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">F</forename><surname>Noy</surname></persName>
		</author>
		<idno type="DOI">10.1145/3308558.3313685</idno>
		<ptr target="https://doi.org/10.1145/3308558.3313685" />
	</analytic>
	<monogr>
		<title level="m">The World Wide Web Conference, WWW 2019</title>
				<meeting><address><addrLine>San Francisco, CA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2019">May 13-17, 2019</date>
			<biblScope unit="page" from="1365" to="1375" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">ACORDAR: A test collection for ad hoc content-based (RDF) dataset retrieval</title>
		<author>
			<persName><forename type="first">T</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Soylu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kharlamov</surname></persName>
		</author>
		<idno type="DOI">10.1145/3477495.3531729</idno>
		<ptr target="https://doi.org/10.1145/3477495.3531729" />
	</analytic>
	<monogr>
		<title level="m">SIGIR &apos;22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval</title>
				<meeting><address><addrLine>Madrid, Spain</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2022">July 11-15, 2022</date>
			<biblScope unit="page" from="2981" to="2991" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Towards more usable dataset search: From query characterization to snippet generation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kharlamov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qu</surname></persName>
		</author>
		<idno type="DOI">10.1145/3357384.3358096</idno>
		<ptr target="https://doi.org/10.1145/3357384.3358096" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019</title>
				<meeting>the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019<address><addrLine>Beijing, China</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2019">November 3-7, 2019</date>
			<biblScope unit="page" from="2445" to="2448" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Dense text retrieval based on pretrained language models: A survey</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wen</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2211.14876</idno>
		<idno type="arXiv">arXiv:2211.14876</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2211.14876" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Google dataset search by the numbers</title>
		<author>
			<persName><forename type="first">O</forename><surname>Benjelloun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">F</forename><surname>Noy</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-62466-8_41</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-62466-8_41" />
	</analytic>
	<monogr>
		<title level="m">The Semantic Web - ISWC 2020 - 19th International Semantic Web Conference</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<meeting><address><addrLine>Athens, Greece</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2020">November 2-6, 2020</date>
			<biblScope unit="volume">12507</biblScope>
			<biblScope unit="page" from="667" to="682" />
		</imprint>
	</monogr>
	<note>Proceedings, Part II</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Dataset discovery and exploration: A survey</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">W</forename><surname>Paton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<idno type="DOI">10.1145/3626521</idno>
		<ptr target="https://doi.org/10.1145/3626521" />
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">56</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A test collection for ad-hoc dataset retrieval</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">P</forename><surname>Kato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ohshima</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">O</forename><surname>Chen</surname></persName>
		</author>
		<idno type="DOI">10.1145/3404835.3463261</idno>
		<idno>doi:10.1145/3404835.3463261</idno>
		<ptr target="https://doi.org/10.1145/3404835.3463261" />
	</analytic>
	<monogr>
		<title level="m">SIGIR &apos;21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event</title>
				<meeting><address><addrLine>, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2021">July 11-15, 2021</date>
			<biblScope unit="page" from="2450" to="2456" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A publicly available benchmark for biomedical dataset retrieval: the reference standard for the 2016 bioCADDIE dataset retrieval challenge</title>
		<author>
			<persName><forename type="first">T</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">E</forename><surname>Gururaj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pournejati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Alter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">R</forename><surname>Hersh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Demner-Fushman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ohno-Machado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xu</surname></persName>
		</author>
		<idno type="DOI">10.1093/DATABASE/BAX061</idno>
		<ptr target="https://doi.org/10.1093/database/bax061" />
	</analytic>
	<monogr>
		<title level="j">Database J. Biol. Databases Curation</title>
		<imprint>
			<biblScope unit="volume">2017</biblScope>
			<biblScope unit="page">61</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A test collection for dataset retrieval in biodiversity research</title>
		<author>
			<persName><forename type="first">F</forename><surname>Löffler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Schuldt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>König-Ries</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Bruelheide</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Klan</surname></persName>
		</author>
		<idno type="DOI">10.3897/rio.7.e67887</idno>
	</analytic>
	<monogr>
		<title level="j">Res. Ideas Outcomes</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page">e67887</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Data-driven domain discovery for structured datasets</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ota</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Mueller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Freire</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Srivastava</surname></persName>
		</author>
		<idno type="DOI">10.14778/3384345.3384346</idno>
		<ptr target="http://www.vldb.org/pvldb/vol13/p953-ota.pdf" />
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="953" to="965" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Table search using a deep contextualized language model</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Trabelsi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Heflin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">D</forename><surname>Davison</surname></persName>
		</author>
		<idno type="DOI">10.1145/3397271.3401044</idno>
		<ptr target="https://doi.org/10.1145/3397271.3401044" />
	</analytic>
	<monogr>
		<title level="m">SIGIR 2020</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="589" to="598" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">StruBERT: Structure-aware BERT for table search and matching</title>
		<author>
			<persName><forename type="first">M</forename><surname>Trabelsi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">D</forename><surname>Davison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Heflin</surname></persName>
		</author>
		<idno type="DOI">10.1145/3485447.3511972</idno>
		<ptr target="https://doi.org/10.1145/3485447.3511972" />
	</analytic>
	<monogr>
		<title level="m">WWW 2022</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="442" to="451" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Auctus: A dataset search engine for data discovery and augmentation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Castelo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rampin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S R</forename><surname>Santos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bessa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Chirigati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Freire</surname></persName>
		</author>
		<idno type="DOI">10.14778/3476311.3476346</idno>
		<ptr target="http://www.vldb.org/pvldb/vol14/p2791-castelo.pdf" />
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow</title>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="2791" to="2794" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Browsing linked data catalogs with LODAtlas</title>
		<author>
			<persName><forename type="first">E</forename><surname>Pietriga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Gözükan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Appert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Destandau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cebiric</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Goasdoué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Manolescu</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-00668-6_9</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-00668-6_9" />
	</analytic>
	<monogr>
		<title level="m">ISWC 2018</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">11137</biblScope>
			<biblScope unit="page" from="137" to="153" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">CKGSE: A prototype search engine for Chinese knowledge graphs</title>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qu</surname></persName>
		</author>
		<idno type="DOI">10.1162/DINT_A_00118</idno>
		<ptr target="https://doi.org/10.1162/dint_a_00118" />
	</analytic>
	<monogr>
		<title level="j">Data Intell</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="41" to="65" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Dense passage retrieval for open-domain question answering</title>
		<author>
			<persName><forename type="first">V</forename><surname>Karpukhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Oguz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S H</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Edunov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yih</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/2020.EMNLP-MAIN.550</idno>
		<ptr target="https://doi.org/10.18653/v1/2020.emnlp-main.550" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020</title>
				<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2020">November 16-20, 2020</date>
			<biblScope unit="page" from="6769" to="6781" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Poly-encoders: Architectures and pretraining strategies for fast and accurate multi-sentence scoring</title>
		<author>
			<persName><forename type="first">S</forename><surname>Humeau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=SkxgnnNFvH" />
	</analytic>
	<monogr>
		<title level="m">8th International Conference on Learning Representations, ICLR 2020</title>
				<meeting><address><addrLine>Addis Ababa, Ethiopia</addrLine></address></meeting>
		<imprint>
			<publisher>OpenReview</publisher>
			<date type="published" when="2020">April 26-30, 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Condenser: a pre-training architecture for dense retrieval</title>
		<author>
			<persName><forename type="first">L</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Callan</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/2021.EMNLP-MAIN.75</idno>
		<ptr target="https://doi.org/10.18653/v1/2021.emnlp-main.75" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021</title>
				<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021<address><addrLine>Virtual Event / Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2021-11-11">7-11 November, 2021</date>
			<biblScope unit="page" from="981" to="993" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">F</forename><surname>Nogueira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1901.04085</idno>
		<ptr target="http://arxiv.org/abs/1901.04085" />
		<title level="m" type="main">Passage re-ranking with BERT</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">ColBERT: Efficient and effective passage search via contextualized late interaction over BERT</title>
		<author>
			<persName><forename type="first">O</forename><surname>Khattab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zaharia</surname></persName>
		</author>
		<idno type="DOI">10.1145/3397271.3401075</idno>
		<ptr target="https://doi.org/10.1145/3397271.3401075" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020</title>
				<meeting>the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020<address><addrLine>Virtual Event, China</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2020">July 25-30, 2020</date>
			<biblScope unit="page" from="39" to="48" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Open government data: Usage trends and metadata quality</title>
		<author>
			<persName><forename type="first">A</forename><surname>Quarati</surname></persName>
		</author>
		<idno type="DOI">10.1177/01655515211027775</idno>
		<ptr target="https://doi.org/10.1177/01655515211027775" />
	</analytic>
	<monogr>
		<title level="j">J. Inf. Sci</title>
		<imprint>
			<biblScope unit="volume">49</biblScope>
			<biblScope unit="page" from="887" to="910" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">Modeling community standards for metadata as templates makes data FAIR</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Musen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>O'Connor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Schultes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Romero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2208.02836</idno>
		<idno type="arXiv">arXiv:2208.02836</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2208.02836" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">ACORDAR 2.0: A test collection for ad hoc dataset retrieval with densely pooled datasets and question-style queries</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Soylu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kharlamov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cheng</surname></persName>
		</author>
		<idno type="DOI">10.1145/3626772.3657866</idno>
		<ptr target="https://doi.org/10.1145/3626772.3657866" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024</title>
				<meeting>the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024<address><addrLine>Washington DC, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2024">July 14-18, 2024</date>
			<biblScope unit="page" from="303" to="312" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Dense re-ranking with weak supervision for RDF dataset search</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cheng</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-47240-4_2</idno>
		<ptr target="https://doi.org/10.1007/978-3-031-47240-4_2" />
	</analytic>
	<monogr>
		<title level="m">The Semantic Web - ISWC 2023 - 22nd International Semantic Web Conference</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<meeting><address><addrLine>Athens, Greece</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2023">November 6-10, 2023</date>
			<biblScope unit="volume">14265</biblScope>
			<biblScope unit="page" from="23" to="40" />
		</imprint>
	</monogr>
	<note>Proceedings, Part I</note>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Enhancing dataset search with compact data snippets</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cheng</surname></persName>
		</author>
		<idno type="DOI">10.1145/3626772.3657837</idno>
		<ptr target="https://doi.org/10.1145/3626772.3657837" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024</title>
				<meeting>the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024<address><addrLine>Washington DC, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2024">July 14-18, 2024</date>
			<biblScope unit="page" from="1093" to="1103" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Generating illustrative snippets for open data on the web</title>
		<author>
			<persName><forename type="first">G</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qu</surname></persName>
		</author>
		<idno type="DOI">10.1145/3018661.3018670</idno>
		<ptr target="https://doi.org/10.1145/3018661.3018670" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM 2017</title>
				<meeting>the Tenth ACM International Conference on Web Search and Data Mining, WSDM 2017<address><addrLine>Cambridge, United Kingdom</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2017">February 6-10, 2017</date>
			<biblScope unit="page" from="151" to="159" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Fast and practical snippet generation for RDF datasets</title>
		<author>
			<persName><forename type="first">D</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qu</surname></persName>
		</author>
		<idno type="DOI">10.1145/3365575</idno>
		<ptr target="https://doi.org/10.1145/3365575" />
	</analytic>
	<monogr>
		<title level="j">ACM Trans. Web</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page">38</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">PCSG: pattern-coverage snippet generation for RDF datasets</title>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">Z</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kharlamov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qu</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-88361-4_1</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-88361-4_1" />
	</analytic>
	<monogr>
		<title level="m">The Semantic Web - ISWC 2021 - 20th International Semantic Web Conference, ISWC 2021, Virtual Event</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2021">October 24-28, 2021</date>
			<biblScope unit="volume">12922</biblScope>
			<biblScope unit="page" from="3" to="20" />
		</imprint>
	</monogr>
	<note>Proceedings</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
