Content-Based Dense Retrieval of Open Datasets

Qiaosheng Chen, qschen@smail.nju.edu.cn, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

Keywords: Dataset Search, Dense Retrieval, Open Data

ISSN: 1613-0073

The rapid growth of open data has intensified the need for effective dataset search capabilities. This research proposal focuses on enhancing dataset search through content-based dense retrieval, addressing the limitations of current metadata-dependent systems. This research aims to tackle the challenges of dataset size, heterogeneity, and the creation of a comprehensive test collection for evaluation. The proposed research methods include data summarization techniques for large datasets and a unified representation of heterogeneous data, which are inspired by research related to the Semantic Web. Additionally, the research will explore a coarse-to-fine tuning strategy for dense retrieval models, leveraging data augmentation through distant supervision and self-training. The evaluation plan involves constructing a content-based test collection and comparing retrieval performance between metadata-only and content-enhanced approaches. The expected outcome is the development of effective content-based dataset search solutions, ultimately improving data findability.

Introduction

The availability and significance of open data have led to a surge in interest in, and reliance on, dataset search within the field of information retrieval [1]. However, existing approaches and systems, represented by Google Dataset Search [2], predominantly rely on metadata (descriptive text about a dataset, such as its title and description), which often suffers from low quality and limited availability. As a result, metadata-based approaches fall short in accurately capturing the relevance of datasets. Addressing the gap between users' real data needs and the quality of dataset metadata requires a shift towards content-based approaches that can effectively harness the richness of dataset content [3,4].

On the other hand, dense retrieval models, which have become mainstream in the field of document retrieval [5], have not yet been fully explored in the field of dataset search. In particular, how to apply dense retrieval models to content-based dataset search problems still faces many challenges. First, the large size of dataset content poses computational challenges, especially when it exceeds the processing capacity of standard dense retrieval models which are mainly based on pre-trained language models (PLMs). Additionally, the heterogeneity of dataset content, spanning various data formats and domains [6], further complicates the development of unified content-based search solutions.

The proposed research aims to contribute towards the development of robust and effective content-based solutions for dataset search, ultimately improving the findability and reusability of open datasets. Users across various domains, including researchers, data scientists, policymakers, and businesses, will benefit from content-based dataset search, while professional researchers in fields such as information retrieval, natural language processing (NLP), and machine learning are particularly invested in its advancement. Industries relying heavily on data-driven decision-making, such as healthcare, finance, agriculture, and environmental science, should also care about its development. Beyond the domain of information retrieval, this research involves technologies relevant to the Semantic Web and Knowledge Graph (KG). RDF datasets represent a significant part of open data. Moreover, employing ontologies or KGs as a framework can aid in analyzing the content of open datasets and processing heterogeneous data from a unified perspective. The advancement of dataset search also stands to catalyze the realization of findable, accessible, interoperable, and reusable (FAIR) open data within the Semantic Web community.

Related Work

In this section, we review recent advancements in dataset search and dense retrieval, highlighting limitations of current dataset search methods and examining strengths of dense retrieval techniques.

Dataset Search

Dataset search has garnered increasing attention with the proliferation of diverse and voluminous datasets, prompting the development of search approaches and systems [1,7]. Notably, Google Dataset Search [2] has paved the way as a pioneering dataset search engine, enabling keyword retrieval over published metadata of Web datasets. However, its reliance on metadata limits its effectiveness in supporting queries oriented towards dataset content. Moreover, existing dataset retrieval test collections [8,9,10] primarily depend on metadata annotations during construction, resulting in a lack of evaluation benchmarks for content-based dataset search.

Recent studies have highlighted the importance of integrating considerations for dataset content to enhance search effectiveness. Ota et al. [11] utilized value co-occurrence information within tabular datasets to infer attribute domains, while Chen et al. [12] proposed a BERT-based ranking model for table retrieval, focusing on selecting the most salient table items as representatives of the entire dataset. StruBERT [13] introduced a structure-aware BERT model to capture both structural and textual information of tabular datasets. Moreover, existing tabular dataset or RDF dataset search systems such as Auctus [14], LODAtlas [15], and CKGSE [16] leverage dataset content to augment retrieval capabilities and enhance user search experiences. However, these efforts primarily focus on single-format data, such as tabular or RDF data, overlooking the challenges posed by multi-format datasets.

Dense Retrieval

Recent advancements in text retrieval have been significantly influenced by the incorporation of PLMs, which have demonstrated remarkable capabilities in capturing semantic nuances within text [5]. This approach, often referred to as dense retrieval, leverages the dense vector representations (embeddings) of text to facilitate semantic matching between queries and documents. Notably, Karpukhin et al. [17] presented dense passage retrieval (DPR) for open-domain question answering, highlighting the effectiveness of PLMs in this context. Their work has been seminal in shaping subsequent research. The concept of using multiple representations for improved text encoding has been explored by Humeau et al. [18] through their poly-encoder architectures, which allow for richer semantic interactions between queries and texts. The challenge of training efficient and robust dense retrievers has been addressed by various works. For instance, Gao and Callan [19] introduced Condenser, a pre-training architecture specifically designed to improve dense retrieval. Nogueira et al. [20] demonstrated the effectiveness of multistage document ranking using BERT, showcasing how PLMs can be effectively integrated into the reranking stage. Furthermore, ColBERT [21] has provided insights into efficient and effective passage search through contextualized late interaction over BERT. Most current dense retrieval methods focus on the retrieval of text documents or passages, whereas the structured content of datasets requires new dense model structures or retrieval strategies. The large data size and complex heterogeneity also make it difficult to directly treat the dataset content as plain text.

Problem Statement

In this section, we discuss the typical composition of datasets, outline the problem of content-based dataset search, and formulate hypotheses and research questions derived from our investigation.

To clarify the distinction between dataset search and general document search, we first introduce the composition of a dataset, which consists of the following two parts:

• Metadata: This part includes descriptive fields provided by the dataset publisher, such as title, description, publishing organization, and other useful information about the dataset.
• Data Files: A dataset consists of various data files, potentially in different formats. This research only considers textual data files, including unstructured TXT, PDF, and DOC files, as well as structured files such as graph data (RDF, OWL), tabular data (CSV, XLS), and key-value pair data (JSON, XML). Images (JPEG, PNG), videos (AVI, MPG), audios (WAV, MP3), and other non-textual formats are excluded from the scope of this research.

The research focuses on ad hoc dataset retrieval [8,3], the foundational form of dataset search. This process involves retrieving, from a collection $D$ of datasets, a ranked list of datasets $\langle d_1, d_2, \ldots \rangle$ that are most relevant to a keyword query $q$. The relevance assessment between query $q$ and each dataset $d \in D$ is conducted independently of other datasets $d' \in D$, where $d \neq d'$. The primary objective is to compute the relevance score of each dataset $d \in D$ to a given keyword query $q$. The prevalent dense retrieval paradigm typically employs a PLM as an encoder $E(\cdot)$ to encode a dataset $d$ and a query $q$ into vectors $\mathbf{v}_d$ and $\mathbf{v}_q$ respectively. Subsequently, it computes the similarity score between these vectors to gauge the relevance of $d$ to $q$.
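The encode-then-compare paradigm described above can be illustrated with a minimal Python sketch. It is not the proposed system: the encoder is a toy term-count vectorizer standing in for a PLM, cosine similarity is only one possible similarity function, and the names `tokenize`, `build_vocab`, `encode`, `cosine`, and `rank_datasets` are hypothetical.

```python
import math
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocab(texts):
    """Assign one vector dimension to each distinct token in the collection."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Toy stand-in for a PLM encoder E(.): a term-count vector (assumption)."""
    vec = [0.0] * len(vocab)
    for token in tokenize(text):
        if token in vocab:
            vec[vocab[token]] += 1.0
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_datasets(query, datasets):
    """Score every dataset independently against the query, then sort."""
    vocab = build_vocab(datasets.values())
    v_q = encode(query, vocab)
    scored = [(name, cosine(v_q, encode(text, vocab)))
              for name, text in datasets.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

datasets = {
    "air-quality": "hourly air quality measurements pm25 ozone city stations",
    "bus-routes": "public transport bus routes stops timetable",
}
ranking = rank_datasets("air quality ozone", datasets)
```

In a real dense retriever, `encode` would be replaced by a fine-tuned PLM and the dataset vectors would be computed once and stored in an index rather than re-encoded per query.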

According to studies on metadata quality [22,23], the metadata of open datasets on the Web often lacks guaranteed quality and is underutilized by both publishers and users. Meanwhile, dense retrieval models based on PLMs have exhibited increasingly powerful text understanding capabilities with advancements in NLP [5]. Hence, the application of dense retrieval models in dataset search becomes imperative. Based on these findings, we propose the following main hypothesis and research question:

Hypothesis. Dataset metadata quality often varies and may not fully describe the content. Users frequently seek information from the actual data files, and content-focused queries may not align well with the available metadata.

RQ0. To what extent can content-based dense dataset retrieval methods outperform traditional metadata-centered approaches?

Building upon the hypothesis and RQ0, this research investigates the application of dense models to content-based dataset retrieval. Nonetheless, representing and indexing complex dataset content with PLM-based dense models presents substantial challenges. To address these challenges, we decompose RQ0 into the following four specific research questions:

RQ1. How to overcome the challenge presented by the extensive size of dataset content, especially when it exceeds the processing capacity of dense retrieval models?

RQ2. How to address the heterogeneity of dataset content, encompassing variations in formats?

RQ3. How to develop a dataset retrieval test collection that considers the content of datasets, rather than one annotated solely on the basis of metadata?

RQ4. How to enhance the size and quality of existing public dataset retrieval test collections, particularly in terms of providing sufficient training data for dense retrieval models?

Research Methods

In this section, we provide a detailed and systematic research methodology that outlines how we address each research question (RQ1-RQ4) and validate our hypotheses [3,24,25,26]. The methodology is structured to ensure a comprehensive and coherent approach to solving the challenges of content-based dense retrieval of open datasets.

RQ1. To overcome the challenge posed by the large size of dataset content that exceeds the input capacity of PLMs, this research proposed an approach involving the extraction of a data summary for each dataset. Starting with RDF datasets, we introduced a technique to handle large RDF datasets by extracting a compact, representative subset of RDF triples [25]. This subset was selected to preserve the semantic integrity of the dataset and was used to create a document representation that fits within the token limit of dense ranking models. We employed two of the existing static RDF dataset summarization methods, IlluSnip [27,28] selecting top-ranked RDF triples covering the most frequent classes, properties, and entities, and PCSG [29] extracting a connected subgraph from an RDF graph covering as many data patterns as possible. Furthermore, we proposed a dynamic data summary extraction method for dataset search, selecting compact data snippets of appropriate size that are relevant to the user query [26]. By integrating these methods, one can create a compact, semantically representative, and query-biased data summary of the original dataset. This enables the use of PLMs for tasks such as dense dataset retrieval, where the models can process the summarized data to understand and rank datasets based on their relevance to user queries without being hindered by size limitations.
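To make the idea of a size-bounded static summary concrete, the following sketch greedily selects a small set of triples that covers the most frequent properties and classes. It is a deliberately simplified illustration under our own assumptions, not an implementation of IlluSnip or PCSG; the function name `summarize`, the `rdf:type` string convention, and the frequency-based gain are all hypothetical choices.

```python
from collections import Counter

RDF_TYPE = "rdf:type"  # assumed prefixed-name convention for typing triples

def schema_terms(triple):
    """Schema-level terms (property, and class if a typing triple) of a triple."""
    _, p, o = triple
    terms = {("property", p)}
    if p == RDF_TYPE:
        terms.add(("class", o))
    return terms

def summarize(triples, budget):
    """Greedily pick up to `budget` triples covering frequent schema terms."""
    # Weight each property and class by how often it occurs in the dataset.
    weight = Counter()
    for triple in triples:
        for term in schema_terms(triple):
            weight[term] += 1

    selected, covered = [], set()
    candidates = list(triples)
    while candidates and len(selected) < budget:
        # Gain = total frequency of the not-yet-covered terms a triple adds.
        best = max(candidates,
                   key=lambda t: sum(weight[x] for x in schema_terms(t) - covered))
        selected.append(best)
        covered |= schema_terms(best)
        candidates.remove(best)
    return selected

triples = [
    ("ex:a", "rdf:type", "ex:City"),
    ("ex:b", "rdf:type", "ex:City"),
    ("ex:a", "ex:population", "8000000"),
    ("ex:b", "ex:population", "500000"),
    ("ex:a", "ex:mayor", "ex:m1"),
]
summary = summarize(triples, budget=3)
```

With a budget of three, the sketch keeps one typing triple, one `ex:population` triple, and the `ex:mayor` triple, so each schema term appears once. A query-biased variant would additionally weight terms by their match with the query.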

RQ2. We address the challenge of heterogeneity in dataset content by transforming data from various formats into a unified representation. The method establishes mapping rules for structured data, such as graph data, tabular data, and key-value pair data. These rules convert the heterogeneous data into unified data chunks. Each data chunk is modeled as a set of data triples, which consist of a subject, a predicate, and an object. This triple-structured format allows for uniform processing of all datasets, regardless of their original format. Converting different data formats into unified data chunks creates a consistent input for dense ranking models. This approach allows for the exploitation of heterogeneous data in dataset ranking, overcoming the limitations imposed by the diverse formats of open data. The summarized data chunks can then be used to rank datasets based on their relevance to a given query, thus enhancing the search accuracy and making the process more efficient. To ensure that the structural information is not lost during the conversion process, the mapping rules we employed preserve the hierarchical and relational aspects of the original data. For graph data, we maintain the relationships between nodes and edges by representing them as triples. For tabular data, we preserve the row-column structure by mapping rows to subjects and columns to predicates. For key-value pair data, we maintain the key-value relationships by representing keys as predicates and values as objects. This approach ensures that the structural integrity of the original data is preserved, which is beneficial for accurate retrieval. Additionally, we conduct experiments to evaluate the impact of content on retrieval performance, providing insights into the importance of preserving this information during the conversion process.
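The mapping rules described above can be sketched as follows for tabular and key-value data. This is a minimal illustration under assumptions of our own: the subject naming scheme (`dataset_id/rowN`, `subject/keyN`) and the function names `table_to_triples` and `json_to_triples` are hypothetical, and graph data is omitted because it is already triple-structured.

```python
import csv
import io
import json

def table_to_triples(csv_text, dataset_id):
    """Map each row to a subject and each column header to a predicate."""
    reader = csv.DictReader(io.StringIO(csv_text))
    triples = []
    for i, row in enumerate(reader):
        subject = f"{dataset_id}/row{i}"  # hypothetical subject naming
        for column, value in row.items():
            triples.append((subject, column, value))
    return triples

def json_to_triples(obj, subject):
    """Map keys to predicates and values to objects.

    Nested objects become new subjects so the hierarchy is preserved.
    """
    triples = []
    for key, item in obj.items():
        items = item if isinstance(item, list) else [item]
        for n, element in enumerate(items):
            if isinstance(element, dict):
                child = f"{subject}/{key}{n}"  # hypothetical child naming
                triples.append((subject, key, child))
                triples.extend(json_to_triples(element, child))
            else:
                triples.append((subject, key, str(element)))
    return triples

table = table_to_triples("city,pm25\nNanjing,35\nMadrid,12\n", "ds1")
record = json.loads('{"name": "station-7", "location": {"city": "Nanjing"}}')
nested = json_to_triples(record, "ds2")
```

Both functions return plain `(subject, predicate, object)` tuples, so downstream summarization and ranking code can treat every format the same way.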

RQ3. We released a content-based RDF dataset retrieval test collection ACORDAR [3], and subsequently enhanced it to build ACORDAR 2.0 [24]. Constructing this content-based dataset retrieval test collection began with the collection of RDF datasets from various open data portals, ensuring a diverse and representative sample. Keyword queries were then formulated, either by analyzing user needs or through crowd-sourcing, resulting in a set of search terms that reflected actual information demands. To accommodate the complexity and size of datasets, a dashboard was developed to assist annotators in browsing and understanding the content of datasets. This tool was crucial for creating content-oriented queries and making informed relevance judgments. Annotators used the dashboard to analyze datasets and generate queries that capture the dataset's essence. These queries were then used to pool potentially relevant datasets, which were subsequently annotated for relevance. The pooling process was done using both sparse and dense retrieval models to ensure a broad coverage of potential matches. Relevance judgments were made on a graded scale, with annotators assessing the degree to which each dataset met the query's requirements. To ensure quality, annotations involved multiple annotators and a validation process. ACORDAR 2.0 was further enriched by transforming keyword queries into question-style queries using a large language model (LLM), which increased the diversity of the queries and simulated more natural information-seeking behavior. Our test collection provides a benchmark for evaluating content-based dataset retrieval systems.

RQ4. To address the challenge of limited labeled data for training dense retrieval models, we proposed a coarse-to-fine tuning strategy [25]. This strategy involved an initial coarse-tuning phase with weak supervision obtained from a large set of automatically generated queries and relevance labels. It incorporated two data augmentation methods: distant supervision and self-training. In the distant supervision method, the title of each dataset served as a query, and the metadata document was assumed to be relevant to this query, thereby generating numerous labeled examples. Meanwhile, the self-training method employed dataset-to-query generators trained on labeled data to generate queries from unlabeled data, further expanding the training data for dense models.
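The distant supervision step can be sketched in a few lines of Python. The sketch assumes that each dataset's metadata is available as a dictionary and that the metadata document is formed by joining the non-title fields; both are illustrative assumptions rather than details given in [25], and the name `distant_supervision_pairs` is hypothetical.

```python
def distant_supervision_pairs(datasets):
    """Build weakly labeled (query, document, label) examples from metadata."""
    examples = []
    for meta in datasets:
        title = str(meta.get("title", "")).strip()
        if not title:
            continue  # without a title there is no pseudo-query
        # Assumption: the document is the remaining metadata fields as text.
        document = " ".join(str(v) for k, v in meta.items()
                            if k != "title" and v)
        if document:
            examples.append((title, document, 1))  # 1 = assumed relevant
    return examples

catalog = [
    {"title": "Air quality 2020",
     "description": "Hourly PM2.5 readings",
     "publisher": "City Lab"},
    {"title": "", "description": "untitled"},
]
pairs = distant_supervision_pairs(catalog)
```

Only positive pairs are produced here. How negatives are obtained for training (for example, from other datasets in the same batch) is left open in this sketch.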

This systematic methodology ensures that each research question is addressed with a clear and structured approach, leading to the validation of our hypotheses and the development of effective content-based dataset search solutions.

Evaluation

The evaluation plan for this research involves constructing content-based dataset retrieval test collections [3,24] following the methodology outlined in Section 4. Dataset retrieval and reranking experiments will be conducted on these test collections, as well as on existing public dataset retrieval test collections [8]. Performance will be assessed using commonly used information retrieval metrics such as Recall, Normalized Discounted Cumulative Gain (NDCG), and Mean Average Precision (MAP). The primary objectives of these experiments are as follows:

1. To compare the retrieval performance using solely metadata against retrieval using metadata combined with content.
2. To assess the performance disparity between dense retrieval models and traditional sparse retrieval models in the dataset search scenario.
3. To analyze the impact of various data summarization methods for representing data content in dataset retrieval.
4. To investigate the effectiveness of different query types and characteristics in both metadata-based and content-based retrieval methods.
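For reference, the metrics named in the evaluation plan can be computed per query as in the sketch below. It uses the linear-gain form of DCG, which is one common convention (an exponential-gain variant also exists), and the function names are our own. MAP is the mean of `average_precision` over all queries.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant datasets found in the top k."""
    hits = sum(1 for d in ranked[:k] if d in relevant)
    return hits / len(relevant) if relevant else 0.0

def dcg(gains):
    """Discounted cumulative gain with linear gains and a log2 discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked, grades, k):
    """grades maps dataset id to graded relevance; missing ids count as 0."""
    gains = [grades.get(d, 0) for d in ranked[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(gains) / best if best else 0.0

def average_precision(ranked, relevant):
    """Mean of precision values at the ranks of the relevant datasets."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

ranked = ["d3", "d1", "d2"]          # system output for one query
grades = {"d1": 2, "d2": 1}          # graded judgments for that query
relevant = set(grades)

recall = recall_at_k(ranked, relevant, k=2)
ndcg = ndcg_at_k(ranked, grades, k=3)
ap = average_precision(ranked, relevant)
```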

Comparison of Metadata-Only vs. Content-Enhanced Retrieval. We will conduct experiments to compare the retrieval performance of systems that use only metadata against those that combine metadata with content. This analysis will assess the extent to which content-based retrieval improves search accuracy and relevance. We will examine performance metrics across various dataset types and query scenarios to identify specific cases where content-based retrieval provides significant advantages.

Performance Disparity Between Dense and Sparse Retrieval Models. We will compare the performance of dense retrieval models, which use PLMs, with traditional sparse retrieval models like BM25, which rely on term frequency-based scoring. This evaluation will highlight the strengths and limitations of dense retrieval models in dataset search. By analyzing their performance across diverse query types and datasets, we aim to identify scenarios where dense models excel, particularly in capturing semantic relevance, versus scenarios where sparse models may be more effective.
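As a point of reference for the sparse baseline, a compact BM25 scorer is sketched below. It uses the widely used smoothed IDF variant and the common defaults k1 = 1.2 and b = 0.75. These parameter values, the tokenizer, and the function name `bm25_scores` are assumptions made for illustration, not settings taken from the planned experiments.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given keyword query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n if n else 0.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue  # exact term match is required, unlike dense models
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = ["air quality ozone measurements", "bus routes and stops"]
scores = bm25_scores("ozone air", docs)
```

Because a document scores zero unless it shares a literal term with the query, a query such as "smog levels" would not match the first document here, which is the kind of vocabulary mismatch dense models are expected to mitigate.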

Impact of Data Summarization Methods. The role of data summarization methods in improving retrieval performance will be analyzed by testing both static techniques, such as IlluSnip and PCSG, and dynamic methods, which generate query-biased summaries. We will evaluate how these summarization approaches influence the relevance and efficiency of dataset retrieval. Additionally, we will explore the trade-offs between summarization quality and computational cost, providing insights into balancing performance with resource demands.

Query Type and Characteristic Analysis. A detailed examination of different query types and their characteristics will be conducted to understand their effectiveness in metadata-based and content-based retrieval methods. We hypothesize that specific queries requiring detailed content comprehension or precision may benefit more from content-based retrieval. On the other hand, more general queries or those that can be effectively addressed with metadata alone may exhibit similar performance across both approaches. This analysis will help refine retrieval strategies based on query requirements.

In addition, given that the eventual deployment of this work is envisioned in real-world dataset search applications, it is imperative to evaluate the time efficiency and additional space requirements of modules such as data summarization and dense retrieval models when processing real-world open datasets.

Conclusion and Future Work

The research proposal on content-based dense retrieval of open datasets is crucial in navigating the vast landscape of available data resources. By shifting the focus from metadata to the actual content of datasets, we can enhance search accuracy, ultimately facilitating more informed decision-making in data discovery. The long-term value of this research lies in its potential to streamline access to diverse datasets, empowering researchers, businesses, and policymakers with valuable insights.

Limitations and Challenges

Content-based dataset retrieval systems face several limitations and challenges. Time efficiency is a critical issue, as dense retrieval models and summarization techniques require significant computational resources, particularly when processing large datasets. Storage requirements are another concern, as pre-trained language models and their embeddings demand substantial space, making deployment difficult in resource-constrained environments. The heterogeneity and complexity of dataset formats further complicate retrieval, as it is challenging to develop unified solutions that preserve both structural and semantic information. Evaluation is also problematic, as constructing comprehensive and realistic test collections that reflect real-world scenarios is complex yet crucial for assessing system performance. Finally, query understanding remains a persistent challenge, particularly for complex queries that require detailed comprehension of dataset content to map them effectively to relevant datasets.

Future Research Directions

Future research will focus on several directions to overcome these challenges and enhance dataset retrieval systems. Integrating LLMs into dataset search pipelines offers the potential to improve both accuracy and efficiency, with planned evaluations to quantify their impact on performance metrics. Explainable data summarization techniques will be explored to provide transparent insights into the generation of data summaries and the rationale behind dataset rankings, fostering trust and usability. Methods for content pattern analysis will be developed to identify and utilize patterns within dataset content, improving retrieval accuracy. Expanding the scope to multi-modal retrieval will address the need to handle diverse data types, including images, videos, and audio, efficiently and at scale. Additionally, real-world deployment of these systems will be prioritized to evaluate scalability and gather user feedback, guiding further refinements and optimizations.

Acknowledgments

The author would like to express his thanks to his supervisor Prof. Gong Cheng for providing helpful suggestions and comments. This work was supported by the NSFC (62072224).

References

[1] A. Chapman, E. Simperl, L. Koesten, G. Konstantinidis, L. Ibáñez, E. Kacprzak, P. Groth, Dataset search: a survey, VLDB J. 29 (2020). doi:10.1007/S00778-019-00564-X.
[2] D. Brickley, M. Burgess, N. F. Noy, Google dataset search: Building a search engine for datasets in an open web ecosystem, in: The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, ACM, 2019. doi:10.1145/3308558.3313685.
[3] T. Lin, Q. Chen, G. Cheng, A. Soylu, B. Ell, R. Zhao, Q. Shi, X. Wang, Y. Gu, E. Kharlamov, ACORDAR: A test collection for ad hoc content-based (RDF) dataset retrieval, in: SIGIR '22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11-15, 2022, ACM, 2022. doi:10.1145/3477495.3531729.
[4] J. Chen, X. Wang, G. Cheng, E. Kharlamov, Y. Qu, Towards more usable dataset search: From query characterization to snippet generation, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019, ACM, 2019. doi:10.1145/3357384.3358096.
[5] W. X. Zhao, J. Liu, R. Ren, J. Wen, Dense text retrieval based on pretrained language models: A survey, arXiv:2211.14876, 2022. doi:10.48550/ARXIV.2211.14876.
[6] O. Benjelloun, S. Chen, N. F. Noy, Google dataset search by the numbers, in: The Semantic Web - ISWC 2020 - 19th International Semantic Web Conference, Athens, Greece, November 2-6, 2020, Proceedings, Part II, volume 12507 of Lecture Notes in Computer Science, Springer, 2020. doi:10.1007/978-3-030-62466-8_41.
[7] N. W. Paton, J. Chen, Z. Wu, Dataset discovery and exploration: A survey, ACM Comput. Surv. 56 (2023). doi:10.1145/3626521.
[8] M. P. Kato, H. Ohshima, Y. Liu, H. O. Chen, A test collection for ad-hoc dataset retrieval, in: SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, ACM, 2021. doi:10.1145/3404835.3463261.
[9] T. Cohen, K. Roberts, A. E. Gururaj, X. Chen, S. Pournejati, G. Alter, W. R. Hersh, D. Demner-Fushman, L. Ohno-Machado, H. Xu, A publicly available benchmark for biomedical dataset retrieval: the reference standard for the 2016 bioCADDIE dataset retrieval challenge, Database J. Biol. Databases Curation 2017 (2017) 61. doi:10.1093/DATABASE/BAX061.
[10] F. Löffler, A. Schuldt, B. König-Ries, H. Bruelheide, F. Klan, A test collection for dataset retrieval in biodiversity research, Res. Ideas Outcomes 7 (2021) e67887. doi:10.3897/rio.7.e67887.
[11] M. Ota, H. Mueller, J. Freire, D. Srivastava, Data-driven domain discovery for structured datasets, Proc. VLDB Endow. 13 (2020). doi:10.14778/3384345.3384346.
[12] Z. Chen, M. Trabelsi, J. Heflin, Y. Xu, B. D. Davison, Table search using a deep contextualized language model, in: SIGIR 2020, ACM, 2020. doi:10.1145/3397271.3401044.
[13] M. Trabelsi, Z. Chen, S. Zhang, B. D. Davison, J. Heflin, StruBERT: Structure-aware BERT for table search and matching, in: WWW 2022, ACM, 2022. doi:10.1145/3485447.3511972.
[14] S. Castelo, R. Rampin, A. S. R. Santos, A. Bessa, F. Chirigati, J. Freire, Auctus: A dataset search engine for data discovery and augmentation, Proc. VLDB Endow. 14 (2021). doi:10.14778/3476311.3476346.
[15] E. Pietriga, H. Gözükan, C. Appert, M. Destandau, S. Cebiric, F. Goasdoué, I. Manolescu, Browsing linked data catalogs with LODAtlas, in: ISWC 2018, volume 11137 of Lecture Notes in Computer Science, Springer, 2018. doi:10.1007/978-3-030-00668-6_9.
[16] X. Wang, T. Lin, W. Luo, G. Cheng, Y. Qu, CKGSE: A prototype search engine for Chinese knowledge graphs, Data Intell. 4 (2022). doi:10.1162/DINT_A_00118.
[17] V. Karpukhin, B. Oguz, S. Min, P. S. H. Lewis, L. Wu, S. Edunov, D. Chen, W. Yih, Dense passage retrieval for open-domain question answering, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, November 16-20, 2020, Association for Computational Linguistics, 2020. doi:10.18653/V1/2020.EMNLP-MAIN.550.
[18] S. Humeau, K. Shuster, M. Lachaux, J. Weston, Poly-encoders: Architectures and pretraining strategies for fast and accurate multi-sentence scoring, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview, 2020.
[19] L. Gao, J. Callan, Condenser: a pre-training architecture for dense retrieval, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, Association for Computational Linguistics, 2021. doi:10.18653/V1/2021.EMNLP-MAIN.75.
[20] R. F. Nogueira, K. Cho, Passage re-ranking with BERT, CoRR abs/1901.04085 (2019).
[21] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020, ACM, 2020. doi:10.1145/3397271.3401075.
[22] A. Quarati, Open government data: Usage trends and metadata quality, J. Inf. Sci. 49 (2023). doi:10.1177/01655515211027775.
[23] M. A. Musen, M. J. O'Connor, E. Schultes, M. M. Romero, J. Hardi, J., Modeling community standards for metadata as templates makes data FAIR, arXiv:2208.02836, 2022. doi:10.48550/ARXIV.2208.02836.
[24] Q. Chen, W. Luo, Z. Huang, T. Lin, X. Wang, A. Soylu, B. Ell, B. Zhou, E. Kharlamov, G. Cheng, ACORDAR 2.0: A test collection for ad hoc dataset retrieval with densely pooled datasets and question-style queries, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024, Washington DC, USA, July 14-18, 2024, ACM, 2024. doi:10.1145/3626772.3657866.
[25] Q. Chen, Z. Huang, Z. Zhang, W. Luo, T. Lin, Q. Shi, G. Cheng, Dense re-ranking with weak supervision for RDF dataset search, in: The Semantic Web - ISWC 2023 - 22nd International Semantic Web Conference, Athens, Greece, November 6-10, 2023, Proceedings, Part I, volume 14265 of Lecture Notes in Computer Science, Springer, 2023. doi:10.1007/978-3-031-47240-4_2.
[26] Q. Chen, J. Chen, X. Zhou, G. Cheng, Enhancing dataset search with compact data snippets, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024, Washington DC, USA, July 14-18, 2024, ACM, 2024. doi:10.1145/3626772.3657837.
[27] G. Cheng, C. Jin, W. Ding, D. Xu, Y. Qu, Generating illustrative snippets for open data on the web, in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM 2017, Cambridge, United Kingdom, February 6-10, 2017, ACM, 2017. doi:10.1145/3018661.3018670.
[28] D. Liu, G. Cheng, Q. Liu, Y. Qu, Fast and practical snippet generation for RDF datasets, ACM Trans. Web 13 (2019) 38. doi:10.1145/3365575.
[29] X. Wang, G. Cheng, T. Lin, J. Xu, J. Z. Pan, E. Kharlamov, Y. Qu, PCSG: pattern-coverage snippet generation for RDF datasets, in: The Semantic Web - ISWC 2021 - 20th International Semantic Web Conference, ISWC 2021, Virtual Event, October 24-28, 2021, Proceedings, volume 12922 of Lecture Notes in Computer Science, Springer, 2021. doi:10.1007/978-3-030-88361-4_1.