<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Unleashing the Potential of Data Lakes with Semantic Enrichment Using Foundation Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nandana Mihindukulasooriya</string-name>
          <email>nandana@ibm.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarthak Dash</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sugato Bagchi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Faisal Chowdhury</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alfio Gliozzo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ariel Farkash</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Glass</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Igor Gokhman</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oktie Hassanzadeh</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nhan Pham</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gaetano Rossiello</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Boris Rozenberg</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yehoshua Sagron</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dharmashankar Subramanian</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Toshihiro Takahashi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Takaaki Tateishi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Long Vu</string-name>
        </contrib>
        <aff>IBM Research AI</aff>
      </contrib-group>
      <kwd-group kwd-group-type="author">
        <kwd>Data Lakes</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Semantic Enrichment</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Foundation Models</kwd>
      </kwd-group>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Nowadays, most organizations manage data lakes containing heterogeneous data from various sources. However, the lack of adequate metadata often turns these data lakes into data swamps, making it challenging to locate relevant data for critical organizational tasks and consequently limiting their utility. Recent advancements in large language models and foundation models have enabled the automation of metadata generation with generative AI models, and the use of the generated metadata to map tabular data into semantically richer glossaries, taxonomies, or ontologies.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Existing benchmarks, such as the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab), use public domain data that includes entities that can be linked to open knowledge graphs such as DBpedia or Wikidata. In such settings, systems can first link the cells to entities in KGs and then perform enrichment tasks based on those linked entities. Nevertheless, in an enterprise setting, there are a number of additional challenges. First, table and column names are often abbreviated using data-owner-specific codes or acronyms, with minimal or no textual descriptions. This makes it harder to search for and discover tables using keyword or semantic search.</p>
      <p>Furthermore, most organizations only permit semantic enrichment processes to access the table metadata, such as column headers, and not the actual data (i.e., cell values), due to privacy and access control regulations. Even when the data is available, it often consists of entities that are not present in public knowledge graphs and thus cannot be linked. Therefore, in most industrial settings, automatic metadata generation and mapping of table columns to concepts using only table metadata become necessary.</p>
    </sec>
    <sec id="sec-2">
      <title>Semantic Enrichment Process</title>
      <p>The inputs to our semantic enrichment process are a set of table metadata (table names and column headers) from a data lake and a business glossary that defines the concepts of interest to the organization. The process consists of three steps: (a) column name expansion; (b) table metadata enrichment; and (c) column-to-concept mapping (also known as Column Type Annotation, or CTA).</p>
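      <p>As a rough sketch, the three steps can be composed into a simple pipeline. Everything below (the function names, the toy abbreviation lookup, and the token-overlap heuristic) is an illustrative assumption standing in for the language-model-based components described here, not the actual implementation:</p>

```python
# Illustrative sketch of the three-step semantic enrichment pipeline.
# The abbreviation table and scoring heuristics are toy stand-ins for
# the language-model components used in the real system.

ABBREVIATIONS = {"cust": "customer", "dob": "date of birth", "addr": "address"}

def expand_column_name(column, table, neighbours):
    """Step (a): expand an abbreviated column name, using the table name
    and adjacent column names as context (here: a toy lookup table)."""
    return ABBREVIATIONS.get(column.lower(), column)

def enrich_table_metadata(table, columns):
    """Step (b): produce a human-readable table description (a generative
    model would produce this in the real pipeline)."""
    return f"Table '{table}' with columns: {', '.join(columns)}"

def map_column_to_concept(column, glossary):
    """Step (c): map a column to the glossary concept with the highest
    token overlap (a stand-in for embedding similarity)."""
    tokens = set(column.lower().split())
    return max(glossary, key=lambda c: len(tokens & set(c.lower().split())))

glossary = ["customer name", "date of birth", "postal address"]
columns = ["cust", "dob", "addr"]
expanded = [expand_column_name(c, "clients", columns) for c in columns]
description = enrich_table_metadata("clients", expanded)
concepts = {c: map_column_to_concept(c, glossary) for c in expanded}
```

      <p>In the actual process, each stub would be backed by a model: a language model scores candidate expansions in step (a), a generative model writes descriptions in step (b), and embedding similarity drives step (c).</p>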
      <p>The objective of the column name expansion step is to generate meaningful column names for abbreviated and cryptically coded column names, using the adjacent column names of the same table and the table name as context. The perplexity of language models, along with clues from the business glossaries, is used for this step. For table metadata generation, decoder-only auto-regressive models similar to GPT-4 / Llama 2, or encoder-decoder sequence-to-sequence models similar to FLAN-T5, are trained using either public open data from portals such as data.gov or industrial data when available. The column-to-concept mapping implementation uses Sentence Transformer (SBERT) models to compute similarities between a column metadata representation and a business glossary term representation. As a result of this process, the tables in the data lake are annotated with human-readable descriptions and tags as well as business concepts from glossaries.</p>
      <p>Table metadata can be represented as a knowledge graph containing tables and columns with their relationships. Similarly, the glossary concepts can be represented as a knowledge graph in which concepts are linked using relations such as subclass of or part of. Through the process of semantic enrichment, we connect these two knowledge graphs, unveiling the semantics of the columns and enhancing their utility for downstream tasks.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>The advancements in large language models have enabled automatic metadata generation for tables using generative AI models, and the mapping of columns to glossary concepts using models such as sentence transformers. By representing table metadata, business glossaries, and the mappings between them in a single knowledge graph, downstream applications such as table search and discovery or automatic table joins can utilise this information effectively. In this talk, we plan to discuss the semantic enrichment challenges in an industrial setting, our approach to addressing those challenges, lessons learned, and future directions.</p>
    </sec>
  </body>
</article>