Unleashing the Potential of Data Lakes with Semantic Enrichment Using Foundation Models

Nandana Mihindukulasooriya†, Sarthak Dash, Sugato Bagchi, Faisal Chowdhury, Alfio Gliozzo, Ariel Farkash, Michael Glass, Igor Gokhman, Oktie Hassanzadeh, Nhan Pham, Gaetano Rossiello, Boris Rozenberg, Yehoshua Sagron, Dharmashankar Subramanian, Toshihiro Takahashi, Takaaki Tateishi and Long Vu

IBM Research AI

ISWC 2023 Industry Track, November 06–10, 2023, Athens, Greece
† Corresponding author: nandana@ibm.com (N. Mihindukulasooriya), ORCID 0000-0003-1707-4842

Abstract
Nowadays, most organizations manage data lakes containing heterogeneous data from various sources. However, the lack of adequate metadata often turns these data lakes into data swamps, making it challenging to locate relevant data for critical organizational tasks and consequently limiting their utility. Recent advancements in large language models and foundation models have enabled the automation of metadata generation using generative AI models and the use of the generated metadata to map tabular data into semantically richer glossaries, taxonomies, or ontologies. In this talk, we will present a semantic enrichment process that generates table metadata such as descriptive table captions, tags, expanded column names, and column descriptions, and then uses that information to map table columns to concepts in a given business glossary or ontology. Furthermore, during this process, we represent both the table metadata and the business glossary as knowledge graphs and connect them by mapping columns to business concepts. As a result, the enrichment process makes the data in data lakes more meaningful to the organization and enhances downstream tasks, including improved table search and discovery, efficient table joins, and advanced business analytics.

Keywords
Data Lakes, Knowledge Graph, Semantic Enrichment, Large Language Models, Foundation Models

Introduction
The use of data lakes by organizations to handle large volumes of structured, semi-structured, and unstructured data from multiple sources is becoming common practice. Nevertheless, tables in these data lakes often suffer from issues such as abbreviated column names and missing table or column descriptions, tags, and other metadata. A lack of adequate metadata can limit the usefulness of data lakes and hinder relevant data from being found and utilized efficiently in downstream tasks. In this talk, we will present the current challenges of data lakes in an industrial setting and how we are addressing those challenges with a semantic enrichment process that uses both large language models and knowledge graphs.

Current academic benchmarks for semantic enrichment often do not sufficiently reflect the challenges found in industrial data lakes. The majority of academic benchmarks for semantic enrichment tasks, such as the Column Type Annotation (CTA) task in the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab), use public-domain data whose entities can be linked to open knowledge graphs such as DBpedia or Wikidata.
In such settings, systems can first link cell values to entities in these knowledge graphs and then perform enrichment tasks based on the linked entities. Nevertheless, an enterprise setting poses a number of additional challenges. First, table and column names are often abbreviated using data-owner-specific codes or acronyms, with minimal or no textual descriptions. This makes it harder to search for and discover tables using keyword or semantic search. Furthermore, most organizations permit semantic enrichment processes to access only table metadata, such as column headers, and not the actual data (i.e., cell values), due to privacy and access control regulations. Even when the data are available, they consist of entities that are not present in public knowledge graphs and thus cannot be linked. Therefore, in most industrial settings, it becomes necessary to generate metadata automatically and to map table columns to concepts using only table metadata.

Semantic Enrichment Process
The inputs to our semantic enrichment process are a set of table metadata (table names and column headers) from a data lake and a business glossary that defines the concepts of interest to the organization. The semantic enrichment process consists of three steps: (a) column name expansion; (b) table metadata enrichment; and (c) column-to-concept mapping (also known as Column Type Annotation, or CTA).

The objective of the column name expansion step is to generate meaningful names for abbreviated and cryptically coded column names, using the table name and the adjacent column names of the same table as context. This step relies on the perplexity of language models together with clues from the business glossaries. For table metadata generation, decoder-only auto-regressive models similar to GPT-4 or Llama 2, or encoder-decoder sequence-to-sequence models similar to FLAN-T5, are trained either on public open data from portals such as data.gov or on industrial data when available. The column-to-concept mapping implementation uses Sentence Transformer (SBERT) models to compute similarities between a representation of the column metadata and a representation of each business glossary term (see the sketch at the end of this abstract). As a result of this process, the tables in the data lake can be annotated with both human-readable descriptions and tags as well as business concepts from glossaries.

Table metadata can be represented as a knowledge graph containing tables and columns with their relationships. Similarly, the glossary concepts can be represented as a knowledge graph in which concepts are linked using relations such as subclass of or part of. Through the process of semantic enrichment, we connect these two knowledge graphs, unveiling the semantics of the columns and enhancing their utility for downstream tasks.

Conclusion
The advancements in large language models have enabled automatic metadata generation for tables using generative AI models and the mapping of columns to glossary concepts using models such as sentence transformers. By representing table metadata, business glossaries, and the mappings between them in a single knowledge graph, downstream applications such as table search and discovery or automatic table joins can utilize this information effectively. In this talk, we plan to discuss the semantic enrichment challenges in an industrial setting, our approach to addressing those challenges, lessons learned, and future directions.
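To make the column-to-concept mapping step more concrete, the sketch below scores the similarity between enriched column metadata and business glossary terms with a Sentence Transformer (SBERT) model. It is a minimal illustration under stated assumptions, not the implementation discussed in this talk: the open-source sentence-transformers library, the chosen model, and the example columns and glossary entries are all placeholders introduced here.

```python
# Minimal sketch of column-to-concept mapping with an SBERT model.
# Assumptions: the sentence-transformers library, an off-the-shelf model
# ("all-MiniLM-L6-v2"), and toy column/glossary data; none of these are the
# actual models or data used in the system described in this talk.
from sentence_transformers import SentenceTransformer, util

# Enriched column metadata: expanded column name plus a generated description.
columns = [
    "customer identifier: unique number assigned to each customer account",
    "account open date: date on which the customer account was opened",
]

# Business glossary terms with short definitions (hypothetical examples).
glossary = {
    "Customer ID": "A unique identifier assigned to a customer of the organization.",
    "Account Opening Date": "The date on which an account was created for a customer.",
    "Annual Revenue": "The total yearly income generated by a business entity.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode column metadata and glossary entries into dense vectors.
column_embeddings = model.encode(columns, convert_to_tensor=True)
term_texts = [f"{term}: {definition}" for term, definition in glossary.items()]
term_embeddings = model.encode(term_texts, convert_to_tensor=True)

# Cosine similarity between every column and every glossary term.
scores = util.cos_sim(column_embeddings, term_embeddings)

# Annotate each column with its highest-scoring glossary concept.
terms = list(glossary.keys())
for i, column in enumerate(columns):
    best = int(scores[i].argmax())
    print(f"{column!r} -> {terms[best]} (similarity {scores[i][best].item():.2f})")
```

In practice, one would typically threshold such similarity scores so that low-confidence columns are left unmapped or routed to a human reviewer rather than being annotated automatically.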