<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Unleashing the Potential of Data Lakes with Semantic Enrichment Using Foundation Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nandana Mihindukulasooriya</string-name>
          <email>nandana@ibm.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarthak Dash</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sugato Bagchi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Faisal Chowdhury</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alfio Gliozzo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ariel Farkash</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Glass</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Igor Gokhman</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oktie Hassanzadeh</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nhan Pham</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gaetano Rossiello</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Boris Rozenberg</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yehoshua Sagron</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dharmashankar Subramanian</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Toshihiro Takahashi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Takaaki Tateishi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Long Vu</string-name>
        </contrib>
        <aff>IBM Research AI</aff>
      </contrib-group>
      <kwd-group kwd-group-type="author">
        <kwd>Data Lakes</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Semantic Enrichment</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Foundation Models</kwd>
      </kwd-group>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Nowadays, most organizations manage data lakes containing heterogeneous data from various sources. However, the lack of adequate metadata often turns these data lakes into data swamps, making it challenging to locate relevant data for critical organizational tasks and consequently limiting their utility. Recent advancements in large language models and foundation models have enabled the automation of metadata generation with generative AI models, and the use of the generated metadata to map tabular data into semantically richer glossaries, taxonomies, or ontologies.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Existing benchmarks, such as the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab), use public domain data that includes entities that can be linked to open knowledge graphs such as DBpedia or Wikidata. In such settings, systems can first link the cells to entities in KGs and then perform enrichment tasks based on those linked entities. Nevertheless, in an enterprise setting, there are a number of additional challenges. First, table and column names are often abbreviated using data-owner-specific codes or acronyms, with minimal or no textual descriptions. This makes it harder to search for and discover tables using keyword or semantic search.</p>
      <p>Furthermore, most organizations only permit semantic enrichment processes to access the table metadata, such as column headers, and not the actual data (i.e., cell values), due to privacy and access control regulations. Even when the data is available, it often consists of entities that are not present in public knowledge graphs and thus cannot be linked. Therefore, in most industrial settings, automatic metadata generation and mapping of table columns to concepts using only table metadata become necessary.</p>
    </sec>
    <sec id="sec-2">
      <title>Semantic Enrichment Process</title>
      <p>The inputs to our semantic enrichment process are a set of table metadata (table names and column headers) from a data lake and a business glossary that defines the concepts of interest to the organization. The process consists of three steps: (a) column name expansion; (b) table metadata enrichment; and (c) column-to-concept mapping (also known as Column Type Annotation, or CTA).</p>
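      <p>As a rough sketch, the three steps can be composed into a simple pipeline. Everything below (the function names, the toy abbreviation lookup, and the token-overlap heuristic) is an illustrative assumption standing in for the language-model-based components described here, not the actual implementation:</p>

```python
# Illustrative sketch of the three-step semantic enrichment pipeline.
# The abbreviation table and scoring heuristics are toy stand-ins for
# the language-model components used in the real system.

ABBREVIATIONS = {"cust": "customer", "dob": "date of birth", "addr": "address"}

def expand_column_name(column, table, neighbours):
    """Step (a): expand an abbreviated column name, using the table name
    and adjacent column names as context (here: a toy lookup table)."""
    return ABBREVIATIONS.get(column.lower(), column)

def enrich_table_metadata(table, columns):
    """Step (b): produce a human-readable table description (a generative
    model would produce this in the real pipeline)."""
    return f"Table '{table}' with columns: {', '.join(columns)}"

def map_column_to_concept(column, glossary):
    """Step (c): map a column to the glossary concept with the highest
    token overlap (a stand-in for embedding similarity)."""
    tokens = set(column.lower().split())
    return max(glossary, key=lambda c: len(tokens & set(c.lower().split())))

glossary = ["customer name", "date of birth", "postal address"]
columns = ["cust", "dob", "addr"]
expanded = [expand_column_name(c, "clients", columns) for c in columns]
description = enrich_table_metadata("clients", expanded)
concepts = {c: map_column_to_concept(c, glossary) for c in expanded}
```

      <p>In the actual process, each stub would be backed by a model: a language model scores candidate expansions in step (a), a generative model writes descriptions in step (b), and embedding similarity drives step (c).</p>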
      <p>The objective of the column name expansion step is to generate meaningful column names for abbreviated and cryptically coded column names, using the adjacent column names of the same table and the table name as context. The perplexity of language models, along with clues from the business glossaries, is used for this step. For table metadata generation, decoder-only auto-regressive models similar to GPT-4 / Llama 2, or encoder-decoder sequence-to-sequence models similar to FLAN-T5, are trained using either public open data from portals such as data.gov or industrial data when available. The column-to-concept mapping implementation uses Sentence Transformer (SBERT) models to compute similarities between a column metadata representation and a business glossary term representation. As a result of this process, the tables in the data lake are annotated with human-readable descriptions and tags as well as business concepts from glossaries.</p>
      <p>Table metadata can be represented as a knowledge graph containing tables and columns with their relationships. Similarly, the glossary concepts can be represented as a knowledge graph in which concepts are linked using relations such as subclass of or part of. Through the process of semantic enrichment, we connect these two knowledge graphs, unveiling the semantics of the columns and enhancing their utility for downstream tasks.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>The advancements in large language models have enabled automatic metadata generation for tables using generative AI models, and the mapping of columns to glossary concepts using models such as sentence transformers. By representing table metadata, business glossaries, and the mappings between them in a single knowledge graph, downstream applications such as table search and discovery or automatic table joins can utilise this information effectively. In this talk, we plan to discuss the semantic enrichment challenges in an industrial setting, our approach to addressing those challenges, lessons learned, and future directions.</p>
    </sec>
  </body>
</article>