<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LifeTabFusion: A Confidence-Guided Table Understanding via Hybrid Integration of KGs, ML, and LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vishvapalsinhji Parmar</string-name>
          <email>vishvapalsinhji.parmar@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alsayed Algergawy</string-name>
          <email>alsayed.algergawy@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chair of Data and Knowledge Engineering, University of Passau</institution>
          ,
          <addr-line>Passau</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the rapidly increasing volume of tabular data, semantic and structural heterogeneities are increasingly common, which presents significant challenges for effective table understanding in domains such as life sciences and biomedicine. Despite their structured appearance, tables often lack explicit semantics, making tasks like data integration, search, and knowledge graph construction challenging. In this paper, we introduce LifeTabFusion, a modular and forward-looking framework designed to support robust and scalable table understanding. The framework integrates three core components we have previously developed and evaluated individually on benchmark datasets: (i) domain-sensitive preprocessing for anomaly handling and normalization, (ii) lightweight machine learning models for schema-specific annotation, and (iii) scalable knowledge graph annotation via API-based lookups. Each module has demonstrated effectiveness across tasks such as Cell Entity Annotation (CEA), Column Type Annotation (CTA), and Column Property Annotation (CPA) using datasets from the SemTab challenge. Building on these foundations, LifeTabFusion proposes a hybrid architecture that selectively incorporates Large Language Models (LLMs) for contextual disambiguation and semantic enrichment. The final annotations are derived through a confidence-based fusion strategy that leverages the strengths of each component while minimizing individual weaknesses.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Table Understanding</kwd>
        <kwd>Cell Entity Annotation</kwd>
        <kwd>Column Type Annotation</kwd>
        <kwd>Column Property Annotation</kwd>
        <kwd>Knowledge Graph Matching</kwd>
        <kwd>Tabular Data Annotation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Tables are among the most widely used formats for representing structured information, with
applications ranging from scientific publications and spreadsheets to open government data and biomedical
records. Their use dates back millennia, with one of the earliest known examples being a Sumerian
clay tablet from the ancient city of Shuruppag (ca. 2600 BCE), organized in a tabular format to record
data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This early instance underscores the enduring importance of tabular representation in human
knowledge preservation.
      </p>
      <p>In the digital age, the use of tabular data has grown exponentially. By the end of 2028, the global
volume of data is projected to reach approximately 394 zettabytes, a significant portion of which is
expected to be structured in tabular form. Despite their readability and efficiency for human users,
tables are often semantically ambiguous and lack the contextual information required for machines
to process them reliably. Variability in domain-specific content, structure, language, and formatting
further complicates their automated interpretation.</p>
      <p>
        Table understanding aims to address this gap. It refers to the automatic interpretation of a table’s
structure, content, and semantics. This includes tasks like detecting tables in documents, identifying
their functional elements (e.g., headers, stubs), and performing semantic interpretation, which involves
linking cells to entities, classifying column types, and identifying relationships between columns [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
These tasks form the backbone of a growing research field aiming to transform raw tabular data into
semantically rich knowledge representations. To advance research in this area, the SemTab challenge
has become a central benchmark for table-to-knowledge graph matching systems. It provides benchmark
datasets across domains such as food, biomedicine, and biodiversity, and invites systems to annotate
them using structured knowledge graphs like DBpedia, Schema.org, and Wikidata.
      </p>
      <p>
        Using datasets from the SemTab challenge, we have recently developed systems that address table
understanding through minimalist ML-based annotation targeting DBpedia and Schema.org,
domain-sensitive preprocessing pipelines for enhanced accuracy, and efficient Wikidata-driven annotation
utilizing API-based lookup and caching [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. This paper consolidates insights from our previous
work into a forward-looking framework that strategically integrates preprocessing, knowledge graph
APIs, and emerging LLMs, employing LLMs selectively rather than end-to-end, to achieve robust and
scalable table understanding.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Table Understanding: An Overview</title>
      <sec id="sec-2-1">
        <title>2.1. Core Tasks in Semantic Table Interpretation</title>
        <p>The three fundamental tasks in semantic table interpretation are CEA, CTA, and CPA. Each task plays a
distinct role in transforming a plain table into a semantically meaningful representation. To illustrate
these tasks, consider the biomedical table shown in Table 1, which lists drugs, their biological target
proteins, and approval years.</p>
        <p>CEA links individual cell values to entities in a knowledge graph, enhancing semantic depth; for
example, the value “Aspirin” in the Drug Name column of Table 1 can be linked to the Wikidata
(https://www.wikidata.org/wiki/Wikidata:Main_Page) entity wd:Q18216, and “COX-1” to wd:Q410251,
representing cyclooxygenase-1. CTA assigns semantic
types to columns, such as identifying Drug Name as a subclass of Pharmaceutical Drug (wd:Q12140),
Target Protein as Protein family (wd:Q417841), and Approval Year as calendar year (wd:Q3186692). CPA
discovers relationships between columns using properties from a knowledge graph; for instance, Drug
Name and Approval Year may be linked by publication date (wdt:P577), while Drug Name and Target
Protein may use a domain-specific property like has target.</p>
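        <p>As a minimal sketch, the three annotation types for the example above could be held in plain records. The class names, the helper <monospace>entity_for</monospace>, and the row index 0 are assumptions made for illustration; only the knowledge graph identifiers are the ones quoted in the text.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellAnnotation:
    """CEA: links one table cell to one knowledge graph entity."""
    row: int
    column: str
    value: str
    entity: str


@dataclass(frozen=True)
class ColumnTypeAnnotation:
    """CTA: assigns a semantic type to a whole column."""
    column: str
    type_: str


@dataclass(frozen=True)
class ColumnPropertyAnnotation:
    """CPA: relates a subject column to an object column via a property."""
    subject_column: str
    object_column: str
    property_: str


# Identifiers are those quoted in the text; row index 0 is an assumption.
cea = [
    CellAnnotation(0, "Drug Name", "Aspirin", "wd:Q18216"),
    CellAnnotation(0, "Target Protein", "COX-1", "wd:Q410251"),
]
cta = [
    ColumnTypeAnnotation("Drug Name", "wd:Q12140"),
    ColumnTypeAnnotation("Target Protein", "wd:Q417841"),
    ColumnTypeAnnotation("Approval Year", "wd:Q3186692"),
]
cpa = [ColumnPropertyAnnotation("Drug Name", "Approval Year", "wdt:P577")]


def entity_for(annotations, row, column):
    """Return the entity linked to a cell, or None if the cell is unannotated."""
    for annotation in annotations:
        if annotation.row == row and annotation.column == column:
            return annotation.entity
    return None
```
]]></preformat>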
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Techniques Used for Semantic Table Interpretation</title>
        <p>
          The Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab) has been
running annually since 2019, promoting standardized evaluation of systems on tasks such as CEA,
CTA, and CPA. Systems submitted to SemTab use a wide range of strategies. Among rule-based or heuristic
systems, JenTab uses handcrafted rules to align columns and cells with entities and properties based
on label matching, type constraints, and schema-based heuristics [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Another system, MantisTable,
combines heuristics, string similarity, column-type detection, and concept linking to interpret tables
using resources like DBpedia and Wikidata [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. ML-based systems have also shown promising results.
For instance, TURL uses a structure-aware Transformer encoder
tailored for tabular data [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. These systems often rely on token frequency, column uniqueness, or
embedding-based similarity. Among knowledge graph (KG)-driven systems, SemTEX leverages structured
lookups using DBpedia or Wikidata APIs to directly retrieve candidate annotations and rank them using
gradient boosting [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Role of LLMs in Semantic Table Annotation</title>
      <p>To address the core challenges of data noise, limited interpretability, and scalability in semantic table
interpretation, our research contributes a modular pipeline combining domain-aware preprocessing,
minimalist ML models, and scalable KG-based entity linking. We also reflect on recent advances in
large language models (LLMs), which offer promising capabilities for hybrid frameworks. This section
first summarizes our contributions, followed by a brief discussion of complementary LLM-based efforts.</p>
      <sec id="sec-3-1">
        <title>3.1. Modular and Scalable Table Annotation</title>
        <p>
          To address the challenges of noisy data, scalability, and interpretability in semantic table annotation,
we developed a modular pipeline comprising three components: (i) an ML-based structure annotation
module, (ii) a scalable knowledge graph-driven lookup system, and (iii) a domain-aware preprocessing
pipeline. Our first contribution, DREIFLUSS, introduced in SemTab 2023 Round 2 [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], is a minimalist
logistic regression model tailored for CTA and CPA tasks. It employs count-vectorized features
extracted from tabular content and uses stratified sampling to handle label imbalance when working with
Schema.org and DBpedia. Despite limited training data, DREIFLUSS demonstrated competitive
performance, particularly on the CPA (DBpedia) task. This work shows that a simple approach
can obtain competitive results when combined with better sampling techniques. Building on this foundation, we
developed a scalable CEA system for SemTab 2024 that uses live Wikidata API calls to perform cell-to-entity
matching. To ensure throughput and robustness, we implemented multithreaded querying via Python’s
ThreadPoolExecutor, paired with caching mechanisms and a custom rate limiter to prevent API
throttling. This system efficiently handles large tables across life science domains such as biodiversity
and biomedicine [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Most recently, we introduced a domain-aware preprocessing pipeline [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which
performs anomaly detection (e.g., missing values, special symbols), normalization (e.g., multilingual
variant alignment, abbreviation expansion), and rule-based refinement prior to annotation. This step
alone yielded significant performance boosts for CEA across noisy datasets. A comprehensive summary
of F1 scores and performance improvements achieved by our system across different tasks and datasets
from the SemTab challenge is presented in Table 2. For the reproducibility of our work, we have made all
our systems available on GitHub: DREIFLUSS (https://github.com/vishvapalsinh/cta-cpa-schemaorg-dbpedia),
the Wikidata-driven annotation system (https://github.com/vishvapalsinh/CEACTA24), and the
preprocessing pipeline (https://github.com/DKEPassau/PreprocessMatch). Together, these contributions reflect our commitment to building interpretable, efficient,
and domain-agnostic solutions for semantic table understanding. Each component is independently
deployable and complements the others, setting the stage for more integrated hybrid frameworks.
        </p>
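        <p>The querying strategy described above (thread pool, cache, rate limiter) can be sketched as follows. This is an illustration under stated assumptions and not the released implementation: <monospace>RateLimiter</monospace>, <monospace>CachedAnnotator</monospace>, and the injected <monospace>lookup</monospace> callable are hypothetical names, and <monospace>lookup</monospace> stands in for whatever knowledge graph API request the system issues.</p>
        <preformat><![CDATA[
```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Space calls at least `min_interval` seconds apart across all threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay > 0:
            time.sleep(delay)


class CachedAnnotator:
    """Annotate cell values in parallel, looking each distinct value up once."""

    def __init__(self, lookup, max_workers=4, min_interval=0.0):
        self.lookup = lookup  # hypothetical callable: cell value -> annotation
        self.max_workers = max_workers
        self.limiter = RateLimiter(min_interval)
        self.cache = {}
        self._cache_lock = threading.Lock()

    def _fetch(self, value):
        self.limiter.wait()
        result = self.lookup(value)
        with self._cache_lock:
            self.cache[value] = result

    def annotate(self, values):
        # Collapse duplicates and skip values already cached by earlier calls.
        with self._cache_lock:
            missing = [v for v in dict.fromkeys(values) if v not in self.cache]
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            list(pool.map(self._fetch, missing))  # list() re-raises worker errors
        return [self.cache[v] for v in values]
```
]]></preformat>
        <p>Keeping the cache on the annotator object means that repeated values, both within a table and across tables, trigger only one request each.</p>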
      </sec>
      <sec id="sec-3-2">
        <title>3.2. LLM-based Annotation and Motivation for Hybrid Integration</title>
        <p>
          Recent advances demonstrate LLMs’ potential for semantic table interpretation, with systems like
CitySTI achieving effective cell-level entity disambiguation through LLM-based ranking and cleaning
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], while GPT-3 prompting reaches over 92% F1 across CEA, CTA, and topic detection in zero-shot
settings [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], and LLM-driven CPA methods show fine-tuned GPT-3.5 outperforming traditional ML
systems in column relationship identification [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Unlike traditional approaches requiring feature
engineering or API integration, LLMs leverage contextual signals from table structure, headers, and
surrounding text, offering enhanced capabilities for ambiguous, incomplete, or multilingual data.
Building on the results of our specialized modules, we envision hybrid frameworks that combine the
interpretability of the structured methods described in Section 3.1 with the contextual reasoning power of LLMs, as outlined in
our proposed modular architecture.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Future Framework: A Modular Hybrid Architecture</title>
      <p>The proposed architecture, which we refer to as LifeTabFusion, is a modular and hybrid system designed
to address the complex challenges of semantic table understanding in real-world domains. The
framework integrates four main components: domain-sensitive preprocessing, parallelized semantic
annotation via API calls, and lightweight ML models, all developed in our earlier work, together with
LLMs for disambiguation and interpretation.</p>
      <sec id="sec-4-1">
        <title>4.1. Pipeline of Framework</title>
        <p>The pipeline, as illustrated in Figure 1, begins with the ingestion of raw input tables in various formats,
including CSV, Excel, or tables extracted from PDFs. These input tables are initially processed through
a comprehensive preprocessing module that performs data standardization, anomaly detection, and
domain-aware cleaning. These preprocessing steps have been shown to significantly enhance downstream
annotation accuracy.</p>
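        <p>To make the preprocessing stage concrete, the following is a minimal sketch of a per-cell cleaning function. The missing-value tokens, the abbreviation dictionary, and the function name <monospace>clean_cell</monospace> are assumptions chosen for illustration; the actual module may apply different or additional rules.</p>
        <preformat><![CDATA[
```python
import re
import unicodedata

# Assumed tokens that mark a cell as missing; a real pipeline may differ.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none", "-", "?"}

# Hypothetical domain dictionary for abbreviation expansion.
ABBREVIATIONS = {"e. coli": "Escherichia coli"}


def clean_cell(raw):
    """Return a normalized cell value, or None if the cell counts as missing."""
    if raw is None:
        return None
    # Standardization: unify Unicode variants and trim outer whitespace.
    text = unicodedata.normalize("NFKC", str(raw)).strip()
    # Anomaly detection: explicit missing-value markers.
    if text.lower() in MISSING_TOKENS:
        return None
    # Cleaning: drop special symbols but keep letters, digits, '.', '-' and '_'.
    text = re.sub(r"[^\w\s.\-]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    if not text:
        return None
    # Domain-aware step: expand known abbreviations.
    return ABBREVIATIONS.get(text.lower(), text)
```
]]></preformat>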
        <p>
          Following preprocessing, the cleaned table data is directed through two parallel processing paths to
maximize efficiency and coverage. The first path employs an ML-based system that handles annotation
tasks using lightweight algorithms, specifically logistic regression and gradient boosting models trained
on knowledge graph features extracted from Schema.org and DBpedia. The second path implements
parallel knowledge graph-based entity annotation, where individual cell values are processed through
multi-threaded API calls to external knowledge graphs such as Wikidata. This approach incorporates
intelligent caching mechanisms and rate-limiting strategies to ensure high-throughput annotation
of large tables while maintaining annotation accuracy. In the subsequent enrichment phase, Large
Language Models, such as GPT-4, are employed to validate and enhance the results from both parallel
paths. The LLM layer serves multiple critical functions: resolving ambiguities in entity
disambiguation, performing zero-shot and few-shot predictions for missing type or property annotations, and
interpreting contextual information from neighboring cells, column headers, and table captions. This
contextual understanding enables more accurate and comprehensive annotations. The outputs from all
three components (ML models, KG annotations, and LLM enrichments) are systematically integrated
through a fusion module that employs a confidence-based selection strategy [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. This approach ensures
robust final annotations by leveraging the strengths of each component while mitigating individual
weaknesses. The annotated output can be exported in the desired format. Although the approach is
promising, the proposed framework is still in its early stages, and using LLMs will be costly. To reduce
the cost, we need to explore open-source LLMs or smaller models fine-tuned on
domain-specific data, which might not be as effective as paid services.
        </p>
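        <p>One simple way to realize a confidence-based selection is sketched below. It is only an illustration: the source names, the per-source weights, the threshold, and the choice of taking the maximum weighted confidence per label are all assumptions, since the framework does not yet fix a specific aggregation rule.</p>
        <preformat><![CDATA[
```python
# Hypothetical per-source weights; no concrete values are specified in the text.
SOURCE_WEIGHTS = {"ml": 1.0, "kg": 1.0, "llm": 1.0}


def fuse(candidates, weights=SOURCE_WEIGHTS, threshold=0.0):
    """Select one label from (source, label, confidence) candidates.

    Each label keeps the highest weighted confidence that any source
    assigned to it, and the label with the best score is selected.
    Returns (label, score), or (None, 0.0) if there are no candidates
    or the best score falls below `threshold`.
    """
    best = {}
    for source, label, confidence in candidates:
        score = weights.get(source, 1.0) * confidence
        if score > best.get(label, 0.0):
            best[label] = score
    if not best:
        return None, 0.0
    label, score = max(best.items(), key=lambda item: item[1])
    if score < threshold:
        return None, 0.0
    return label, score


# Example: three components propose a type for one column (made-up scores).
column_candidates = [
    ("ml", "wd:Q12140", 0.62),
    ("kg", "wd:Q12140", 0.80),
    ("llm", "wd:Q417841", 0.71),
]
selected_label, selected_score = fuse(column_candidates)
```
]]></preformat>
        <p>A threshold above zero lets the fusion module abstain when no component is sufficiently confident, which is one way to route only the uncertain cases to a more expensive LLM call.</p>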
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4 in order to: Grammar and spelling check.
After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s)
full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Campbell-Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Robson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Croarken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Flood</surname>
          </string-name>
          ,
          <article-title>The history of mathematical tables: From Sumer to spreadsheets</article-title>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. O.</given-names>
            <surname>Shigarov</surname>
          </string-name>
          ,
          <article-title>Table understanding: Problem overview</article-title>
          ,
          <source>Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery</source>
          <volume>13</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Algergawy</surname>
          </string-name>
          ,
          <article-title>DREIFLUSS: A minimalist approach for table matching</article-title>
          , in: SemTab@ISWC,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Algergawy</surname>
          </string-name>
          ,
          <article-title>Wikidata-driven CEA and CTA for life sciences table matching extending DREIFLUSS</article-title>
          , in: SemTab@ISWC,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hadder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Algergawy</surname>
          </string-name>
          ,
          <article-title>On the role of preprocessing on matching tables to knowledge graphs</article-title>
          , in: EKAW-PDWT,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Abdelmageed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schindler</surname>
          </string-name>
          ,
          <article-title>JenTab meets SemTab 2021's new challenges</article-title>
          , in: SemTab@ISWC,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cremaschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Avogadro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chieregato</surname>
          </string-name>
          ,
          <article-title>MantisTable: An automatic approach for the semantic table interpretation</article-title>
          , in: SemTab@ISWC,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lees</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>TURL: Table understanding through representation learning</article-title>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Henriksen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Khorsid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Nielsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Stück</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Sørensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelgrin</surname>
          </string-name>
          ,
          <article-title>SemTEX: A hybrid approach for semantic table interpretation</article-title>
          , in: SemTab@ISWC,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. L. T.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jiménez-Ruiz</surname>
          </string-name>
          ,
          <article-title>CitySTI 2024 system: Tabular data to KG matching using LLMs</article-title>
          , in: SemTab@ISWC,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Bikim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Atezong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jiomekong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oelen</surname>
          </string-name>
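          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rabby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>D'Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>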
          ,
          <article-title>Leveraging GPT models for semantic table annotation</article-title>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Korini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <article-title>Column property annotation using large language models</article-title>
          ,
          in: <source>European Semantic Web Conference</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>61</fpage>
          -
          <lpage>70</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Betz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lüdtke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Meilicke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Stuckenschmidt</surname>
          </string-name>
          ,
          <article-title>Rule confidence aggregation for knowledge graph completion</article-title>
          ,
          in: <source>International Joint Conference on Rules and Reasoning</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>32</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>