<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TAFT: A Transformer-Based Approach for Format Transformation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Erik Schönwälder</string-name>
          <email>erik.schoenwaelder@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julius Gonsior</string-name>
          <email>julius.gonsior@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anja Reusch</string-name>
          <email>anja.reusch@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Hartmann</string-name>
          <email>claudio.hartmann@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wolfgang Lehner</string-name>
          <email>wolfgang.lehner@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Database Research Group, Technische Universität Dresden</institution>
          ,
          <addr-line>Dresden</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The presence of heterogeneous data formats within data lakes poses challenges when attempting to analyze or further process such data. While data cleaning tools can remove heterogeneities within individual documents, they fail to address global format heterogeneities across multiple documents. For example, two documents may each store addresses in an internally consistent format, so neither is flagged by existing data cleaning tools. However, these consistent formats may still differ from each other, thereby constituting a global format heterogeneity. To close this gap, we present the framework TAFT (A Transformer-based Approach for Format Transformation), designed to remove these global format heterogeneities at scale, without human-in-the-loop involvement. To this end, we leverage a transformer-based model to convert the document columns into a uniform format based on types describing their content, such as Address or Name. With minimal configuration effort, we achieve state-of-the-art results without any further human intervention.</p>
      </abstract>
      <kwd-group>
        <kwd>format transformation</kwd>
        <kwd>data preparation</kwd>
        <kwd>heterogeneity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Data lakes are widely deployed to collect data in a central repository in its original, raw format,
without initial processing [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Target use-cases, like analyses, are often unknown during data
collection, leading to the majority of stored data being non-standardized, requiring a pre-processing
step before further processing [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Due to this heterogeneity and particularly owing to the absence of a
standard representation throughout the data lake, a substantial portion of the data fails to be directly
consumable by downstream applications, such as analytical tools or management systems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Aiming to make the data consumable for these applications, data scientists undertake a labor-intensive
process called data preparation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. To enhance productivity, data cleaning tools like Raha [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and
Wrangler [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] assist data scientists in this task, as demonstrated by the typical workflow depicted in
Fig. 1. After selecting relevant documents for analysis, the data scientist uses data cleaning tools to
clean and adjust the data. Starting with Document 1, the cleaning tool flags Lily Mia Smith as a pattern
violation due to its differing format and detects missing values in the Country column and outliers in the
Age column. After cleaning Document 1, the data scientist moves on to the next document and so forth.
While this workflow effectively removes local errors within individual documents, it fails to detect
global errors that become apparent only when considering all documents in a corpus. For example,
after correcting a pattern violation in the Name column, Document 1 shows no format heterogeneities.
However, inconsistencies may arise when comparing Document 1 with other documents. Document
N also has a Name column in a uniform format, but it does not match the format of Document 1
(see Fig. 1). Furthermore, the Country columns do not match in format either. As a result, sorting,
aggregating, or joining these documents becomes infeasible, which prevents flexible data analysis
on file-based data storage.
      </p>
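      <p>The distinction between local and global format heterogeneity can be illustrated with a minimal sketch. It assumes a simple character-class signature (letter runs become A, digit runs become 9) as a stand-in for a pattern check; this heuristic and the sample values are hypothetical and are neither part of TAFT nor taken from Fig. 1.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re
from typing import List, Set


def signature(value: str) -> str:
    """Abstract a value into a coarse pattern: letter runs -> 'A', digit runs -> '9'."""
    value = re.sub(r"[A-Za-z]+", "A", value)
    return re.sub(r"[0-9]+", "9", value)


def column_signatures(values: List[str]) -> Set[str]:
    """Distinct patterns in a column; a single element means the column is locally uniform."""
    return {signature(v) for v in values}


# Hypothetical values for illustration only (not the values shown in Fig. 1).
doc_1 = ["John S.", "Mary K.", "Paul B."]
doc_n = ["Smith, John", "Kent, Mary", "Brown, Paul"]

sig_1 = column_signatures(doc_1)
sig_n = column_signatures(doc_n)

# Each column is consistent on its own, so a per-document check finds nothing.
locally_uniform = len(sig_1) == 1 and len(sig_n) == 1
# Only the comparison across documents reveals the mismatch.
globally_uniform = len(sig_1 | sig_n) == 1
print(locally_uniform, globally_uniform)
```
]]></preformat>
      <p>In this toy setting, both columns pass a per-document check, while the union of their signatures contains two patterns, which is the kind of mismatch a document-level tool cannot observe.</p>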
      <p>
        Existing data cleaning tools, such as Raha [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or Wrangler [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], cannot detect these global errors
because they operate at the level of individual documents: from the perspective of a single document, a
uniformly formatted column appears to have no errors. As an alternative, extensive research (e.g.,
FlashFill [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], UDATA [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) has been conducted to fix format heterogeneities within the task of format
transformation. This task refers to the conversion of data of the same logical type into a uniform
representation. However, applying these systems globally is not feasible either, as they require user
involvement for each transformation, which is impractical and lacks scalability for thousands or millions
of documents.
      </p>
      <p>Addressing these issues, we propose TAFT<sup>1</sup>, an end-to-end framework designed to detect and correct
format heterogeneities in a global manner, with a specific emphasis on achieving this without
human-in-the-loop intervention.</p>
    </sec>
    <sec id="sec-2">
      <title>2. TAFT: a framework to detect global heterogeneities</title>
      <p>The abstract architecture of TAFT is divided into two stages: the Detection Stage, which annotates
columns with types, and the Correction Stage, which converts these columns into a uniform format
predefined for each type.</p>
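      <p>As a rough sketch of this two-stage architecture, the following Python snippet wires a pluggable annotator and a pluggable corrector together. All names (run_taft_like_pipeline, toy_annotate, toy_correct) are hypothetical, and the toy functions are rule-based stand-ins for the learned models, not the actual TAFT implementation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from datetime import datetime
from typing import Callable, Dict, List, Optional

Column = List[str]
Annotator = Callable[[Column], Optional[str]]   # Detection Stage: values -> type
Corrector = Callable[[Column, str], Column]     # Correction Stage: values, format -> values


def run_taft_like_pipeline(
    columns: Dict[str, Column],
    annotate: Annotator,
    correct: Corrector,
    target_formats: Dict[str, str],
) -> Dict[str, Column]:
    """Annotate each column with a type, then convert it to that type's format."""
    result: Dict[str, Column] = {}
    for name, values in columns.items():
        col_type = annotate(values)
        target = target_formats.get(col_type) if col_type else None
        # Columns without a known type or target format pass through unchanged.
        result[name] = correct(values, target) if target else list(values)
    return result


def toy_annotate(values: Column) -> Optional[str]:
    """Stand-in for a CTA model: labels a column 'Date' if every value parses."""
    try:
        for v in values:
            datetime.strptime(v, "%B %d, %Y")
        return "Date"
    except ValueError:
        return None


def toy_correct(values: Column, target: str) -> Column:
    """Stand-in for the learned corrector: reformats dates via strptime/strftime."""
    return [datetime.strptime(v, "%B %d, %Y").strftime(target) for v in values]


tables = {
    "birthday": ["November 01, 2002", "May 27, 1997"],
    "note": ["call back", "n/a"],
}
uniform = run_taft_like_pipeline(tables, toy_annotate, toy_correct, {"Date": "%d/%m/%Y"})
print(uniform)
```
]]></preformat>
      <p>Passing the annotator and corrector as plain callables mirrors the idea that the components of each stage can be exchanged without touching the surrounding pipeline.</p>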
      <p>
        During the Detection Stage, columns in documents within a given document collection are annotated
with types that classify their content, such as Name, Country, or Address. This annotation is performed
by a Column Type Annotation (CTA) model, freely chosen by the data scientist, allowing for the
integration of corresponding research and new approaches into TAFT. For our experiments, we selected
the state-of-the-art model DODUO [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], as it outperforms other models like SATO [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Beyond using
pretrained CTA models with a defined type set like DODUO, TAFT also supports the creation of custom
types and the training of the chosen CTA model for domain-specific adaptations. For example, a
telecommunications company deals with data that cannot be classified using a general type set. Here,
specific types such as IPv4, IPv6, RGB, or MAC addresses become relevant, as data of these types may
be inconsistently formatted. While using a pretrained CTA model is straightforward, tailoring it for
domain-specific contexts with custom types requires appropriate data.
      </p>
      <p>To efficiently generate such data, we use fuzzy generators that produce synthetic data samples sharing
specific traits, like Names or Addresses. For most types, fuzzy generators can be built with minimal effort
from existing Python or Java packages. Additionally, many packages, like random_address or names,
incorporate real-world data, extending the utility of fuzzy generators beyond synthetic data.</p>
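      <p>A fuzzy generator for a single type might look like the following minimal sketch. It uses only the Python standard library; the list of surface formats, the date range, and the function names are assumptions made for illustration and do not reflect the generators shipped with TAFT.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
from datetime import date, timedelta
from typing import List, Tuple

# Hypothetical surface formats for the Date type (strftime notation).
DATE_FORMATS = ["%B %d, %Y", "%d.%m.%Y", "%Y-%m-%d", "%d/%m/%Y"]


def random_date(rng: random.Random) -> date:
    """Draw a date uniformly from 1950-01-01 to 2020-12-31 (arbitrary range)."""
    start = date(1950, 1, 1)
    span_days = (date(2020, 12, 31) - start).days
    return start + timedelta(days=rng.randint(0, span_days))


def fuzzy_date_column(n_rows: int, rng: random.Random) -> Tuple[str, List[str]]:
    """Generate one column whose rows all share a randomly chosen format."""
    fmt = rng.choice(DATE_FORMATS)
    return fmt, [random_date(rng).strftime(fmt) for _ in range(n_rows)]


rng = random.Random(0)  # fixed seed so that runs are repeatable
fmt, column = fuzzy_date_column(5, rng)
print(fmt, column)
```
]]></preformat>
      <p>Rendering the same underlying dates in two different formats would yield paired input and target columns, which is one plausible way to obtain training data for both stages.</p>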
      <p>In the Correction Stage, annotated columns are transformed into a uniform format predefined for
each type. To achieve this, a transformer-based model, specifically flan-t5-large from the FLAN-T5
family, is trained to perform the correction. To convert a column into the target format, the model takes
the column’s values, the desired format, and the task-specific prefix reshape: as input. For example,
DD/MM/YYYY reshape: November 01, 2002 [ROW] May 27, 1997 [ROW] ... represents the
input for a date column to be formatted as DD/MM/YYYY. The special token [ROW] separates individual
column values. As output, the model produces a text sequence, such as 01/11/2002 [ROW] 27/05/1997
[ROW] ..., containing the entered values transformed into the desired format. The generic
design of the data generation process from the Detection Stage can also be utilized to generate data for
the Correction Stage.</p>
      <p><sup>1</sup>Code, data, and models are available at https://github.com/goodguyerik/TAFT</p>
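      <p>The serialization of a column into the model input, and the splitting of a generated sequence back into values, can be sketched as follows. The helper names are hypothetical, and the exact whitespace handling and the absence of a trailing [ROW] token in the input are assumptions, since the example in the text is elided.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import List

ROW_TOKEN = "[ROW]"        # special token separating column values
TASK_PREFIX = "reshape:"   # task-specific prefix


def build_model_input(values: List[str], target_format: str) -> str:
    """Serialize a column as '<format> reshape: v1 [ROW] v2 ...'."""
    return f"{target_format} {TASK_PREFIX} " + f" {ROW_TOKEN} ".join(values)


def parse_model_output(text: str) -> List[str]:
    """Split a generated sequence on [ROW] and drop empty fragments."""
    return [part.strip() for part in text.split(ROW_TOKEN) if part.strip()]


source = build_model_input(["November 01, 2002", "May 27, 1997"], "DD/MM/YYYY")
print(source)
print(parse_model_output("01/11/2002 [ROW] 27/05/1997 [ROW]"))
```
]]></preformat>
      <p>The resulting string would then be tokenized and passed to the fine-tuned sequence-to-sequence model; that step is omitted here because it depends on the released checkpoint.</p>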
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>
        We evaluated the FLAN-T5 model, which is designed to transform columns into a predefined uniform
format, by comparing it to two promising alternatives: UDATA [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and FlashFill [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For our experimental
setup, we generated 1,000 test columns, each in both an input and output format, for eight distinct column
types (Address, Continent, Country, Date, Name, Phone, Sex, Unit) using our data generation method.
Fig. 2 shows the percentage of correctly transformed columns for each type and correction approach. As
demonstrated, our FLAN-T5 model outperforms UDATA and FlashFill, except for the Name
and Unit types. For the Continent, Country, and Sex types, which necessitate semantic understanding for
transformation, FLAN-T5 considerably outperforms both FlashFill and UDATA. For instance, converting
a country name into its corresponding country code requires semantic understanding of the correct
mapping, as seen with Germany → DEU. FLAN-T5 learns these mappings during training without
access to an external knowledge base, whereas UDATA and FlashFill rely on observing patterns in
provided examples, and no such syntactic pattern exists for these mappings. For the other types, which only require syntactic operations
such as changing the order of components, adding delimiters, or deleting components or parts thereof,
FLAN-T5 achieves results comparable to FlashFill, while UDATA either fails or only moderately performs
the desired transformations.
      </p>
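      <p>The reported metric, the percentage of correctly transformed columns per type, could be computed along the lines of the sketch below. It assumes that a column counts as correct only if every predicted value matches its target exactly; this criterion, the function name, and the toy data are assumptions and not the paper's measurements.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (column type, predicted values, expected values)
Result = Tuple[str, List[str], List[str]]


def column_accuracy(results: List[Result]) -> Dict[str, float]:
    """Percentage of columns per type whose prediction equals the target exactly."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for col_type, predicted, expected in results:
        total[col_type] += 1
        if predicted == expected:
            correct[col_type] += 1
    return {t: 100.0 * correct[t] / total[t] for t in total}


# Made-up toy results for illustration only.
toy = [
    ("Date", ["01/11/2002"], ["01/11/2002"]),
    ("Date", ["27/05/97"], ["27/05/1997"]),
    ("Country", ["DEU"], ["DEU"]),
]
print(column_accuracy(toy))
```
]]></preformat>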
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>We proposed the two-stage framework TAFT, which addresses global format heterogeneities in data
lakes without human intervention. Our evaluation shows that the language model-based correction
approach, especially due to its semantic understanding, outperforms state-of-the-art systems while
being highly customizable. The framework frees data scientists from manually resolving global format
heterogeneities, allowing them to focus on other data preparation tasks.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The authors gratefully acknowledge the computing time made available to them on the high-performance
computer at the NHR Center of TU Dresden. This center is jointly supported by the Federal Ministry of
Education and Research and the state governments participating in the NHR
(www.nhr-verein.de/unserepartner).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nambiar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mundra</surname>
          </string-name>
          ,
          <article-title>An overview of data warehouse and data lake in modern enterprise data management</article-title>
          ,
          <source>Big Data and Cognitive Computing</source>
          <volume>6</volume>
          (
          <year>2022</year>
          )
          <elocation-id>132</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.3390/bdcc6040132</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hameed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <article-title>Data preparation: A survey of commercial tools</article-title>
          ,
          <source>SIGMOD Rec</source>
          .
          <volume>49</volume>
          (
          <year>2020</year>
          )
          <fpage>18</fpage>
          -
          <lpage>29</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1145/3444831.3444835</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mahdavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Castro</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Raha: A configuration-free error detection system</article-title>
          ,
          in:
          <source>Proceedings of the 2019 International Conference on Management of Data, SIGMOD '19</source>
          ,
          <publisher-name>Association for Computing Machinery</publisher-name>
          ,
          <publisher-loc>New York, NY, USA</publisher-loc>
          ,
          <year>2019</year>
          , pp.
          <fpage>865</fpage>
          -
          <lpage>882</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1145/3299869.3324956</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kandel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paepcke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Heer</surname>
          </string-name>
          ,
          <article-title>Wrangler: interactive visual specification of data transformation scripts</article-title>
          ,
          in:
          <source>Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '11</source>
          ,
          <publisher-name>Association for Computing Machinery</publisher-name>
          ,
          <publisher-loc>New York, NY, USA</publisher-loc>
          ,
          <year>2011</year>
          , pp.
          <fpage>3363</fpage>
          -
          <lpage>3372</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1145/1978942.1979444</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gulwani</surname>
          </string-name>
          ,
          <article-title>Automating string processing in spreadsheets using input-output examples</article-title>
          ,
          <source>SIGPLAN Not</source>
          .
          <volume>46</volume>
          (
          <year>2011</year>
          )
          <fpage>317</fpage>
          -
          <lpage>330</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1145/1925844.1926423</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Knoblock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pujara</surname>
          </string-name>
          ,
          <article-title>Learning data transformations with minimal user effort</article-title>
          ,
          in:
          <source>2019 IEEE International Conference on Big Data (Big Data)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>657</fpage>
          -
          <lpage>664</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/BigData47090.2019.9006350</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Suhara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ç.</given-names>
            <surname>Demiralp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Annotating columns with pretrained language models</article-title>
          ,
          in:
          <source>Proceedings of the 2022 International Conference on Management of Data, SIGMOD '22</source>
          ,
          <publisher-name>Association for Computing Machinery</publisher-name>
          ,
          <publisher-loc>New York, NY, USA</publisher-loc>
          ,
          <year>2022</year>
          , pp.
          <fpage>1493</fpage>
          -
          <lpage>1503</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1145/3514221.3517906</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hulsebos</surname>
          </string-name>
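          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Suhara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ç.</given-names>
            <surname>Demiralp</surname>
          </string-name>
          ,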
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Sato: contextual semantic type detection in tables</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>13</volume>
          (
          <year>2020</year>
          )
          <fpage>1835</fpage>
          -
          <lpage>1848</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.14778/3407790.3407793</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>