<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Verification of LLM-Powered Structured Data Extraction from Image Files</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Katherine Thornton</string-name>
          <email>katherine.thornton@yale.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kenneth Seals-Nutt</string-name>
          <email>kenneth@seals-nutt.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mika Matsuzaki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcel Nguemaha</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Johns Hopkins Bloomberg School of Public Health</institution>
          ,
          <addr-line>615 N Wolfe St, Baltimore, MD 21205</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>WikiFCD Collaborative</institution>
          ,
          <addr-line>New York, New York</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>WikiFCD Collaborative</institution>
          ,
          <addr-line>Olympia, WA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <volume>17</volume>
      <issue>2025</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Many food composition tables are published as PDF files. Data in these files are relevant to researchers investigating changes in food composition values over time, regional differences in values, and other questions. While there are software tools to extract structured data from more recent versions of PDF, these tools do not support the earliest versions of PDF. We tested an LLM-based workflow for creating CSV files of structured data extracted from legacy PDF files that we converted to PNG files. We manually reviewed each value reported by the LLM to determine the suitability of this approach. We found multiple inaccurate values in the dataset extracted by the LLM. While this approach is insufficient as a stand-alone method, we discuss the potential for human-in-the-loop workflows to leverage the power of LLMs to assist with data extraction from legacy versions of PDF files.</p>
      </abstract>
      <kwd-group>
        <kwd>Food Composition</kwd>
        <kwd>Nutri-informatics</kwd>
        <kwd>Wikibase</kwd>
        <kwd>Wikidata</kwd>
        <kwd>artificial intelligence</kwd>
        <kwd>large language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Researchers interested in the nutritional composition of foods commonly eaten in Cameroon may need
to consult the food composition tables published in 1957 and 1966. Each of these tables was published
in print first and is also available digitally in the Portable Document Format (PDF). As members of
the WikiFCD community, we aim to make food composition data available in our knowledge base1.
After data from a food composition table (FCT) are added to WikiFCD, they can be combined with other
data on the web more easily. Querying WikiFCD allows users to ask questions about food data across
multiple food composition tables at once.</p>
      <p>We value food composition tables that were published decades ago because they provide data useful
for comparison with food composition values measured more recently. In some cases these older FCTs
may be the only data source for a particular region of the world. While these older FCTs are important
sources for WikiFCD, they can be challenging to work with in our software pipelines because PDF is
not a machine-readable file format.</p>
    </sec>
    <sec id="sec-3">
      <title>1. Related Work</title>
      <p>
        People who publish documents on the web often use the Portable Document Format (PDF) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Variations in the PDF format lead to differences in how data can be extracted from files [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Differences between versions of PDF and the character sets used to encode text, among other issues, present challenges for extracting data from these files [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Joint Ontology Workshops (JOWO) - Episode XI: The Sicilian Summer under the Etna, co-located with the 15th International. CEUR Workshop Proceedings, ISSN 1613-0073. ∗Corresponding author.</p>
      <p>
        People are exploring the use of LLMs to automate tasks [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Some have used LLMs for data extraction [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This technique works when the LLM can understand the format of the original file. We found that
GPT-4-Turbo could not extract tabular data directly from our PDF file. Researchers have demonstrated that a
drawback of using LLMs is that they are known to provide plausible but incorrect responses, sometimes
termed “hallucination” [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ]. Our concern about the risk of hallucinated data values in this
data extraction process motivated our decision to manually review each value the LLM reported.
      </p>
    </sec>
    <sec id="sec-4">
      <title>2. WikiFCD</title>
      <p>
        We created WikiFCD to offer web-based access to food composition tables from around the world [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
WikiFCD is a knowledge base of food items and food composition data that is free for anyone to reuse. To
make data from WikiFCD easier to reuse, we also map food items to FoodOn [
        <xref ref-type="bibr" rid="ref10">10, 11</xref>
        ]. We
maintain mappings to food items in Wikidata, which allows us to integrate our food composition data
with the data in Wikidata, as well as with datasets that also map their identifiers to Wikidata. This type
of integration across databases opens many pathways for investigating how nutritional
intake interacts with topics related to human health [12].
      </p>
    </sec>
    <sec id="sec-5">
      <title>3. WikiFCD Data Import</title>
      <p>When FCTs are published using the CSV format, which is machine-readable, we are able to provide them
to our software pipeline, which is built on WikidataIntegrator2. Many researchers and practitioners
use WikidataIntegrator to work with data pipelines for Wikidata and for other Wikibases [13].</p>
      <p>When FCTs are published using the PDF format, which is not machine-readable, we need to extract
the data and convert it to CSV. The PDF family of file formats has a long history, and its developers
have improved the format in different ways with each version. Software
developers have created a wide variety of tools for working with different versions of PDF, with the
number of available tools increasing for more recent versions of the format.</p>
    </sec>
    <sec id="sec-6">
      <title>4. LLM-Powered Data Extraction</title>
      <p>The PDF file for the 1957 Cameroon FCT was originally created on Saturday, February 10, 2001.
According to the metadata for the file, the version of this PDF is 1.3. This PDF file contains a textual
introduction as well as several pages of food composition values for food items presented in a table.</p>
      <p>To extract food composition data from the scanned PDF, we initially tested several Python libraries
including pyPDF3, pdfplumber4, and pdf2image+OCR5. However, we found that these libraries were
unable to extract the data successfully due to the structure and quality of the source file. The PDF
contained scanned images with tabular data, which lacked embedded text layers and often had faint or
distorted lines, making traditional parsing unreliable for our specific PDF file.</p>
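      <p>A minimal version of the check this kind of probing amounts to can be sketched as follows. The function name and threshold are our own illustration, and the commented pdfplumber call stands in for the library tests described above; this is not the exact script we ran.</p>

```python
from typing import Optional

def needs_image_fallback(page_text: Optional[str], min_chars: int = 20) -> bool:
    """Heuristic: a scanned page with no embedded text layer yields little or
    no extractable text, suggesting image-based extraction is needed."""
    return len((page_text or "").strip()) < min_chars

# With pdfplumber, the probe would look roughly like:
#   import pdfplumber
#   with pdfplumber.open("cameroon_1957.pdf") as pdf:
#       scanned = all(needs_image_fallback(p.extract_text()) for p in pdf.pages)
```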
      <p>We subsequently decided to try GPT-4-Turbo with vision capabilities via the OpenAI API to extract
the tabular data. Because the food composition tables spanned only a few pages, we opted to manually
capture screenshots of relevant sections and save them in the Portable Network Graphics (PNG)
format. This approach allowed for precise targeting of the visual content while minimizing noise from
surrounding text or headers. It also simplified the prompting process by allowing direct alignment
between the image content and the extraction instructions.</p>
      <p>We used a simple Python workflow in which the images were read as binary objects and encoded
in base64 before being passed to GPT-4-Turbo via the beta.chat.completions.parse method. We used
a structured prompt to instruct the LLM to extract the tabular content of the image. We also included
specific instructions for how to handle missing data, represented by the ’-’ symbol. We instructed
the model to parse the output into a Pydantic6 model (TableData) and convert it into JSON. We then
converted the structured JSON to a CSV file.</p>
      <sec id="sec-6-1">
        <p>2: https://pypi.org/project/wikidataintegrator/ 3: https://pypi.org/project/pypdf/ 4: https://pypi.org/project/pdfplumber/ 5: https://pypi.org/project/pdf2image/</p>
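        <p>The workflow described above can be sketched as follows. The dataclass is a stand-in for the Pydantic TableData model so the sketch runs without third-party packages, the field names and prompt text are illustrative assumptions, and the commented client call only approximates the beta.chat.completions.parse request.</p>

```python
import base64
import csv
import io
from dataclasses import dataclass

@dataclass
class TableData:
    # Stand-in for the Pydantic `TableData` model; field names are illustrative.
    headers: list
    rows: list  # one list of cell strings per food item; '-' marks missing data

def encode_png(path: str) -> str:
    """Read an image as a binary object and base64-encode it for the API call."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

def table_to_csv(table: TableData) -> str:
    """Convert the structured output to CSV, preserving '-' for missing data."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(table.headers)
    writer.writerows(table.rows)
    return buf.getvalue()

# The extraction request, roughly (requires the openai package and an API key):
#   completion = client.beta.chat.completions.parse(
#       model="gpt-4-turbo",
#       messages=[{"role": "user", "content": [
#           {"type": "text",
#            "text": "Extract the table; use '-' for missing values."},
#           {"type": "image_url",
#            "image_url": {"url": "data:image/png;base64," + encode_png("p1.png")}},
#       ]}],
#       response_format=TableData,  # a Pydantic model in the actual workflow
#   )
```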
        <p>In Figure 1, we see a section of the original PDF of the FCT, in which the food items are labeled in French.
To improve the usability of these data, we instructed the language model to translate the food item labels from
French to English so that we could provide English-language labels for these food items in the WikiFCD
system. The LLM generally produced accurate translations; however, in one instance it returned a
different French label than what appeared in the original PDF file. As a result, we implemented manual
verification to ensure consistency and accuracy of the final output data.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5. Human Review of Data Quality</title>
      <p>To validate the method, we performed a manual review of the results generated by the LLM and
compared the values with the numbers in the original food composition table. In total we found eighty-five
food items in the Cameroon FCT. For each food item, eleven nutrients are described. Of the
nine hundred thirty-five food composition values, the LLM provided thirteen incorrect values. The LLM
also provided two inaccurate English labels.</p>
      <p>We found that this LLM-powered approach generally worked well, except when the values in the
PDF were illegible due to the low quality of the scan, which cut off some numerals. We observed that for
food items where the LLM reported only one or two incorrect component values, the incorrect values
seemed to be duplicated from a nearby value. For example, for the food item labeled ‘Courge’ seen in
Figure 2, the value the LLM returned for ‘Formic Insoluble matter’ was incorrect. The LLM returned a value of
‘0.6’, which may have been duplicated from the ‘Ash’ value of ‘0.6’.</p>
      <p>In addition to numerals, the LLM accurately reported the dash character ’-’ used in the original file to
indicate ’no value’, as seen in Figure 1. However, there were a few instances where the ’-’ was reported as ’0’
(see the food item labeled ’Canne à sucre’).</p>
      <p>When the person who created the digital scan of the print version of the FCT held the
pages open, some of the values in the table were cut off and did not come out clearly. In Figure 3 the values for
’Avocat’ are cut off but still legible. The values for ’Barbadine’ are not legible. Unsurprisingly, the LLM
returned multiple incorrect values for ’Barbadine’.</p>
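      <p>The cell-by-cell comparison we performed by hand can also be expressed programmatically. The sketch below is our own illustration rather than a script we used: it flags every cell where the LLM output diverges from a manually keyed ground-truth table.</p>

```python
def diff_tables(llm_rows, truth_rows):
    """Return (row, column, llm_value, true_value) for every mismatched cell
    between two equally shaped tables of strings."""
    mismatches = []
    for i, (llm_row, true_row) in enumerate(zip(llm_rows, truth_rows)):
        for j, (got, want) in enumerate(zip(llm_row, true_row)):
            if got.strip() != want.strip():
                mismatches.append((i, j, got, want))
    return mismatches

# Example: the duplicated-neighbour error pattern we saw for 'Courge'
# (illustrative values, not the table's actual row).
truth = [["Courge", "0.6", "1.2"]]
llm = [["Courge", "0.6", "0.6"]]
errors = diff_tables(llm, truth)  # → [(0, 2, '0.6', '1.2')]
```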
    </sec>
    <sec id="sec-8">
      <title>6. Discussion</title>
      <p>We tested this LLM-based data extraction approach to determine whether it could be a viable strategy for
working with data in legacy versions of PDF. The LLM performed well translating the labels of the food
items and the column names into English. The LLM reported accurate values for most nutrients
for most food items, but it also reported multiple inaccurate values. The number of inaccurate values the
LLM reported indicates that this strategy would need to be followed by manual review of the data.</p>
      <p>Manual review of even a small dataset like that of the Cameroon FCT is time-intensive and requires
precision. Comparing the time it would take a human to manually generate a spreadsheet of values
from the PDF file with the time it takes to manually review the LLM output would be an interesting
experiment. It is possible that it would be faster to manually create a CSV. We are also aware of concerns
regarding the energy consumption associated with LLM integration into software systems [14]. When
we consider the resources required to generate a set of values that still require manual verification, it becomes
more difficult to justify this approach.</p>
      <p>We found the data in rows one through forty-two to be accurate. The accuracy of the LLM declined
after row forty-three. In future work we would like to test additional strategies to improve the LLM’s
performance. One strategy would be to create one PNG image for each page of
the PDF, to test whether providing smaller sections of the dataset to the LLM improves performance. We
could also test whether structuring the prompt to work row by row improves performance.</p>
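      <p>The row-by-row strategy could be prototyped by issuing one narrowly scoped prompt per table row; the template below is a hypothetical illustration of the idea, not a prompt we have tested.</p>

```python
# Hypothetical per-row prompt template; the column count (11) matches the
# eleven nutrients per food item in the Cameroon FCT.
ROW_PROMPT = (
    "From the attached table image, extract only row {row}. "
    "Return exactly 11 component values in column order; "
    "use '-' where no value is printed."
)

def row_prompts(n_rows, start=1):
    """One targeted prompt per table row, to narrow the model's attention."""
    return [ROW_PROMPT.format(row=r) for r in range(start, start + n_rows)]
```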
    </sec>
    <sec id="sec-9">
      <title>7. Conclusion</title>
      <p>Identifying tools that can extract data from legacy versions of PDF is challenging. Because some of
the food composition tables we would like to import into WikiFCD are in legacy versions of PDF,
we have an interest in automating data extraction from these files. We tested a workflow in which
we converted a PDF file to a PNG and asked GPT-4-Turbo to extract the food composition data from
the image. We manually reviewed each value reported by GPT-4-Turbo for accuracy by comparison
with the values in the PDF file. While we did find that the LLM reported some inaccurate values, we
believe this approach is still worth consideration. With the rapid rate of improvements in performance
of LLMs it is possible that their capacity for this type of data extraction could improve in the future.</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgments</title>
      <p>We thank the Joint Food Ontology Working Group for productive discussions about FoodOn and data
related to food. We thank the Wikidata community for continuing to improve the Wikidata knowledge
base.</p>
    </sec>
    <sec id="sec-11">
      <title>Declaration on Generative AI</title>
      <p>We used OpenAI GPT-4-Turbo to extract text from a PDF file as the basis for the process that we then
manually reviewed. We also asked OpenAI GPT-4-Turbo to translate the labels for the food items from
French into English. After using these tools, the authors reviewed and edited the content as needed and
take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Allegrezza</surname>
          </string-name>
          ,
          <article-title>Addressing The Problem of File Formats Obsolescence: Italian Guidelines on File Format Conversion for the Long-Term Preservation of Electronic Records</article-title>
          , in: iPRES 2024 Papers - International
          <source>Conference on Digital Preservation</source>
          ,
          <year>2024</year>
          . URL: https://ipres2024.pubpub.org/pub/oswmkgvc.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>A. S.</given-names> <surname>Corrêa</surname></string-name>
          ,
          <string-name><given-names>P.-O.</given-names> <surname>Zander</surname></string-name>
          ,
          <article-title>Unleashing tabular content to open data: A survey on pdf table extraction methods and tools</article-title>
          , in:
          <source>Proceedings of the 18th Annual International Conference on Digital Government Research</source>
          , dg.o '17, Association for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          , pp.
          <fpage>54</fpage>
          -
          <lpage>63</lpage>
          . URL: https://doi.org/10.1145/3085228.3085278. doi:10.1145/3085228.3085278.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Ockerbloom</surname>
          </string-name>
          ,
          <article-title>Archiving and preserving pdf files</article-title>
          ,
          <source>RLG DigiNews 5</source>
          (
          <year>2001</year>
          )
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          , Taskbench:
          <article-title>Benchmarking large language models for task automation</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>37</volume>
          (
          <year>2024</year>
          )
          <fpage>4540</fpage>
          -
          <lpage>4574</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Konet</surname>
          </string-name>
          ,
          <string-name><given-names>I.</given-names> <surname>Thomas</surname></string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gartlehner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kahwati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hilscher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kugley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Crotty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Viswanathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chew</surname>
          </string-name>
          ,
          <article-title>Performance of two large language models for data extraction in evidence synthesis</article-title>
          ,
          <source>Research synthesis methods 15</source>
          (
          <year>2024</year>
          )
          <fpage>818</fpage>
          -
          <lpage>824</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Welleck</surname>
          </string-name>
          ,
          <string-name><given-names>I.</given-names> <surname>Kulikov</surname></string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <article-title>Neural text generation with unlikelihood training</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          . URL: https://doi.org/10.1145/3571730.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>L.</given-names> <surname>Huang</surname></string-name>
          ,
          <string-name><given-names>W.</given-names> <surname>Yu</surname></string-name>
          ,
          <string-name><given-names>W.</given-names> <surname>Ma</surname></string-name>
          ,
          <string-name><given-names>W.</given-names> <surname>Zhong</surname></string-name>
          ,
          <string-name><given-names>Z.</given-names> <surname>Feng</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Wang</surname></string-name>
          ,
          <string-name><given-names>Q.</given-names> <surname>Chen</surname></string-name>
          ,
          <string-name><given-names>W.</given-names> <surname>Peng</surname></string-name>
          ,
          <string-name><given-names>X.</given-names> <surname>Feng</surname></string-name>
          ,
          <string-name><given-names>B.</given-names> <surname>Qin</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Liu</surname></string-name>
          ,
          <article-title>A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions</article-title>
          ,
          <year>2023</year>
          . arXiv:2311.05232.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Thornton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Seals-Nutt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matsuzaki</surname>
          </string-name>
          ,
          <article-title>Introducing wikifcd: Many food composition tables in a single knowledge base</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>2969</volume>
          ,
          <string-name>
            <surname>CEUR-WS</surname>
          </string-name>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>D. M.</given-names> <surname>Dooley</surname></string-name>
          ,
          <string-name><given-names>E. J.</given-names> <surname>Griffiths</surname></string-name>
          ,
          <string-name><given-names>G. S.</given-names> <surname>Gosal</surname></string-name>
          ,
          <string-name><given-names>P. L.</given-names> <surname>Buttigieg</surname></string-name>
          ,
          <string-name><given-names>R.</given-names> <surname>Hoehndorf</surname></string-name>
          ,
          <string-name><given-names>M. C.</given-names> <surname>Lange</surname></string-name>
          ,
          <string-name><given-names>L. M.</given-names> <surname>Schriml</surname></string-name>
          ,
          <string-name><given-names>F. S.</given-names> <surname>Brinkman</surname></string-name>
          ,
          <string-name><given-names>W. W.</given-names> <surname>Hsiao</surname></string-name>
          ,
          <article-title>Foodon: a harmonized food ontology to increase global food traceability, quality control and data integration</article-title>
          ,
          <source>npj Science of Food 2</source>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>