<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Large Scale Corpus of Food Composition Tables</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Azanzi Jiomekong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cosmas Etoga</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brice Foko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vadel Tsague</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martins Folefac</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sorel Kana</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mouhamadou Mansour Sow</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gaoussou Camara</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Yaounde I</institution>
          ,
          <addr-line>Yaounde</addr-line>
          ,
          <country country="CM">Cameroon</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Pôle Science et Technologie du Numérique, Université Virtuelle du Sénégal</institution>
          ,
          <addr-line>Dakar</addr-line>
          ,
          <country country="SN">Sénégal</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Unité de Formation et de Recherche en Sciences Appliquées et des TIC, Université Alioune Diop de Bambey</institution>
          ,
          <addr-line>Bambey</addr-line>
          ,
          <country country="SN">Sénégal</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>neuralearn.ai</institution>
          ,
          <country country="CM">Cameroon</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we introduce TSOTSACorpus, a large-scale corpus of Food Composition Tables (FCT) comprising more than 16,000 tables collected from scientific repositories and Zenodo. Through continued maintenance and curation, we aim to grow this corpus so that it provides high-quality, up-to-date information on foods worldwide, including their cultural heritage. Compared to related datasets (INFOODS, LanguaL), we found that this corpus contains more information. In addition, it can be processed by both humans and machines.</p>
      </abstract>
      <kwd-group>
        <kwd>Food Information Engineering</kwd>
        <kwd>Food Composition Database</kwd>
        <kwd>Food Composition Table</kwd>
        <kwd>Tabular data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>and annotated using biomedical ontologies. The work presented in this paper is ongoing,
and the next section presents the current version of TSOTSACorpus.</p>
    </sec>
    <sec id="sec-2">
      <title>2. TSOTSACorpus: a large scale corpus of FCT</title>
      <p>TSOTSACorpus is licensed under a Creative Commons Attribution-ShareAlike 4.0
International License. The development version is available for download on Google Drive<sup>1</sup>
and will be published on Zenodo as soon as the curation and annotation process is finished.
The source code we use to extract tables from PDF documents is available
on GitHub<sup>2</sup> and Google Colaboratory<sup>3</sup>. A video showing how we automatically extract tables
from PDFs is also available<sup>4</sup>. Beyond the tables extracted from scientific papers, we also
extract datasets from zenodo.org; the source code is available on GitHub<sup>5</sup>.</p>
      <p>Building TSOTSACorpus involves extensive semi-automatic collection, extraction,
curation and annotation of food data. So far, more than 5,000 PDF documents acquired
from scientific repositories have been processed and more than 11,000 tables extracted from them.
To this end, we used neural network (NN) algorithms in three steps: table detection,
text detection and text recognition. For the implementation, we rely on PaddleOCR,
whose models are trained with the Paddle framework, in the Python programming language. In
addition, the Zenodo API<sup>6</sup> was used to automatically extract FCT datasets; more than 5,000
tables have been extracted this way so far.</p>
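      <p>The Zenodo harvesting step could be sketched as follows. This is a minimal illustration, not the authors' downloader: the query parameters (q, page, size), the response layout (hits, files, key, links), the helper names and the sample payload are all assumptions made for the example, and only the base endpoint comes from the paper.</p>
      <preformat>
```python
from urllib.parse import urlencode

# Base endpoint as given in the paper; the query parameters used below
# (q, page, size) are assumptions about the public search interface.
ZENODO_RECORDS_URL = "https://zenodo.org/api/records/"

# File extensions treated as tabular data in this sketch (an assumption).
TABULAR_EXTENSIONS = (".csv", ".tsv", ".xls", ".xlsx")


def build_search_url(query, page=1, size=25):
    """Return the URL for one page of a record search."""
    params = {"q": query, "page": page, "size": size}
    return ZENODO_RECORDS_URL + "?" + urlencode(params)


def tabular_file_links(response):
    """Collect (record id, file name, download link) for tabular files.

    `response` is the parsed JSON of a search request. The layout
    hits / hits / files / key / links.self is assumed, not confirmed
    by the paper.
    """
    found = []
    for record in response.get("hits", {}).get("hits", []):
        for entry in record.get("files", []):
            name = entry.get("key", "")
            if name.lower().endswith(TABULAR_EXTENSIONS):
                link = entry.get("links", {}).get("self")
                found.append((record.get("id"), name, link))
    return found


# Hypothetical response fragment, for illustration only (not a real record).
sample_response = {
    "hits": {
        "hits": [
            {
                "id": 1,
                "files": [
                    {"key": "table.csv",
                     "links": {"self": "https://example.org/table.csv"}},
                    {"key": "readme.pdf",
                     "links": {"self": "https://example.org/readme.pdf"}},
                ],
            }
        ]
    }
}

print(build_search_url("food composition table"))
print(tabular_file_links(sample_response))
```
      </preformat>
      <p>Keeping URL construction and response filtering separate from the network call makes the filtering logic easy to check offline before a full harvest is run.</p>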
      <p>The current version of the corpus is composed of more than 16,000 food tables describing
more than 60,000 foods, 200 food groups, and 800 food components. It covers food consumed
in more than 123 countries from 1987 to 2022. At this stage, the extraction of
additional tables and the curation and annotation process are in progress. Curation consists of
linking each table to the knowledge source from which it was built, identifying and deleting
duplicate knowledge sources, and arranging the data in the CSV files so that they match the
tables in the source PDFs exactly. Annotation is being done with biomedical ontologies identified using
ontobee.org; FoodOn, SNOMED CT and NCIT are currently used. We also plan to
annotate the tables with the Wikidata and DBpedia knowledge graphs. We expect to produce
the first curated and annotated version, composed of more than 20,000 tables, during the first
quarter of 2023 so that it can be used in future editions of the SemTab challenge<sup>7</sup>.</p>
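      <p>Two of the curation steps above, linking each table to its source and removing duplicates, can be illustrated with a short sketch. The paper does not say how duplicates are detected, so the rule used here (two tables are duplicates when their cells match after trimming whitespace and lowercasing) is an assumption, as are the function names and the input layout.</p>
      <preformat>
```python
import csv
import hashlib
import io


def table_fingerprint(csv_text):
    """Hash a table after trimming and lowercasing every cell, so that
    copies differing only in spacing or letter case get the same hash."""
    rows = csv.reader(io.StringIO(csv_text))
    normalized = [[cell.strip().lower() for cell in row] for row in rows]
    # Ignore rows that contain no data at all.
    normalized = [row for row in normalized if any(row)]
    payload = "\n".join("\t".join(row) for row in normalized)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(tables):
    """Keep the first table seen for each fingerprint.

    `tables` maps a table name to a (source, csv_text) pair, so every
    kept table stays linked to the document it was extracted from.
    Returns the kept tables and a list of (duplicate, original) names.
    """
    kept = {}
    first_seen = {}
    duplicates = []
    for name, (source, csv_text) in tables.items():
        digest = table_fingerprint(csv_text)
        if digest in first_seen:
            duplicates.append((name, first_seen[digest]))
        else:
            first_seen[digest] = name
            kept[name] = (source, csv_text)
    return kept, duplicates
```
      </preformat>
      <p>A content hash only catches exact copies after normalization; tables that were re-typed or re-ordered between sources would need a fuzzier comparison.</p>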
    </sec>
    <sec id="sec-3">
      <title>Acknowledgment</title>
      <p>We are grateful to SemTab organizers for having given us the opportunity to share this work
with the community. We are also grateful to Vinsight and neuralearn.ai for the training support.</p>
      <p><sup>1</sup> https://drive.google.com/drive/u/1/folders/1U2dEye_f02MhHOkmowuh2UyAKX60Ix39</p>
      <p><sup>2</sup> https://github.com/Neuralearn/pdf-to-excel</p>
      <p><sup>3</sup> https://colab.research.google.com/drive/1gOPBCVO9VtKcoIewXyr_6nNoxo1Bkqbz</p>
      <p><sup>4</sup> www.youtube.com/watch?v=HZh31OGiQRQ</p>
      <p><sup>5</sup> https://github.com/iconoyuri/zenodo-file-downloader</p>
      <p><sup>6</sup> https://zenodo.org/api/records/</p>
      <p><sup>7</sup> https://www.cs.ox.ac.uk/isg/challenges/sem-tab/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Khalis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Garcia-Larsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Charaka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M. S.</given-names>
            <surname>Deoula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>El Kinany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Benslimane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Charbotel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Soliman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Huybrechts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Soliman</surname>
          </string-name>
          , et al.,
          <article-title>Update of the Moroccan food composition tables: Towards a more reliable tool for nutrition research</article-title>
          ,
          <source>Journal of Food Composition and Analysis</source>
          <volume>87</volume>
          (
          <year>2020</year>
          )
          <fpage>103397</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jiomekong</surname>
          </string-name>
          , Comparison of food composition tables/databases,
          <year>2022</year>
          . URL: https://orkg.org/comparison/R206121/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>