A Large Scale Corpus of Food Composition Tables
Azanzi Jiomekong1,*, Cosmas Etoga1, Brice Foko1, Vadel Tsague1, Martins Folefac2,
Sorel Kana2, Mouhamadou Mansour Sow3 and Gaoussou Camara4
1 Department of Computer Science, University of Yaounde I, Yaounde, Cameroon
2 neuralearn.ai, Cameroon
3 Pôle Science et Technologie du Numérique, Université Virtuelle du Sénégal, Dakar, Sénégal
4 Unité de Formation et de Recherche en Sciences Appliquées et des TIC, Université Alioune Diop de Bambey, Bambey, Sénégal


Abstract
In this paper, we introduce TSOTSACorpus, a large-scale corpus of Food Composition Tables composed of
more than 16,000 tables collected from scientific and Zenodo repositories. Through continuing maintenance
and curation, we aim to grow this corpus so that it provides good-quality, up-to-date information on the
foods of the world and their cultural heritage. Compared to related datasets (INFOODS, LanguaL), we found
that this corpus contains more information. In addition, it can be processed by both humans and machines.

Keywords
Food Information Engineering, Food Composition Database, Food Composition Table, Tabular data


1. Introduction
In recent years, many Food Composition Tables (FCT) [1] have been published in several formats
(PDF, CSV, XLSX). However, these data are scattered across the Internet, which makes them
difficult to exploit: one has to search for the data, retrieve them and extract information from
them. In addition, many FCT, whether at the country, regional or worldwide level, suffer from
several problems: (1) they are static databases, sometimes published in PDF, XLSX, CSV or ODT
formats; (2) their data are outdated (a comparison of several FCT [2] showed that FCT should be
updated regularly because eating habits change over time); (3) their data are not harmonized.
SemTab@ISWC 2022, October 23–27, 2022, Hangzhou, China (Virtual)
Email: fidel.jiomekong@facsciences-uy1.cm (A. Jiomekong); etogacosmas@gmail.com (C. Etoga);
fokobrice3@gmail.com (B. Foko); vadel.tsague@gmail.com (V. Tsague); martinsderick99@gmail.com (M. Folefac);
jsorelkana@gmail.com (S. Kana); mouhamadoum.sow@uvs.edu.sn (M. M. Sow); gaoussou.camara@uadb.edu.sn
(G. Camara)
URL: https://sites.google.com/facsciences-uy1.cm/azanzijiomekong (A. Jiomekong)
ORCID: 0000-0002-0877-7063 (A. Jiomekong)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

   In this paper, we propose to extract, unify and link all Food Composition Tables that are
published worldwide and accessible either as scientific publications or under a free and/or
open-source license, and to gather them in a single centralized corpus of FCT. One way to
achieve this is to make each dataset available in a machine-readable format, which can be done
by putting the tables in CSV format and enriching them with metadata and data on their
provenance. To this end, knowledge is automatically extracted from scientific literature and
Zenodo repositories, then curated and annotated using biomedical ontologies. The work presented
in this paper is ongoing, and the next section presents the current version of TSOTSACorpus.
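The machine-readable format mentioned in the introduction (CSV tables enriched with metadata and provenance) could be organized as one CSV file per table with a small JSON sidecar file next to it. The sketch below is only an illustration under that assumption: the function name, the `.meta.json` suffix and the metadata field names are hypothetical and are not taken from TSOTSACorpus.

```python
import csv
import json
from pathlib import Path


def write_table_with_provenance(out_dir, table_id, header, rows, provenance):
    """Write one table as CSV plus a JSON sidecar recording its provenance.

    The layout (<table_id>.csv and <table_id>.meta.json) and the metadata
    keys are illustrative assumptions, not the corpus's actual schema.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # The table itself: a header row followed by the data rows.
    csv_path = out_dir / f"{table_id}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

    # The sidecar: basic metadata plus a free-form provenance record.
    meta = {
        "table_id": table_id,
        "columns": list(header),
        "n_rows": len(rows),
        "provenance": provenance,
    }
    meta_path = out_dir / f"{table_id}.meta.json"
    meta_path.write_text(
        json.dumps(meta, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return csv_path, meta_path
```

A call such as `write_table_with_provenance("corpus", "table_0001", ["food", "component", "value"], rows, {"source_type": "pdf", "source_url": "..."})` would then keep each table and the record of where it came from side by side, so that a table can always be traced back to its source.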


2. TSOTSACorpus: a large scale corpus of FCT
TSOTSACorpus as a whole is licensed under a Creative Commons Attribution-ShareAlike 4.0
International License. The development version is available for download on Google Drive1
and will be published on Zenodo as soon as the curation and annotation process is finished.
The source code we use to extract tables from PDF documents is available on GitHub2 and
Google Colaboratory3. A video showing how we automatically extract tables from PDFs is also
available4. In addition to the tables extracted from scientific papers, we also extract
datasets from zenodo.org; the source code is available on GitHub5.
   The construction of TSOTSACorpus is an extensive effort of semi-automatic collection,
extraction, curation and annotation of food data. Currently, more than 5,000 PDF documents
acquired from scientific repositories have been processed and more than 11,000 tables have been
extracted from them. To this end, we used Neural Network (NN) algorithms and followed three
steps: table detection, text detection and text recognition. For the implementation, we rely on
PaddleOCR models, which were trained with the Paddle framework in the Python programming
language. In addition, the Zenodo API6 was used to automatically extract FCT datasets; more
than 5,000 tables have been extracted so far.
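As an illustration of the Zenodo step described above, the sketch below builds a search URL for the records endpoint cited in footnote 6 and picks tabular files out of a search response. It is not the authors' downloader: the function names are hypothetical, and the response layout it reads (`hits.hits[].files[]` with `key` and `links.self`) is an assumption about the JSON returned by the API that should be checked against the current Zenodo documentation. No network request is made here.

```python
from urllib.parse import urlencode

# Endpoint cited in the paper (footnote 6).
ZENODO_RECORDS_URL = "https://zenodo.org/api/records/"


def build_search_url(query, page=1, size=25):
    """Build a search URL for the Zenodo records endpoint."""
    params = {"q": query, "page": page, "size": size}
    return ZENODO_RECORDS_URL + "?" + urlencode(params)


def tabular_file_links(response_json, extensions=(".csv", ".xlsx")):
    """Collect (record id, file name, download link) for tabular files.

    Assumes the response is a dict shaped like
    {"hits": {"hits": [{"id": ..., "files": [{"key": ..., "links": {"self": ...}}]}]}}.
    """
    links = []
    for record in response_json.get("hits", {}).get("hits", []):
        for f in record.get("files", []):
            name = f.get("key", "")
            if name.lower().endswith(tuple(extensions)):
                links.append((record.get("id"), name, f.get("links", {}).get("self")))
    return links
```

The JSON obtained by fetching `build_search_url("food composition table")` (for example with `urllib.request`) could then be passed to `tabular_file_links` to decide which files to download.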
   The current version of the corpus is composed of more than 16,000 food tables describing
more than 60,000 foods, 200 food groups and 800 food components. It covers the food consumed
in more than 123 countries from 1987 to 2022. At this stage of the work, the extraction of
additional tables, the curation and the annotation are in progress. The curation consists of
linking each table to the knowledge source from which it was built, identifying and deleting
duplicate knowledge sources, and arranging the data in the CSV files so that they exactly match
the tables in the PDF documents. The annotation is being done with biomedical ontologies
identified using ontobee.org (FoodOn, SNOMED CT and NCIT are currently used). We also plan to
annotate the tables with the Wikidata and DBpedia knowledge graphs. We expect to produce the
first curated and annotated version, composed of more than 20,000 tables, during the first
quarter of 2023 so that it can be used in future editions of the SemTab challenge7.
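One of the curation steps described above, identifying and deleting duplicate knowledge sources, can be sketched as follows. The paper does not say how duplicates are detected, so this example assumes the simplest criterion: two sources are duplicates when their files are byte-for-byte identical, which is checked with a SHA-256 fingerprint. The function names are hypothetical, and near-duplicates (for example, the same paper downloaded from two publishers) would not be caught by this approach.

```python
import hashlib


def content_fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying the exact file content."""
    return hashlib.sha256(data).hexdigest()


def deduplicate_sources(sources):
    """Keep the first occurrence of each distinct file content.

    `sources` is an iterable of (source_id, raw_bytes) pairs.
    Returns (kept_ids, duplicate_of), where duplicate_of maps each
    dropped source id to the id of the source that was kept instead.
    """
    first_seen = {}    # fingerprint -> id of the first source with that content
    kept_ids = []
    duplicate_of = {}
    for source_id, data in sources:
        fingerprint = content_fingerprint(data)
        if fingerprint in first_seen:
            duplicate_of[source_id] = first_seen[fingerprint]
        else:
            first_seen[fingerprint] = source_id
            kept_ids.append(source_id)
    return kept_ids, duplicate_of
```

Keeping the `duplicate_of` mapping, rather than silently dropping files, means that tables already linked to a removed source can be re-linked to the source that was kept.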


Acknowledgment
We are grateful to the SemTab organizers for giving us the opportunity to share this work
with the community. We are also grateful to Vinsight and neuralearn.ai for the training support.

1 https://drive.google.com/drive/u/1/folders/1U2dEye_f02MhHOkmowuh2UyAKX60Ix39
2 https://github.com/Neuralearn/pdf-to-excel
3 https://colab.research.google.com/drive/1gOPBCVO9VtKcoIewXyr_6nNoxo1Bkqbz
4 www.youtube.com/watch?v=HZh31OGiQRQ
5 https://github.com/iconoyuri/zenodo-file-downloader
6 https://zenodo.org/api/records/
7 https://www.cs.ox.ac.uk/isg/challenges/sem-tab/
References
[1] M. Khalis, V. Garcia-Larsen, H. Charaka, M. M. S. Deoula, K. El Kinany, A. Benslimane,
    B. Charbotel, A. S. Soliman, I. Huybrechts, G. A. Soliman, et al., Update of the Moroccan
    food composition tables: Towards a more reliable tool for nutrition research, Journal of
    Food Composition and Analysis 87 (2020) 103397.
[2] A. Jiomekong, Comparison of food composition tables/databases, 2022.
    URL: https://orkg.org/comparison/R206121/.