<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Large Scale Corpus of Food Composition Tables</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Azanzi</forename><surname>Jiomekong</surname></persName>
							<email>fidel.jiomekong@facsciences-uy1.cm</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Yaounde I</orgName>
								<address>
									<settlement>Yaounde</settlement>
									<country key="CM">Cameroon</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Cosmas</forename><surname>Etoga</surname></persName>
							<email>etogacosmas@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Yaounde I</orgName>
								<address>
									<settlement>Yaounde</settlement>
									<country key="CM">Cameroon</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Brice</forename><surname>Foko</surname></persName>
							<email>fokobrice3@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Yaounde I</orgName>
								<address>
									<settlement>Yaounde</settlement>
									<country key="CM">Cameroon</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Vadel</forename><surname>Tsague</surname></persName>
							<email>vadel.tsague@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Yaounde I</orgName>
								<address>
									<settlement>Yaounde</settlement>
									<country key="CM">Cameroon</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Martins</forename><surname>Folefac</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">neuralearn.ai</orgName>
								<address>
									<country key="CM">Cameroon</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sorel</forename><surname>Kana</surname></persName>
							<email>jsorelkana@gmail.com</email>
							<affiliation key="aff1">
								<orgName type="department">neuralearn.ai</orgName>
								<address>
									<country key="CM">Cameroon</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mouhamadou</forename><forename type="middle">Mansour</forename><surname>Sow</surname></persName>
							<email>mouhamadoum.sow@uvs.edu.sn</email>
							<affiliation key="aff2">
								<orgName type="department">Pôle Science et Technologie du Numérique</orgName>
								<orgName type="institution">Université Virtuelle du Sénégal</orgName>
								<address>
									<settlement>Dakar</settlement>
									<country key="SN">Sénégal</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gaoussou</forename><surname>Camara</surname></persName>
							<email>gaoussou.camara@uadb.edu.sn</email>
							<affiliation key="aff3">
								<orgName type="laboratory">Unité de Formation et de Recherche en Sciences Appliquées et des TIC</orgName>
								<orgName type="institution">Université Alioune Diop de Bambey</orgName>
								<address>
									<settlement>Bambey</settlement>
									<country key="SN">Sénégal</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Large Scale Corpus of Food Composition Tables</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">AE732DD272000FDB1E3EFE926E81A5B1</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T20:39+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Food Information Engineering</term>
					<term>Food Composition Database</term>
					<term>Food Composition Table</term>
					<term>Tabular data</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we introduce TSOTSACorpus, a large-scale corpus of Food Composition Tables comprising more than 16,000 tables collected from scientific repositories and Zenodo. Through continuing maintenance and curation, we aim to grow this corpus so that it provides good-quality, up-to-date information on foods worldwide, including their cultural heritage. Compared to related datasets (INFOODS, LanguaL), we found that this corpus contains more information. In addition, it can be processed by both humans and machines.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In recent years, many Food Composition Tables (FCT) <ref type="bibr" target="#b0">[1]</ref> have been published in several formats (PDF, CSV, XLSX). However, these data are scattered across the Internet, which makes them difficult to exploit: one has to search for the data, retrieve them and extract information from them. Moreover, many FCT, whether at the country, regional or worldwide level, suffer from several problems: (1) static databases, sometimes in PDF or in XLSX, CSV or ODT formats; (2) outdated data (a comparison of several FCT <ref type="bibr" target="#b1">[2]</ref> showed that FCT should be updated regularly because eating habits change over time); (3) data that are not harmonized.</p><p>In this paper, we propose to extract, unify and link all Food Composition Tables published worldwide and accessible either as scientific publications or under a free and/or open source license, into a robust centralized corpus of FCT. One way to achieve this is to make each dataset accessible in a machine-readable format, which can be done by putting these tables in CSV format and enriching them with metadata and data on their provenance. To this end, knowledge is automatically extracted from the scientific literature and Zenodo repositories, then curated and annotated using biomedical ontologies. The work presented in this paper is ongoing; the next section presents the current version of TSOTSACorpus.</p></div>
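<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the machine-readable format described above, the following minimal sketch stores one table as a CSV file together with a JSON sidecar holding its metadata and provenance. The function name, the file layout and every metadata field are hypothetical choices made for this example; they are not the actual TSOTSACorpus schema.</p>
<p>
```python
import csv
import hashlib
import json
from pathlib import Path


def save_table_with_provenance(rows, header, out_dir, table_id, source):
    """Write one table as CSV plus a JSON sidecar describing its provenance.

    All names and metadata fields are hypothetical; they only illustrate
    the idea of pairing each CSV table with provenance information.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Store the table itself in CSV, header first, then the data rows.
    csv_path = out_dir / f"{table_id}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)

    # A content hash lets a later curation step spot byte-identical copies.
    digest = hashlib.sha256(csv_path.read_bytes()).hexdigest()

    # The sidecar links the table back to the document it came from.
    metadata = {
        "table_id": table_id,
        "source": source,
        "columns": list(header),
        "row_count": len(rows),
        "sha256": digest,
    }
    meta_path = out_dir / f"{table_id}.meta.json"
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return csv_path, meta_path
```
</p>
<p>Here the source argument is assumed to be a small dictionary, for instance one holding the title and URL of the publication or Zenodo record from which the table was extracted. Keeping provenance in a separate file leaves the CSV itself untouched, so it can still be compared cell by cell with the original table.</p></div>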
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">TSOTSACorpus: a large scale corpus of FCT</head><p>TSOTSACorpus is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The development version is available for download on Google Drive and will be published on Zenodo as soon as the curation and annotation process is finished. The source code we use to extract tables from PDF documents is available on GitHub and Google Colaboratory. A video showing how we automatically extract tables from PDFs is also available. Beyond the tables extracted from scientific papers, we also extract datasets from zenodo.org; the source code for this is available on GitHub.</p><p>Building TSOTSACorpus involves extensive semi-automatic collection, extraction, curation and annotation of food data. To date, more than 5,000 PDF documents acquired from scientific repositories have been processed and more than 11,000 tables extracted from them. To this end, we used neural network (NN) algorithms in three steps: table detection, text detection and text recognition. For the implementation, we rely on PaddleOCR, whose models were trained with the Paddle framework, using the Python programming language. In addition, the Zenodo API was used to automatically extract FCT datasets; more than 5,000 tables have been extracted so far.</p><p>The current version of the corpus is composed of more than 16,000 food tables, describing more than 60,000 foods, 200 food groups, and 800 food components. It covers the food consumed in more than 123 countries from 1987 to 2022. At this stage, the extraction of additional tables and the curation and annotation processes are in progress. Curation consists of linking each table to the knowledge source from which it was built, identifying and deleting duplicate knowledge sources, and arranging the data in the CSV files so that they match the original PDF tables exactly. 
Annotation is carried out with biomedical ontologies identified through ontobee.org; FoodOn, SNOMED CT and NCIT are currently used. We also plan to annotate the tables with the Wikidata and DBpedia knowledge graphs. We expect to release the first curated and annotated version, comprising more than 20,000 tables, during the first quarter of 2023 so that it can be used in future editions of the SemTab challenge.</p></div>		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgment</head><p>We are grateful to the SemTab organizers for giving us the opportunity to share this work with the community. We are also grateful to Vinsight and neuralearn.ai for the training support.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Update of the Moroccan food composition tables: Towards a more reliable tool for nutrition research</title>
		<author>
			<persName><forename type="first">M</forename><surname>Khalis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Garcia-Larsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Charaka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M S</forename><surname>Deoula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">El</forename><surname>Kinany</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Benslimane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Charbotel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Soliman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Huybrechts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Soliman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Food Composition and Analysis</title>
		<imprint>
			<biblScope unit="volume">87</biblScope>
			<biblScope unit="page">103397</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Comparison of food composition tables/databases</title>
		<author>
			<persName><forename type="first">A</forename><surname>Jiomekong</surname></persName>
		</author>
		<ptr target="https://orkg.org/comparison/R206121/" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
