<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Annotation of TSOTSATable Dataset</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Azanzi Jiomekong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Uriel Melie</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hippolyte Tapamo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gaoussou Camara</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Yaounde I</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>EIR-IMTICE, University Alioune Diop de Bambey</institution>
          ,
          <country country="SN">Sénégal</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Food Composition Tables (FCT) or Food Composition Databases (FCD) are composed of tables that describe foods and their composition. During SemTab 2022, we proposed a Food Composition Table dataset that we called the TSOTSATable dataset. In this paper, we present how this dataset is being annotated using Wikidata, FoodOn and the Open Research Knowledge Graph (ORKG). The extracted tables are annotated using Wikidata and FoodOn. The scientific papers from which knowledge is extracted are annotated using ORKG. The annotation consists of matching the cells of the food tables to the Wikidata Knowledge Graph and the FoodOn ontology (CEA), matching the elements of the table describing the scientific papers to ORKG resources, detecting the type of the elements of each column (CTA), and matching the relations between columns to ORKG properties (CPA). During the annotation, we found that many tables were not relevant to the food domain. Thus, we added a new annotation task, Irrelevant Table Detection (ITD), which consists of identifying, in a domain-specific dataset, the tables that are not relevant to that domain.</p>
      </abstract>
      <kwd-group>
        <kwd>Food Science and Nutrition</kwd>
        <kwd>Food information engineering</kwd>
        <kwd>Food Composition Tables</kwd>
        <kwd>Semantic Table Annotation</kwd>
        <kwd>TSOTSATable dataset</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>GitHub<sup>1</sup> and Google Colaboratory<sup>2</sup>. A video showing how we automatically extract tables from
PDFs is also available<sup>3</sup>.</p>
      <p>A study of the TSOTSATable dataset showed that the data extracted from scientific papers
were more complete than the data extracted from Zenodo. In addition, Zenodo contains
a lot of tables that are not relevant to the domain of food science and nutrition. Thus, we decided to
start the annotation with the tables extracted from scientific papers.</p>
      <p>
        In this paper, we describe how this dataset is being curated using Wikidata, FoodOn [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
Open Research Knowledge Graph (ORKG) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Raw data is available on GitHub<sup>4</sup> and an excerpt
of this dataset, containing data extracted from papers of the "Journal of Food Composition
and Analysis", is already annotated<sup>5</sup> and published on the Zenodo repository [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The source code of the
TSOTSATable system used for the annotation is published under the MIT license. We provide
two code bases: one for the annotation of the tables using Wikidata and FoodOn<sup>6</sup> and one
for the annotation of scientific papers<sup>7</sup>. The dataset is published under the Creative Commons
Attribution-ShareAlike 4.0 International License<sup>8</sup>.
      </p>
      <p>In the rest of this paper, we present the annotation in Section 2, the overview of the dataset
in Section 3 and the conclusion in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. TSOTSATable annotation</title>
      <p>The curation started with the annotation of an excerpt of the TSOTSATable dataset, containing
data extracted from papers of the "Journal of Food Composition and Analysis". To this end, two
tools were designed: one for the automatic annotation using Wikidata and FoodOn and one
for the automatic annotation using the Open Research Knowledge Graph. In the following sections, a
short overview of the ontology and Knowledge Graphs used for the annotation is presented
(see Section 2.1). Thereafter, the different annotation tasks are presented in Section 2.2 and,
finally, the annotation process is presented in Section 2.3.</p>
      <sec id="sec-2-1">
        <title>2.1. Ontology and Knowledge Graphs Annotations</title>
        <p>To annotate the TSOTSATable dataset, we chose different vocabularies corresponding to the
different domains covered by the dataset. The dataset is composed of two types of tables:
the tables describing foods and their composition (food science and nutrition domain) and a
table containing the list of scientific papers (digital library domain) from which the tables were
extracted.</p>
        <p>The first step of the annotation consists of selecting the relevant KGs or ontologies to use.
Given that the dataset belongs to the food domain, we compared several excerpts of the dataset with several food
ontologies hosted on BioPortal. We used the BioPortal ontology recommender to search for
the most appropriate ontology in the food domain that can be used to annotate the dataset. Fig.
1 presents an example of the use of the ontology recommender to search for the most appropriate
ontology.</p>
        <p><sup>1</sup> https://github.com/Neuralearn/pdf-to-excel
<sup>2</sup> https://colab.research.google.com/drive/1gOPBCVO9VtKcoIewXyr_6nNoxo1Bkqbz
<sup>3</sup> www.youtube.com/watch?v=HZh31OGiQRQ
<sup>4</sup> https://github.com/jiofidelus/tsotsa/tree/main/TSOTSATable_dataset/rawData
<sup>5</sup> https://github.com/jiofidelus/tsotsa/tree/main/TSOTSATable_dataset/annotatedData
<sup>6</sup> https://github.com/jiofidelus/tsotsa/tree/main/TSOTSATable
<sup>7</sup> https://github.com/jiofidelus/SemTabTable-Papers/tree/main/sourceCode
<sup>8</sup> https://creativecommons.org/licenses/by-sa/4.0/</p>
        <p>
          From the ontology recommender, we found that FoodOn is the most appropriate ontology to
annotate the dataset. FoodOn<sup>9</sup> [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] is an OBO Foundry ontology used to describe domestic
animal food, animal and plant food sources, food categories and products, etc. The FoodOn
ontology can be explored using several ontology lookup services. In our case, we used the
Ontology Lookup Service<sup>10</sup> (OLS). OLS is a repository of several biomedical ontologies. We
used the OLS API to search for relevant annotations and annotate the dataset. To this end, we
first retrieve, for each cell, the list of candidate entities (CEA candidates). Thereafter, we identify
the type (CTA candidate) of each candidate entity. Finally, we identify the type to which the
majority of the cells are linked and vote it as the CTA of the column. The candidate entities found
during the lookup that are linked to the voted CTA are designated as the CEA of the different cells
of the table. To improve
the results obtained after the automatic annotation, a PhD holder in Food Science and Nutrition is
currently checking the annotated dataset.
        </p>
        <p><sup>9</sup> http://foodon.org
<sup>10</sup> https://www.ebi.ac.uk/ols/docs/api</p>
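        <p>The voting step described above can be illustrated with a minimal sketch. It assumes that the lookup has already returned, for each cell of a column, a list of (entity, type) pairs; the function names and the identifiers in the example are hypothetical and are not taken from the TSOTSATable source code.</p>
        <preformat>
```python
from collections import Counter


def vote_cta(candidates_per_cell):
    """Return the type linked to the largest number of cells, or None.

    candidates_per_cell holds one list per cell; each list contains
    (entity_id, type_id) pairs returned by the lookup (assumed input shape).
    """
    votes = Counter()
    for candidates in candidates_per_cell:
        # A cell votes at most once for each type it is linked to.
        for type_id in sorted({t for _, t in candidates}):
            votes[type_id] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def select_cea(candidates_per_cell, cta):
    """For each cell, keep the first candidate whose type is the voted CTA."""
    selected = []
    for candidates in candidates_per_cell:
        match = next((e for e, t in candidates if t == cta), None)
        selected.append(match)  # None plays the role of a NULL annotation
    return selected


# Hypothetical column with three cells; identifiers are placeholders.
column = [
    [("ent:apple_fruit", "type:plant_food"), ("ent:apple_company", "type:organization")],
    [("ent:banana", "type:plant_food")],
    [],  # the lookup returned nothing for this cell
]
cta = vote_cta(column)
cea = select_cea(column, cta)
```
        </preformat>
        <p>In this sketch, a cell whose candidates do not include the voted type keeps an empty annotation, which is one possible way to represent cells that have no match in the target vocabulary.</p>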
        <p>On the other hand, we manually searched for a set of terms in the Wikidata KG using its
search engine and we found that Wikidata contains a lot of relevant annotations. Wikidata<sup>11</sup> is
amongst the most popular KGs in the world. It has been used in the SemTab challenge since the
challenge was launched in 2019. Once we found that Wikidata contains relevant annotations,
we built an automatic tool for the annotation of the dataset using the Wikidata MediaWiki
API<sup>12</sup>. The same disambiguation process used during the annotation of the dataset with the FoodOn
ontology was used to select, amongst the candidate entities, the ones that match the elements of
the table.</p>
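        <p>As an illustration of such a lookup, the sketch below builds a search request for the MediaWiki API endpoint cited above and extracts candidate entities from a decoded JSON response. It assumes the wbsearchentities action with its search, language, type, limit and format parameters; the helper names and the sample response are hypothetical, and the actual TSOTSATable tool may query the API differently.</p>
        <preformat>
```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"  # endpoint cited in the paper


def build_search_url(label, language="en", limit=5):
    """Build a wbsearchentities request URL for a cell label."""
    params = {
        "action": "wbsearchentities",
        "search": label,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def extract_candidates(response):
    """Return (id, label) pairs from a decoded wbsearchentities response."""
    return [(hit["id"], hit.get("label", "")) for hit in response.get("search", [])]


url = build_search_url("palm oil")

# Made-up response fragment used only to show the expected shape.
sample_response = {"search": [{"id": "Q_HYPOTHETICAL", "label": "palm oil"}]}
candidates = extract_candidates(sample_response)
```
        </preformat>
        <p>The network call itself is left out so that the sketch stays self-contained; the URL could be fetched with any HTTP client and the decoded JSON passed to extract_candidates.</p>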
        <p>Concerning the annotation of the scientific papers from which the tables are extracted, we rely
on ORKG because we have extensive experience in using this KG for annotating scientific
papers. The Open Research Knowledge Graph<sup>13</sup> (ORKG) is a scholarly KG used to acquire, publish
and process structured knowledge published in the scholarly literature. It is built
according to the principles of Open Science, Open Data, and Open Source. We used the automatic
annotation feature of ORKG to annotate all the scientific papers from which the tables were
extracted.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Annotation tasks</title>
        <p>During the annotation process, we found that many tables were not relevant to the domain
of nutrition. In addition, ORKG is based on an ontology that describes a
research paper as composed of paper metadata and its semantic description. The semantic
description consists of (1) assigning ORKG classes to the different key insights extracted, (2)
defining several properties for comparing research contributions, and (3) building comparison tables of
research contributions dealing with the same research problem. Instances of this ontology
are created during paper annotation. Based on this, and on the annotation tasks generally
proposed by the SemTab challenge, we defined the following annotation tasks:
• Cell Entity Annotation (CEA): this consists of matching each cell of the tables to an
ontology/KG entity. The entities in the extracted tables were matched to Wikidata and
FoodOn. Concerning scientific papers, we used ORKG resources, which can be a class, an
instance, or a property.
• Column Type Annotation (CTA): this consists of assigning classes from the
ontology and KGs to the columns of the tables.
• Column Property Annotation (CPA): this consists of assigning a property to the
relationship between two columns of a table. We found it difficult to identify properties
between the columns of the food tables. Indeed, the majority of these tables contain numbers in
the cells and, sometimes in the headers, abbreviations of food components (for instance,
K = potassium, Fe = iron, Mn = manganese, etc.). The fact that the columns are filled only with
numbers makes it difficult to build an automatic tool for determining the relations between
two columns. Thus, in the current version of the dataset, this annotation task concerns
only the scientific papers.
• Irrelevant Table Detection (ITD): this task consists of detecting the tables that are
not relevant to the domain of Food and Nutrition. It should be noted that this task is
currently manual.</p>
        <p><sup>11</sup> https://www.wikidata.org/
<sup>12</sup> https://www.wikidata.org/w/api.php
<sup>13</sup> https://orkg.org/</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Annotation process</title>
        <p>The raw data contain the following types of files:
• TSOTSATable source: this is a CSV file containing information on the scientific papers from
which the tables were extracted. It is named 0 −  . in the dataset.
• TSOTSATable files: these are the CSV files containing the tables extracted from the
scientific papers. Each file is named using a unique identifier, which links
the file to the corresponding paper in the source file. The file name of
each table is obtained from the ID of the paper in the data source plus a number denoting the order of
appearance of the table in the paper. For instance, the third table in a scientific paper that has
the ID = 12 is named  _3.</p>
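        <p>The link between a table file and its paper can be sketched as follows. The exact file-name pattern is not recoverable from the text above, so the sketch assumes, purely for illustration, a name made of the paper ID, an underscore, the order of the table and a .csv extension; both helper functions are hypothetical.</p>
        <preformat>
```python
def table_file_name(paper_id, order):
    """Build a table file name from a paper ID and the table order.

    The "ID_order.csv" pattern is an assumption made for this sketch.
    """
    return f"{paper_id}_{order}.csv"


def parse_table_file_name(file_name):
    """Recover (paper_id, order) from a name following the assumed pattern."""
    stem = file_name.rsplit(".", 1)[0]       # drop the extension
    paper_id, order = stem.rsplit("_", 1)    # split on the last underscore
    return paper_id, int(order)


name = table_file_name(12, 3)
paper_id, order = parse_table_file_name(name)
```
        </preformat>
        <p>Splitting on the last underscore keeps the sketch usable even if the paper identifier itself contains underscores.</p>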
        <p>Concerning the annotation, we created three folders corresponding to the three vocabularies
used to annotate the TSOTSATable dataset. Each folder contains different target annotations:
• TSOTSATable_CEA: this is the file containing the CEA of the tables.
• TSOTSATable_CTA: this is the file containing the CTA of the tables.
• TSOTSATable_CPA: this is the file containing the CPA of the tables.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Annotated Dataset overview</title>
      <p>
        A subset containing 251 tables was annotated and published on the Zenodo repository [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This
subset contains:
• 38 irrelevant tables,
• 212 relevant tables,
• one table listing the scientific references from which the tables have been
extracted.
      </p>
      <p>Food Composition Tables were annotated using Wikidata and FoodOn, and the scientific papers
from which the data were extracted were annotated using the Open Research Knowledge Graph. Table 1
presents the number of entities and types annotated using Wikidata and FoodOn. An expert in
Food Science and Nutrition was invited to randomly sample these annotations and verify their
relevance. Concerning scientific papers, around 500 terms were annotated using ORKG.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>
        In a recent work, we extracted Food Composition data from scientific papers and we built
a tabular dataset with it [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This paper presents how this dataset is being annotated using
Wikidata, FoodOn and the Open Research Knowledge Graph. To this end, the Cell Entity Annotation
(CEA), Column Type Annotation (CTA), Column Property Annotation (CPA) and Irrelevant
Table Detection (ITD) tasks are considered. The first three tasks are well-known Semantic
Table Annotation tasks, whereas the last one was identified during the annotation process. In
fact, the table extraction tool extracts all the tables that a scientific paper contains, and
some of these tables are not relevant to the Food Science and Nutrition domain. Thus, we introduced
this new task. We found many NULL annotations, due to the fact that many entities do not
have a corresponding entry in Wikidata or FoodOn. It should be noted that the detection of irrelevant
tables is still done manually. We are planning to develop an additional module that
automatically detects the tables that are relevant to the Food and Nutrition domain before their
annotation.
      </p>
      <p>Future work consists of finalizing the annotation and using this dataset to build
TSOTSAGraph, a Food Composition Knowledge Graph.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jiomekong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Foko</surname>
          </string-name>
          ,
          <article-title>Towards an approach based on knowledge graph refinement for tabular data to knowledge graph matching</article-title>
          ,
          <year>2022</year>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Greenfield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Southgate</surname>
          </string-name>
          ,
          <article-title>Food composition data: production, management, and use</article-title>
          , Food &amp; Agriculture Org.,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Khalis</surname>
          </string-name>
          , et al.,
          <article-title>Update of the Moroccan food composition tables: Towards a more reliable tool for nutrition research</article-title>
          ,
          <source>Journal of Food Composition and Analysis</source>
          <volume>87</volume>
          (
          <year>2020</year>
          )
          <fpage>103397</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Azanzi</surname>
          </string-name>
          , et al.,
          <article-title>A large scale corpus of food composition tables, Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab), CEUR-WS. org (</article-title>
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dooley</surname>
          </string-name>
          , et al.,
          <article-title>Foodon: a harmonized food ontology to increase global food traceability, quality control and data integration, npj Science of Food 2 (</article-title>
          <year>2018</year>
          )
          <fpage>23</fpage>
          . doi:
          <pub-id pub-id-type="doi">10.1038/s41538-018-0032-6</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , et al.,
          <article-title>Improving access to scientific literature with knowledge graphs, BIBLIOTHEK - Forschung und Praxis (</article-title>
          <year>2020</year>
          ). doi:10.18452/22049.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jiomekong</surname>
          </string-name>
          , U. Melie,
          <article-title>TSOTSATable dataset: a dataset of food and its composition</article-title>
          ,
          <year>2023</year>
          . doi:
          <pub-id pub-id-type="doi">10.5281/zenodo.8169063</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>