=Paper=
{{Paper
|id=Vol-3324/om2022_LTpaper4
|storemode=property
|title=BiodivTab: semantic table annotation benchmark construction, analysis, and new additions
|pdfUrl=https://ceur-ws.org/Vol-3324/om2022_LTpaper4.pdf
|volume=Vol-3324
|authors=Nora Abdelmageed,Sirko Schindler,Birgitta König-Ries
|dblpUrl=https://dblp.org/rec/conf/semweb/AbdelmageedSK22
}}
==BiodivTab: semantic table annotation benchmark construction, analysis, and new additions==
BiodivTab: Semantic Table Annotation Benchmark Construction, Analysis, and New Additions

Nora Abdelmageed (1,2,3), Sirko Schindler (1,3), and Birgitta König-Ries (1,2,3)
(1) Heinz Nixdorf Chair for Distributed Information Systems, (2) Michael Stifel Center Jena, (3) Friedrich Schiller University Jena, Jena, Germany
Contact: nora.abdelmageed@uni-jena.de (ORCID 0000-0002-1405-6860), sirko.schindler@uni-jena.de (ORCID 0000-0002-0964-4457), birgitta.koenig-ries@uni-jena.de (ORCID 0000-0002-2382-9722)
Ontology Matching @ ISWC 2022. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Systems that annotate tabular data semantically have received increasing attention from the community in recent years; this process is commonly known as Semantic Table Annotation (STA). Its objective is to map individual table elements to their counterparts in a Knowledge Graph (KG): individual cells and columns are assigned to KG entities and classes to disambiguate their meaning. STA systems achieve high scores on the existing, synthetic benchmarks but often struggle on real-world datasets. Realistic evaluation benchmarks are therefore needed to enable the advancement of the field. In this paper, we detail the construction pipeline of BiodivTab, the first benchmark based on real-world data from the biodiversity domain. In addition, we compare it with existing benchmarks. Moreover, we highlight common data characteristics and challenges in the field. BiodivTab is publicly available (https://github.com/fusion-jena/BiodivTab) and contains 50 tables, a mixture of real and augmented samples from biodiversity datasets. It was used during the SemTab 2021 challenge, where participants achieved F1-scores of at most ∼60% across the individual annotation tasks. Such results show that domain-specific benchmarks are more challenging for state-of-the-art systems than synthetic datasets.

Keywords: Benchmark, Tabular Data, Cell Entity Annotation, Column Type Annotation, Knowledge Graph Matching

1. Introduction

Systems that annotate tabular data semantically have gained increasing attention from the community in recent years. Semantic Table Annotation (STA) tasks map individual table elements to their counterparts in a Knowledge Graph (KG) such as Wikidata [1] or DBpedia [2]. Here, individual cells and columns are assigned to KG entities and classes to disambiguate their meaning. The Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab, https://www.cs.ox.ac.uk/isg/challenges/sem-tab/) opened a call for the semantic interpretation of tabular data, inviting automated annotation systems, and established a common standard for evaluating them [3, 4, 5]. Most of its benchmarks are auto-generated with no particular domain focus [6, 7, 8, 9, 10, 11]. The ToughTables dataset (2T) [12], introduced in the 2021 edition of the challenge, is the only exception involving manual curation, but it is still artificially derived from general-domain data. Real-world and domain-specific datasets pose different challenges, as witnessed, e.g., by evaluation campaigns in other domains such as semantic web services [13]. Therefore, the development of STA systems has to be accompanied by suitable benchmarks to make them applicable in real-world scenarios.
Such benchmarks should reflect the idiosyncrasies and challenges inherent in different domains. In this paper, we focus on one important domain: biodiversity, the variety of life on Earth covering evolutionary, ecological, biological, and social forms. To preserve life in all its varieties, it is imperative to monitor the current state of biodiversity and its change over time and to understand the forces driving it. The recent IPBES global assessment (https://ipbes.net/global-assessment) predicts a dramatic decrease in biodiversity, causing a marked decay in vital ecological functions. An expanding volume of heterogeneous data, especially tables, is produced and publicly shared in the biodiversity domain. Tapping into this wealth of information requires two main steps: On the one hand, individual datasets have to be fit for (re)use, a requirement that resulted in the FAIR principles [14]. On the other hand, complex analyses often require data from different sources, e.g., to examine the various interdependencies among processes in an ecosystem. The datasets involved need to be integrated, which requires a certain degree of harmonization and mappings between them [15]. The semantic annotation of the respective datasets can substantially support both goals. Our unique contributions in this paper over our previous work [16] are as follows:
• A detailed explanation of the creation and data augmentation of BiodivTab.
• An extensive discussion of idiosyncrasies and challenges in biodiversity datasets.
• The creation of a new ground truth based on DBpedia.
• A characterization of BiodivTab, including the concepts covered.
• An evaluation of BiodivTab compared to other existing benchmarks.
• Applications of BiodivTab.
The remainder of this paper is organized as follows: Section 2 summarizes the required background. We detail the construction of BiodivTab in Section 3. Section 4 provides an evaluation of BiodivTab. Finally, we conclude in Section 5.

2. Background

Semantic Table Annotation: The SemTab challenge has provided a community forum for STA tasks over the course of four editions so far: 2019-2021 [6, 7, 17] and 2022 (https://sem-tab-challenge.github.io/2022/). The challenge established common standards to evaluate different approaches in the field and captures increasing attention from the community. The best-performing participants in 2021 were DAGOBAH [18], MTab [19], and JenTab [20]. The challenge formulated three tasks, illustrated by Figure 1. Each task matches a table component to its counterpart within a target KG:
• Cell Entity Annotation (CEA) matches individual cells to entities.
• Column Type Annotation (CTA) assigns a semantic column type.
• Column Property Annotation (CPA) links column pairs using a semantic property.
Figure 1: STA tasks as defined by SemTab, using a biodiversity example: CEA maps the cell "Orchis" to wd:Q161714 and the cell "Lilium" to wd:Q5194627, CTA types the genus column as wd:Q34740 ("genus"), and CPA links a column pair via wdt:P846 ("GBIF taxon ID"). We use the following prefixes throughout this paper: dbr: http://dbpedia.org/resource/, dbo: http://dbpedia.org/ontology/, rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#, rdfs: http://www.w3.org/2000/01/rdf-schema#, wd: http://www.wikidata.org/entity/, wdt: http://www.wikidata.org/prop/direct/, and owl: http://www.w3.org/2002/07/owl#.
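To make these task outputs concrete, the following minimal sketch writes SemTab-style solution rows for the Figure 1 example. The file name and the row and column indices are hypothetical; the column order follows the ground truth format described in Section 3.2.

```python
import csv

# Hypothetical table name and cell positions, purely for illustration.
# CEA rows: filename, column id, row id, annotation (Wikidata entity)
cea_rows = [
    ("table_0001", 0, 1, "http://www.wikidata.org/entity/Q161714"),   # cell "Orchis"
    ("table_0001", 0, 2, "http://www.wikidata.org/entity/Q5194627"),  # cell "Lilium"
]
# CTA rows: filename, column id, annotation (Wikidata class)
cta_rows = [
    ("table_0001", 0, "http://www.wikidata.org/entity/Q34740"),       # column typed as genus
]

with open("CEA_example_gt.csv", "w", newline="") as f:
    csv.writer(f).writerows(cea_rows)
with open("CTA_example_gt.csv", "w", newline="") as f:
    csv.writer(f).writerows(cta_rows)
```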
Existing Benchmarks: The ultimate goal for STA systems is to annotate real-world datasets. However, the datasets introduced in the first two years of the challenge are synthetic, derived from different KGs [6, 7]. In 2020, the 2T dataset [12] was manually curated and focuses on the disambiguation of possible annotation solutions. The datasets employed so far adhere to no particular domain but represent a sample from a wide range of general-purpose data. On the other hand, domain-specific datasets pose specific challenges, as witnessed, e.g., by evaluation campaigns in other domains such as semantic web services [13]. So, to ensure that those challenges are covered, there is a demand for domain-specific datasets based on real-world data. Such benchmarks have to comply with the standards already in use by the community to easily highlight current shortcomings and encourage further efforts on this topic.

3. BiodivTab Construction

In this section, we explain the creation of BiodivTab and the data sources used. Moreover, we describe the manual annotation phase involving biodiversity experts, the data augmentation step, and the final assembly and release of the benchmark. Figure 2 summarizes the construction of BiodivTab, which we detail in the following.
Figure 2: Steps of BiodivTab construction: data collection from biodiversity data sources, manual annotation, biodiversity expert revision, data augmentation, CTA ancestors construction, and assembly of the benchmark (tables, ground truth, and targets).

3.1. Data Collection

We decided on three data repositories that are well established for ecological data: BExIS (https://www.bexis.uni-jena.de/), BEFChina (https://data.botanik.uni-halle.de/bef-china/), and data.world (https://data.world/). We queried these portals using 20 keywords from our previous work [21], e.g., abundance and species. Subsequently, we manually checked all of the retrieved datasets regarding their suitability for the STA tasks. We discarded datasets that consisted mostly of, e.g., internal database "ID" columns or numerical columns without any explanation or context. We consider such datasets impossible to annotate automatically and of little benefit to the community. Consequently, we decided to include only datasets containing essential categorical information. We selected 6 out of 32 datasets from data.world, 4 out of 15 from BExIS, and 3 out of 25 from BEFChina. data.world provided the most suitable datasets for STA and thus contributes about half of the datasets in BiodivTab. Our analysis of the collected data shows that, in addition to common challenges, real-world datasets feature unique characteristics. We enumerate the challenges encountered in our sample of datasets below and summarize their prevalence in Table 1.
Table 1: Prevalence of challenges among the selected datasets (dataworld_1, dataworld_2, dataworld_4, dataworld_6, dataworld_10, dataworld_27, befchina_1, befchina_6, befchina_20, Bexis_24867, Bexis_25126, Bexis_25786, Bexis_27228): nested entities, acronyms, typos, numerical data, missing values, lack of context, synecdoche, and specimen data.
• Nested Entities: more than one proper entity in a single cell, e.g., a chemical compound combined with a unit of measurement.
• Acronyms: Abbreviations of different sorts are common, e.g., "Canna glauca", a particular kind of flower, is often referred to as "C.glauca" or "Ca.glauce".
• Typos: Data is predominantly collected manually by humans, so misspellings occur, e.g., "Dead Leav" is used for "Dead Leaves".
• Numerical Data: Most of the collected datasets describe specimens by various measurements in numerical form.
• Missing Values: Collected data can be sparse and may include gaps, e.g., a "super kingdoms" column may consist of "unknown" values for the most part.
• Lack of Context: The collected data may barely provide any informative context for semantic annotations, e.g., a column with a missing or severely misspelled header.
• Synecdoche: Scientists may use a general entity as a short form for a more particular one, e.g., "Kentucky" is used instead of "Kentucky River".
• Specimen Data: The collected datasets contain observations of particular specimens or groups, but do not pertain to the species as a whole.

3.2. Manual Annotation & Biodiversity Expert Revision

The annotation phase was the most time-consuming part of the benchmark creation, since it included multiple rounds of revision. To ensure the quality of the mappings, we manually annotated the selected tables with entities assembled from the live edition of Wikidata during September 2021, resulting in ground truth data for both the CEA and CTA tasks. Concerning CEA, we marked possible candidate columns, typically those with categorical values, to annotate their cells. For each cell value, we assembled possible matches via Wikidata's built-in search. If we found multiple matches, we manually selected the most suitable one to disambiguate the cell semantically. If we could not settle on a single annotation, we picked all plausible ones and considered them true matches. Thus, the provided ground truth contains all proper candidates for a given cell value. Biodiversity experts reviewed around one third of the annotations. This revealed an error rate of about 1%. Because of the low error rate, the effort of this step outweighs its benefits; thus, we decided to annotate the remainder without further revisions.
We followed the same procedure for CTA. For categorical columns, we looked for a common type among the column's cells, taking the header value into consideration, to decide on the semantic type from Wikidata. Most of these columns are identified by the value of (wdt:P31, instance of) as the perfect annotation. However, finding such a perfect annotation for taxon-related columns is not that easy: since all taxon-related fields are an instance of taxon, we believed this might not be distinguishable enough. In the biodiversity domain, experts are keen on more fine-grained modeling; e.g., species, genus, and class would be different types in their opinion.
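As described above, cell candidates were assembled via Wikidata's built-in search before the most suitable match was chosen manually. The following minimal sketch illustrates such a candidate lookup; it uses the public wbsearchentities API for illustration only and is not our exact annotation tooling.

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def search_candidates(cell_value, limit=10):
    """Retrieve candidate Wikidata entities for a cell value via the built-in search."""
    params = {
        "action": "wbsearchentities",
        "search": cell_value,
        "language": "en",
        "format": "json",
        "limit": limit,
    }
    response = requests.get(WIKIDATA_API, params=params, timeout=30)
    response.raise_for_status()
    # Each hit carries the Wikidata id, a label, and a short description,
    # which the annotators can use to pick the most suitable match manually.
    return [(hit["id"], hit.get("label", ""), hit.get("description", ""))
            for hit in response.json().get("search", [])]

# Example: assemble candidates for one cell value of a categorical column.
print(search_candidates("Canna glauca"))
```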
We established a simple one-question questionnaire for our biodiversity experts to select the perfect semantic type for a given taxonomic term, as shown in Table 2. The first column shows the cell values with their corresponding mapping entities. The question is which of the two types, A or B, is the most accurate. We derive Type A from (wdt:P105, taxon rank) and Type B from (wdt:P31, instance of) in Wikidata. Based on their answers, Type A is the most fine-grained classification; however, they consider Type B a correct type as well. Thus, we selected the perfect types for taxons via (wdt:P105, taxon rank). Numerical columns, in contrast, are mostly identified by their column headers.
Table 2: Which type would be correct for the given taxons?
Taxon | Type (A) | Type (B)
Bacteria (wd:Q10876) | superkingdom (wd:Q19858692) | taxon (wd:Q16521)
Actinobacteria (wd:Q130914) | phylum (wd:Q38348) | taxon (wd:Q16521)
Actinobacteria (wd:Q26262282) | class (wd:Q37517) | taxon (wd:Q16521)
Pseudonocardiales (wd:Q26265279) | order (wd:Q36602) | taxon (wd:Q16521)
Pseudonocardiaceae (wd:Q7255180) | family (wd:Q35409) | taxon (wd:Q16521)
Goodfellowiella (wd:Q26219639) | genus (wd:Q34740) | taxon (wd:Q16521)
Goodfellowiella coeruleoviolacea (wd:Q25859622) | species (wd:Q7432) | taxon (wd:Q16521)
We maintain separate ground truth files for each table to ease manual inspection, revision, and quality assurance. So "befchina_1", e.g., is annotated by two such files: "befchina_1_CEA" and "befchina_1_CTA". The structure of the ground truth files follows the format of the SemTab challenge. In particular, the solution files for CEA use a format of filename, column id, row id, and ground truth, whereas the ones for CTA employ a structure of filename, column id, and ground truth.

3.3. Data Augmentation

We further used data augmentation to increase the number of tables in our benchmark and reduce the human effort needed. In our context, we introduced challenges into the existing datasets based on our findings during the data collection and analysis phase; thus, we rely on real-world challenges that we added programmatically to increase the amount of data. Table 3 shows the data augmentation techniques used per dataset and the number of variations derived from each.
Table 3: Data augmentation techniques per dataset.
Dataset | Disambiguate | Abbreviate | Merge Cols | Separate Cols | Add Typos | Fix Typos | Increase Gap | Alter Cols | No. Files
dataworld_1 | x3 | - | - | - | - | - | x3 | x1 | 7
dataworld_2 | x3 | - | x1 | - | - | - | x1 | - | 5
dataworld_4 | x4 | - | - | x1 | x1 | x2 | x1 | - | 9
dataworld_6 | - | x1 | - | - | - | - | - | - | 1
dataworld_10 | - | - | - | - | - | - | x1 | - | 1
dataworld_27 | x1 | - | x2 | - | - | - | - | - | 3
befchina_1 | x2 | - | - | - | - | - | x1 | - | 3
befchina_20 | x4 | - | - | - | - | x1 | - | - | 5
Bexis_24867 | - | - | - | - | - | x1 | x2 | - | 3
Total | | | | | | | | | 37
In the following, we list the techniques used and how they relate to the collected data issues:
• Merge and Separate Columns: we either introduced new nested entities by merging columns or split existing nested entities up into separate columns.
• Add and Fix Typos: we added noise to categorical cell values and, on rare occasions, fixed existing typos.
• Disambiguate: we replaced concepts with more accurate ones, e.g., a state is replaced by the river it stands for.
• Abbreviate: we introduced more abbreviations, especially for taxon-related values.
• Alter Columns: we removed one or more data columns, resulting in less informative and sparser datasets.
We managed to create the most variations from data.world, since its datasets contain more categorical data that can be mapped to KG entities. Our data augmentation strategy increased the number of tables to 50 with little additional manual annotation effort.

3.4. CTA Ancestors Construction

To enable the approximate CTA F1, Precision, and Recall scores [4], we provide an ancestors ground truth for our perfectly annotated types. The corresponding file is structured in a key-value format, with keys representing the perfect annotations and values listing parent classes. We refer to those parents as "okay" classes. Initially, we collected all unique column types from the manually assigned perfect annotations; these are used to initialize a dictionary.
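A rough sketch of this ancestor collection is shown below; it assumes the public Wikidata SPARQL endpoint and the SPARQLWrapper package and is meant as an illustration rather than our exact implementation. The three query levels it issues are detailed in the following paragraph.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

WDQS = "https://query.wikidata.org/sparql"

def query_parents(qids, prop):
    """Return the QIDs reachable from the given items via prop (P31 or P279)."""
    if not qids:
        return set()
    sparql = SPARQLWrapper(WDQS, agent="BiodivTab-ancestors-sketch/0.1")
    values = " ".join(f"wd:{qid}" for qid in qids)
    sparql.setQuery(f"""
        SELECT DISTINCT ?parent WHERE {{
          VALUES ?item {{ {values} }}
          ?item wdt:{prop} ?parent .
        }}""")
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return {b["parent"]["value"].rsplit("/", 1)[-1] for b in bindings}

# Initialize the dictionary from the unique perfect annotations and
# attach up to three levels of parent ("okay") classes to each of them.
ancestors = {}
for perfect in ["Q7432"]:                      # e.g., species; in practice all unique column types
    e1 = query_parents([perfect], "P31")       # level E1: direct types (instance of)
    e2 = query_parents(e1, "P279")             # level E2: parent classes of E1 (subclass of)
    e3 = query_parents(e2, "P279")             # level E3: parent classes of E2
    ancestors[perfect] = sorted(e1 | e2 | e3)
```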
Afterwards, we ran a sequence of three SPARQL queries against the public endpoint to retrieve related classes for each of them. For the first level, we query for direct types via (wdt:P31, instance of); we call them "E1". For the second level, we query for further parent classes via (wdt:P279, subclass of) of the previous E1, resulting in "E2". For the third and last level, we repeat the process using the entities in E2, yielding "E3". If the initial column type is already a class (e.g., wd:Q60026969, unit of concentration), we skip the first step and only use the latter two. The resulting hierarchy consists of one perfect annotation with up to three levels of classes that are considered okay annotations. For taxon-related columns, we marked the value of (wdt:P105, taxon rank) as the perfect annotation, following the biodiversity experts' recommendation. Additionally, we included (wd:Q16521, taxon) and (wd:Q21871294, living organism) as okay classes.

3.5. Assembly and Release

For publication, we anonymized the tables' file names with unique identifiers generated by Python's uuid functionalities. Subsequently, we aggregated the individual solutions of the CEA and CTA tasks into one file per task, resulting in CEA_biodivtab_2021_gt.csv and CTA_biodivtab_2021_gt.csv, respectively. We generated the corresponding "target" files by removing the ground truth column from these solution files. To evaluate a particular system, we provide the anonymized tables alongside the target files; the ground truth files, together with the dictionary of related classes (CTA ancestors), are subsequently used to evaluate the results. This follows the general approach of SemTab, which hides the ground truth of the STA tasks from participants during the challenge. BiodivTab was awarded the IBM Research (https://www.research.ibm.com/) first prize in the third round of the 2021 SemTab challenge [17] for the new challenges it poses in the CEA and CTA tasks.

3.6. DBpedia Ground Truth

In 2022, we included annotations from DBpedia that are based on the Wikidata annotations in two ways (this new ground truth is going to be used in SemTab 2022; thus, we release the new benchmark version after the conclusion of the challenge). First, we exploited the link between Wikidata entities and corresponding Wikipedia pages. As there is a one-to-one correspondence between Wikipedia pages and DBpedia entities, we generated a Wikidata-DBpedia mapping from them. Second, we extracted owl:sameAs mappings between Wikidata and DBpedia to complete our mapping from DBpedia itself. Although these direct mappings appeared promising to begin with, they contain serious data quality issues. As of April 2022, L-glutamic acid (wd:Q26995161) is mapped to 1772 entities within the DBpedia graph using owl:sameAs. Thus, the resulting mappings were again manually verified to ensure the overall quality of the final DBpedia ground truth data. The generated types for CTA contained only instances/resources from DBpedia; during the manual verification, we further added classes from the DBpedia ontology. We attempted to replicate our approach from Wikidata using rdf:type and rdfs:subClassOf to retrieve the CTA ancestors. However, some relations in the DBpedia ontology seemed unreasonable to us; for example, at the time of writing, DBpedia contains the triple dbr:Species rdf:type dbo:MilitaryUnit. For these and other similar scenarios, we decided not to include an ancestors file for DBpedia.

4. Evaluation

In this section, we give a detailed overview of BiodivTab in terms of size and content compared to existing benchmarks. In addition, we show the most and least frequent CTA types.
Finally, we demonstrate the application of our benchmark using the results of STA systems during SemTab's 2021 edition.

4.1. BiodivTab Characteristics

Table 4 summarizes the selected datasets in terms of their original and selected sizes and the number of CEA and CTA mappings. For large datasets, e.g., dataworld_4 and dataworld_27, we selected a subset of rows that retains the table characteristics; most of the redundant species were dropped. Nevertheless, we kept the entire extent of the BExIS datasets, including redundant entries, to achieve a good balance between large tables and those with a reasonable length for STA systems. The column mappings reflect the specimen-data characteristic: columns with only local measurements or local database names could not be matched to the KG. For example, only 4 out of 18 columns in dataworld_1 could be matched to KG entities.
Table 4: Original and selected table sizes, and entity and type mappings.
Dataset | Original Rows | Original Cols | Selected Rows | Selected Cols | CTA | CEA
dataworld_1 | 332 | 18 | 100 | 18 | 4 | 210
dataworld_2 | 37 | 25 | 37 | 8 | 8 | 226
dataworld_4 | 42,337 | 67 | 100 | 40 | 26 | 476
dataworld_6 | 271 | 6 | 100 | 6 | 4 | 103
dataworld_10 | 497 | 15 | 100 | 13 | 11 | 902
dataworld_27 | 95,368 | 12 | 100 | 12 | 5 | 398
befchina_1 | 7,553 | 16 | 145 | 16 | 3 | 294
befchina_6 | 26 | 4 | 26 | 4 | 2 | 53
befchina_20 | 787 | 45 | 99 | 43 | 28 | 304
Bexis_24867 | 151 | 13 | 151 | 13 | 9 | 159
Bexis_25126 | 4,906 | 35 | 4,906 | 14 | 6 | 9,816
Bexis_25786 | 2,001 | 39 | 2,001 | 21 | 5 | 4,017
Bexis_27228 | 1,549 | 8 | 1,549 | 8 | 3 | 4,646
Total | | | | | 114 | 21,604
Avg. | | | | | 8.8 | 1,661.8
Figure 3 shows the domain distribution of the 83 unique semantic types in the CTA solutions. Approximately two-thirds of these types belong to the biodiversity domain. The distinction into biodiversity-related, general-domain, and mixed types was made according to the definitions introduced in [21, 22]. General-domain types include, e.g., visibility, scale, cost, and airport. Mixed-domain types contain examples like river, temperature, or the sex of humans. Biodiversity-related types include taxa, chemical compounds, and soil types. In addition, Table 5 provides a list of the most and least frequent semantic types in BiodivTab. Species (wd:Q7432) is the most frequent type, which reflects its importance in biodiversity research.
Figure 3: Domain distribution in the BiodivTab benchmark (Biodiversity 68.67%, General 15.66%, Mixed 15.66%).
Table 5: Most and least frequent semantic types in BiodivTab. (Calendar year, wd:Q3186692, is equivalent to year, wd:Q577.)
Most frequent: Wikidata Id | Label | Freq.
wd:Q7432 | Species | 39
wd:Q706 | calcium | 26
wd:Q577 | year | 19
wd:Q677 | iron | 16
wd:Q731 | manganese | 16
Least frequent: Wikidata Id | Label | Freq.
wd:Q8066 | amino acid | 1
wd:Q11173 | chemical compound | 1
wd:Q60026969 | unit of concentration | 1
wd:Q2463705 | Special Protection Area | 1
wd:Q1061524 | intensity | 1
Table 6 shows both the data sources and the target KGs or resources for BiodivTab and existing benchmarks. The three editions of SemTab from 2019 to 2021 [6, 7, 8] used both Wikidata and Wikipedia [23] as table sources. However, the target KGs vary between DBpedia, Wikidata, or both.
Table 6: Data sources for existing benchmarks and their corresponding targets. Entries for SemTab are aggregated over all rounds of each edition.
Dataset | Data Source | Target Annotation
SemTab 2019 | Wikidata, Wikipedia | DBpedia
SemTab 2020 | Wikidata, Wikipedia | Wikidata
SemTab 2021 | Wikidata, Wikipedia | Wikidata, DBpedia
T2Dv2 | WebTables | DBpedia
Limaye | Wikipedia | DBpedia
GitTables | GitHub | DBpedia, Schema.org
BiodivTab | BExIS, BEFChina, data.world | Wikidata, DBpedia
T2Dv2 [11] and Limaye [10] use WebTables [24] and Wikipedia as their data sources, respectively, while providing annotations from DBpedia. GitTables [25] and its adapted version [9] for the SemTab 2021 challenge leverage GitHub as a table source and provide annotations from DBpedia and Schema.org. Unlike all the previous benchmarks, BiodivTab uses domain-specific data portals as table sources. It provides Wikidata annotations like SemTab 2020 and 2021.
Table 7 shows a comparison between BiodivTab and existing benchmarks in terms of the average number of rows, columns, and cells. It also gives an overview of the targets for CEA, CTA, and CPA. BiodivTab is the smallest in terms of the number of tables. However, it has the maximum average number of columns, and the maximum average number of rows except for SemTab 2021 Round 1 and BioTables in Round 2. This poses an additional challenge for STA systems. For CTA targets, BiodivTab is a middle point among the existing benchmarks.
Table 7: Comparison with existing benchmarks. ST19-ST21 denote SemTab editions; *_W and *_D use Wikidata and DBpedia as targets, respectively. ST21-H2 and ST21-H3 are HardTables for Rounds 2 and 3 of SemTab 2021; ST21-Bio is BioTables at SemTab 2021 Round 2; ST21-Git is the published version of GitTables during SemTab 2021 Round 3.
Dataset | Tables | Avg. Rows (± Std Dev.) | Avg. Cols (± Std Dev.) | Avg. Cells (± Std Dev.) | CEA | CTA | CPA
ST19-R1 | 64 | 142 ± 139 | 5 ± 2 | 696 ± 715 | 8,418 | 120 | 116
ST19-R2 | 11,924 | 25 ± 52 | 5 ± 3 | 124 ± 281 | 463,796 | 14,780 | 6,762
ST19-R3 | 2,161 | 71 ± 58 | 5 ± 1 | 313 ± 262 | 406,827 | 5,752 | 7,575
ST19-R4 | 817 | 63 ± 52 | 4 ± 1 | 268 ± 223 | 107,352 | 1,732 | 2,747
ST20-R1 | 34,294 | 7 ± 4 | 5 ± 1 | 36 ± 20 | 985,110 | 34,294 | 135,774
ST20-R2 | 12,173 | 7 ± 7 | 5 ± 1 | 36 ± 18 | 283,446 | 26,726 | 43,753
ST20-R3 | 62,614 | 7 ± 5 | 4 ± 1 | 23 ± 18 | 768,324 | 97,585 | 166,633
ST20-R4 | 22,390 | 109 ± 11,120 | 4 ± 1 | 342 ± 33,362 | 1,662,164 | 32,461 | 56,475
ST21-R1_W | 180 | 1,080 ± 2,798 | 5 ± 2 | 4,125 ± 10,947 | 663,655 | 539 | NA
ST21-R1_D | 180 | 1,080 ± 2,798 | 4 ± 2 | 3,952 ± 10,129 | 636,185 | 535 | NA
ST21-H2 | 1,750 | 17 ± 8 | 3 ± 1 | 55 ± 32 | 47,439 | 2,190 | 3,835
ST21-Bio | 110 | 2,448 ± 193 | 6 ± 1 | 14,605 ± 2,338 | 1,391,324 | 656 | 546
ST21-H3 | 7,207 | 8 ± 5 | 2 ± 1 | 20 ± 15 | 58,948 | 7,206 | 10,694
ST21-Git | 1,101 | 58 ± 95 | 16 ± 12 | 690 ± 1,159 | NA | 2,516 | NA
ST21-Git | 1,101 | 58 ± 95 | 16 ± 12 | 690 ± 1,159 | NA | 720 | NA
T2Dv2 | 779 | 85 ± 270 | 5 ± 3 | 359 ± 882 | NA | 237 | NA
Limaye | 428 | 24 ± 22 | 2 ± 1 | 51 ± 50 | NA | 84 | NA
BiodivTab_W | 50 | 259 ± 743 | 24 ± 13 | 4,589 ± 10,862 | 33,405 | 614 | NA
BiodivTab_D | 50 | 259 ± 743 | 24 ± 13 | 4,589 ± 10,862 | 33,405 | 569 | NA

4.2. Applications

Table 8 shows the scores of the top SemTab 2021 participants on BiodivTab and HardTables during Round 3, as published by the organizers of SemTab 2021 [17]. Details about the mentioned systems are beyond the scope of this paper. For BiodivTab, the maximum CEA F1-score of 60.2% is achieved by JenTab [20], while the maximum CTA score of 59.3% is achieved by KEPLER [26]. In contrast, on the synthetic HardTables dataset, DAGOBAH achieved the maximum F1-scores of 97.4% and 99% for CEA and CTA, respectively. These results show that annotating real-world, domain-specific tables is far from solved by state-of-the-art STA systems. This further underlines the importance of benchmarks like BiodivTab to foster the transfer of academic projects to real-world applications.
Table 8: Scores of the top SemTab 2021 participants on the BiodivTab and HardTables 3 benchmarks. F1 - F1 score, Pr - precision, R - recall; AF1, APr, and AR - approximate versions of the F1 score, precision, and recall, respectively.
System | BiodivTab CEA (F1 / Pr / R) | BiodivTab CTA (AF1 / APr / AR) | HardTables 3 CEA (F1 / Pr / R) | HardTables 3 CTA (AF1 / APr / AR)
MTab [19] | 0.522 / 0.527 / 0.517 | 0.123 / 0.282 / 0.079 | 0.968 / 0.968 / 0.968 | 0.984 / 0.984 / 0.984
Magic [27] | 0.142 / 0.192 / 0.112 | 0.1 / 0.253 / 0.063 | 0.641 / 0.721 / 0.577 | 0.687 / 0.687 / 0.688
DAGOBAH [18] | 0.496 / 0.497 / 0.495 | 0.381 / 0.382 / 0.38 | 0.974 / 0.974 / 0.974 | 0.99 / 0.99 / 0.99
mantisTable [28] | 0.264 / 0.785 / 0.159 | 0.061 / 0.076 / 0.051 | 0.959 / 0.984 / 0.935 | 0.965 / 0.973 / 0.958
JenTab [20] | 0.602 / 0.611 / 0.539 | 0.107 / 0.107 / 0.107 | 0.94 / 0.94 / 0.939 | 0.942 / 0.942 / 0.942
KEPLER [26] | NA / NA / NA | 0.593 / 0.595 / 0.591 | NA / NA / NA | 0.244 / 0.279 / 0.217

4.3. Availability and Long-Term Plan

Resources should be easily accessible to allow replication and reuse. We follow the FAIR (Findable, Accessible, Interoperable, and Reusable) guidelines to publish our contributions [14]. We release our dataset [29] in such a way that researchers in the community can benefit from it. In addition, we release the code [30] that was used to augment the data and to assemble and reconcile the benchmark. Our dataset and code are released under the Creative Commons Attribution 4.0 International (CC BY 4.0) License and the Apache License 2.0, respectively.

5. Conclusions and Future Work

We introduced BiodivTab, the first biodiversity tabular benchmark for Semantic Table Annotation tasks. It consists of a collection of 50 tables. BiodivTab was created manually by annotating 13 tables from real-world biodiversity datasets and adding 37 more tables by augmenting them with noise based on challenges that are commonly observed in the domain. The target knowledge graphs for the annotations are Wikidata and DBpedia. An evaluation during SemTab 2021 showed that current state-of-the-art systems still struggle with the challenges posed. This highlights BiodivTab's importance for further development in the field. BiodivTab itself and the code used to create it are publicly available.
Future Work: We see multiple directions to continue this work. We plan to include more biodiversity tables from other projects to cover a broader domain spectrum. We also plan to apply further quality checks to the annotations, such as annotation by multiple annotators and validation via inter-rater agreement. In addition, we plan to provide ground truth data from other knowledge graphs, particularly domain-specific ones. Moreover, we plan to analyze the performance of STA systems on BiodivTab.
Acknowledgments: The authors thank the Carl Zeiss Foundation for the financial support of the project "A Virtual Werkstatt for Digitization in the Sciences (P5)" within the scope of the program line "Breakthroughs: Exploring Intelligent Systems" for "Digitization - explore the basics, use applications". We thank our biodiversity experts Cornelia Fürstenau and Andreas Ostrowski for feedback on and validation of the created annotations.
References
[1] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Communications of the ACM 57 (2014) 78–85. doi:10.1145/2629489.
[2] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. Ives, DBpedia: A nucleus for a web of open data, in: The Semantic Web, 2007, pp. 722–735.
[3] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas, SemTab 2019: Resources to benchmark tabular data to knowledge graph matching systems, in: European Semantic Web Conference, Springer, 2020, pp. 514–530.
[4] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas, V. Cutrona, Results of SemTab 2020, in: CEUR Workshop Proceedings, volume 2775, 2020, pp. 1–8.
[5] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas, V. Cutrona, Results of SemTab 2021, in: CEUR Workshop Proceedings, 2021.
[6] O. Hassanzadeh, V. Efthymiou, J. Chen, E. Jiménez-Ruiz, K. Srinivas, SemTab 2019: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching - 2019 Data Sets, 2019. doi:10.5281/zenodo.3518539.
[7] O. Hassanzadeh, V. Efthymiou, J. Chen, E. Jiménez-Ruiz, K. Srinivas, SemTab 2020: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching Data Sets, 2020. doi:10.5281/zenodo.4282879.
[8] O. Hassanzadeh, V. Efthymiou, J. Chen, E. Jiménez-Ruiz, K. Srinivas, SemTab 2021: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching Data Sets, 2021.
[9] M. Hulsebos, Ç. Demiralp, P. Groth, GitTables: A Large-Scale Corpus of Relational Tables, arXiv preprint arXiv:2106.07258 (2021).
[10] G. Limaye, S. Sarawagi, S. Chakrabarti, Annotating and searching web tables using entities, types and relationships, Proceedings of the VLDB Endowment 3 (2010) 1338–1347.
[11] O. Lehmberg, D. Ritze, R. Meusel, C. Bizer, A large public corpus of web tables containing time and context metadata, in: Proceedings of the 25th International Conference Companion on World Wide Web, 2016, pp. 75–76.
[12] V. Cutrona, F. Bianchi, E. Jiménez-Ruiz, M. Palmonari, Tough Tables: Carefully Evaluating Entity Linking for Tabular Data, 2020. doi:10.5281/zenodo.4246370.
[13] U. Küster, B. König-Ries, Towards standard test collections for the empirical evaluation of semantic web service approaches, International Journal of Semantic Computing 2 (2008) 381–402.
[14] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016) 1–9.
[15] L. M. Gadelha Jr, P. C. de Siracusa, E. C. Dalcin, L. A. E. da Silva, D. A. Augusto, E. Krempser, H. M. Affe, R. L. Costa, M. L. Mondelli, P. M. Meirelles, et al., A survey of biodiversity informatics: Concepts, practices, and challenges, Wiley Interdisciplinary Reviews 11 (2021) e1394.
[16] N. Abdelmageed, S. Schindler, B. König-Ries, BiodivTab: A table annotation benchmark based on biodiversity research data, in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching co-located with the 20th International Semantic Web Conference (ISWC 2021), Virtual conference, October 27, 2021, volume 3103 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 13–18.
[17] V. Cutrona, J. Chen, V. Efthymiou, O. Hassanzadeh, E. Jiménez-Ruiz, J. Sequeda, K. Srinivas, N. Abdelmageed, M. Hulsebos, D. Oliveira, et al., Results of SemTab 2021, Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching 3103 (2022) 1–12.
[18] V.-P. Huynh, J. Liu, Y. Chabot, T. Labbé, P. Monnin, R. Troncy, DAGOBAH: Table and Graph Contexts for Efficient Semantic Annotation of Tabular Data, in: SemTab@ISWC, 2021.
[19] P. Nguyen, I. Yamada, N. Kertkeidkachorn, R. Ichise, H. Takeda, SemTab 2021: Tabular Data Annotation with MTab Tool, in: SemTab@ISWC, 2021.
[20] N. Abdelmageed, S. Schindler, JenTab Meets SemTab 2021's New Challenges, in: SemTab@ISWC, 2021.
[21] N. Abdelmageed, A. Algergawy, S. Samuel, B. König-Ries, BiodivOnto: Towards a core ontology for biodiversity, in: European Semantic Web Conference (ESWC), Springer, 2021, pp. 3–8.
[22] F. Löffler, V. Wesp, B. König-Ries, F. Klan, Dataset search in biodiversity research: Do metadata in data repositories reflect scholarly information needs?, PLoS ONE 16 (2021) e0246099.
[23] C. S. Bhagavatula, T. Noraset, D. Downey, Methods for exploring and mining tables on Wikipedia, in: Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics, 2013, pp. 18–26.
[24] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, Y. Zhang, WebTables: exploring the power of tables on the web, Proceedings of the VLDB Endowment 1 (2008) 538–549.
[25] M. Hulsebos, Ç. Demiralp, P. Groth, GitTables for SemTab 2021 - CTA task, 2021. doi:10.5281/zenodo.5706316.
[26] W. Baazouzi, M. Kachroudi, S. Faiz, Kepler-aSI at SemTab 2021, in: SemTab@ISWC, 2021.
[27] B. Steenwinckel, F. De Turck, F. Ongenae, MAGIC: Mining an Augmented Graph using INK, starting from a CSV, in: SemTab@ISWC, 2021.
[28] R. Avogadro, M. Cremaschi, MantisTable V: A novel and efficient approach to Semantic Table Interpretation, in: SemTab@ISWC, 2021.
[29] N. Abdelmageed, S. Schindler, B. König-Ries, fusion-jena/BiodivTab, 2022. doi:10.5281/zenodo.6461556.
[30] N. Abdelmageed, S. Schindler, B. König-Ries, fusion-jena/biodivtab: Benchmark data and code, 2021. doi:10.5281/zenodo.5749340.