Transformer-based Subject Entity Detection in Wikipedia Listings Nicolas Heist1,∗ , Heiko Paulheim1 1 Data and Web Science Group, University of Mannheim, Germany Abstract In tasks like question answering or text summarisation, it is essential to have background knowledge about the relevant entities. The information about entities - and in particular, about long-tail or emerging entities - in publicly available knowledge graphs like DBpedia or CaLiGraph is far from complete. In this paper, we present an approach that exploits the semi-structured nature of listings (like enumerations and tables) to identify the main entities of the listing items (i.e., of entries and rows). These entities, which we call subject entities, can be used to increase the coverage of knowledge graphs. Our approach uses a transformer network to identify subject entities on token-level and surpasses an existing approach in terms of performance while being bound by fewer limitations. Due to a flexible input format, it is applicable to any kind of listing and is, unlike prior work, not dependent on entity boundaries as input. We demonstrate our approach by applying it to the complete Wikipedia corpus and extract 40 million mentions of subject entities with an estimated precision of 71% and recall of 77%. The results are incorporated in the most recent version of CaLiGraph. Keywords Subject Entity Detection, Named Entity Recognition, Wikipedia Listings, CaLiGraph 1. Introduction 1.1. Motivation Background knowledge provides an essential advantage in tasks like text summarisation or question answering. With ready-to-use entity linking tools like Falcon [1], entities in text can be identified and additional information can be drawn from background knowledge graphs (e.g. DBpedia [2] or CaLiGraph1 [3]). Of course, this is only possible if the necessary information about the entity is included in the knowledge graph [4]. Hence, it is important to equip knowledge graphs with as much entity knowledge as possible. While this is easily possible for prominent entities that are mentioned frequently, the retrieval of information about long-tail and emerging entities that are mentioned only very infrequently is tedious [5]. Still, approaches for automatic information extraction can be applied to increase ISWC 2022: Deep Learning for Knowledge Graphs, October 23–27, 2022, Virtual Conference ∗ Corresponding author. Envelope-Open nico@informatik.uni-mannheim.de (N. Heist); heiko@informatik.uni-mannheim.de (H. Paulheim) GLOBE http://www.uni-mannheim.de/dws/people/researchers/phd-students/nicolas-heist/ (N. Heist); http://www.heikopaulheim.com/ (H. Paulheim) Orcid 0000-0002-4354-9138 (N. Heist); 0000-0002-4354-9138 (H. Paulheim) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop CEUR Workshop Proceedings (CEUR-WS.org) Proceedings http://ceur-ws.org ISSN 1613-0073 1 http://caligraph.org Gilby Clarke Page Title --- -- ---- -- - ---- Discography Section -- ---- - --- - --- Albums with Guns N' Roses - The Spaghetti Incident? (1993) Listing 1 - Greatest Hits (1999) Albums with Nancy Sinatra - California Girl Listing 2 Solo albums Name Year -- Rubber 1998 --- Listing 3 Swag 2001 - ... Figure 1: Simplified view on the listings of the Wikipedia page of Gilby Clarke. the coverage of knowledge graphs to a certain extent. One strand of research is concerned with open information extraction systems that try to extract facts from web text (e.g. [6, 7]). While they perform strongly on well-known entities, the extraction quality for long-tail entities is considerably worse [6]. The extraction of information from semi-structured data is in general less error-prone and has already proven to yield high-quality results as, for example, DBpedia itself is extracted primarily from Wikipedia infoboxes; other approaches use the category system of Wikipedia [8, 9, 10]; many more approaches focus on tables (in Wikipedia or the web) as a semi-structured data source to extract entities and relations (see [11] for a comprehensive survey). In this work, we generalize over structures like enumerations (Listings 1 and 2) and tables (Listing 3 in Figure 1) by simply considering them as listings with listing items (i.e., enumeration entries or table rows). Further, we call the main entity, that a listing item is about, a subject entity (SE). In previous work, we defined SEs as all entities in a listing appearing as instances to a common concept [12]. In case of Figure 1, the SEs are the mentioned albums (e.g. The Spaghetti Incident? or California Girl). Here, the common concept is made explicit through the section labels above the listings (Albums with..), but it may as well be the case that it is only implicitly defined through the respective SEs. As a listing item typically mentions only one SE together with some context (in this case, the publication year of the album), we assume that at most one SE per listing item exists. In the English Wikipedia chapter alone, we find almost five million listings in roughly two million articles. From our estimation, about 80% of the listings are suitable for the extraction of SEs, bearing an immense potential for knowledge graph completion (for details, see Section 3.1). Upon extraction, they can easily be digested by downstream applications: Due to the semi- structured nature of listings, the quality of extraction is higher than extraction from plain text, and SEs are typically extracted in groups of instances sharing a common concept (as given by the definition above). Especially the latter point makes subsequent disambiguation step much easier, as the group of extracted instances provides context for every individual instance. Another example of the downstream use of SEs is a work of ours where we used groups of SEs to learn lexical patterns that entail axioms [12]. For example, if a listing is in a section that starts with Albums with, we learn that the SEs are of the type Album. The combination of these two ideas, i.e. of extracting novel SEs and learning defining axioms for them, can bring a big benefit. In Figure 1, instead of simply discovering California Girl as a new entity, we additionally assign the type Album. Thinking further, we can learn an axiom that all albums mentioned in the discography of Gilby Clarke are albums that are authored by him. The additional information can be used to refine the description of the extracted entity in the knowledge graph. 1.2. Problem Statement Given an arbitrary listing, we want to identify the SEs among all entities mentioned in the listing. In the literature, there are only very few approaches that deal with this problem. The most related approach is a previous work of the authors that is concerned with the detection of SEs in Wikipedia list pages [3].2 The approach uses a hand-crafted set of features to classify entities in tables or enumerations of list pages as SEs. However, the approach has several limitations: • It is only applicable to list pages and not to listings in any other context as the features are primarily designed for the list page context. • Dependencies between individual SEs of listing items are not taken into account as the classification is done separately for every item. • The approach needs mention boundaries of entities as input for the classification. Con- sequently, it cannot identify any new entities but only categorize existing entities into subject and non-subject entities. 1.3. Contributions To harness the information expressed through SEs in more general settings, we aim to over- come the previously mentioned limitations in this work. In particular, we make the following contributions: • We present a Transformer-based approach for SE detection with a flexible input format that allows us to apply it to any kind of listing. Further, the model takes dependencies between listing items into account (Section 4.1). • During prediction, the approach detects SEs end-to-end without relying on mention boundaries of the entities in the input sequence (Section 4.2). • We introduce a novel mechanism for generating negative samples of listings (Section 4.3) and a fine-tuning mechanism on noisy listing labels (Section 4.4) leading to more accurate prediction results. 2 List pages are special Wikipedia pages that contain only listings describing entities of a certain topic. • In our evaluation, we show that the performance of our approach is superior to previous work (Section 5.3); further, we analyse its performance in a more general scenario - that is, arbitrary listings of Wikipedia pages (Section 5.4). • We run the extraction of SEs on the complete Wikipedia corpus and incorporate the results in a new version of CaLiGraph (Section 5.6). The produced code is publicly available and part of the CaLiGraph extraction framework.3 2. Related Work With the presented approach we detect SEs end-to-end, directly from listing text. For a given listing, we identify mentions of named entities and decide at the same time whether they are SEs of a listing or not. In the following, we first review Named Entity Recognition (NER) and subsequently discuss approaches that detect SEs. 2.1. Named Entity Recognition NER is a subproblem of Entity Linking (EL) which only tries to identify mentions of named entities in the text without actually disambiguating them [13]. As opposed to general Entity Recognition, NER only deals with the identification of named entities and ignores the linking of concepts (also called Wikification) [14]. Early NER systems were based on hand-crafted rules and lexicons, followed by systems using feature-engineering and machine learning [15]. One of the first competitive NER systems that used neural networks has been presented by Collobert et al. in 2011 [16]. This eventually lead to more sophisticated architectures based on word embeddings and LSTMs (e.g. from Lample et al. [17]). With the rise of transformer networks [18] like BERT [19] in 2018, they also found their direct application in NER (e.g. by Liang et al. [20]), or as part of an end-to-end EL system like the one from Broscheit [21]. The latter uses a simple but effective prediction scheme, where entities are predicted at token-level and multiple subsequent tokens with the same predicted entity are collapsed into the actual entity prediction. In our work, we use a similar token-level prediction scheme to detect SEs. 2.2. Subject Entity Detection Although SE detection has not explicitly been addressed in the literature very frequently, there are some approaches that deal with related problems or subproblems of it. In table interpretation, an important task is the identification of the subject column, i.e. the column containing the entity with outgoing relations to all other columns. TAIPAN [22] is an approach that aims to recover the semantics of tables and names subject column identification as the first major task towards relation extraction in tables. To identify subject columns, they choose the columns having entities with the most outgoing edges to entities in other columns w.r.t. a background knowledge graph. While this is a viable approach for tables that are already annotated with 3 https://github.com/nheist/CaLiGraph entities, it is not broadly applicable to general listings that may not have many known (or even annotated) entities. Another related approach is from Zhao et al. [23] who deal with a problem which they call key entity detection. Primarily, they do sentiment analysis in financial texts and use the detection of key entities - which they define to be subjects of events related to financial information - in order to attribute the positive or negative sentiment to a concrete entity. Similar to our proposed approach, they use a Transformer to detect key entities. However, they only use it to select the key entities from a predefined set of entities and ignore the NER part. As mentioned in the introduction, the most closely related approach is the authors’ prior work [3]: using manually defined features and a binary XGBoost classifier, entities on list pages are classified into either subject entities or non-subject entities. For the page List of Japanese speculative fiction writers,4 for example, all entities in the enumerations that are Japanese speculative fiction writers are classified as SEs. More concretely, the approach uses page features (e.g. number of sections or tables on the page), positional features (e.g. indentation level of entry in the enumeration), and linguistic features (e.g. whether the column header is synonymous with the list page title). Overall, SEs are extracted with a precision of 90% and a recall of 67%. The classifier is trained and evaluated with a set of list pages that are annotated through distant supervision with DBpedia for background knowledge. This part is discussed in detail in Section 3.2 as the approach presented here relies on this training data generation strategy as well. 3. Preliminaries 3.1. Listings in Wikipedia Overall, the English Wikipedia has more than five million articles. Roughly two million of them contain at least one listing in the form of an enumeration or a table. All over these pages, we find 3.5 million enumerations and 1.4 million tables.5 The roughly 90K list pages in Wikipedia contain the most structured and easily exploitable form of listings. Here, listings are almost exclusively used to list a number of entities that have some common property (e.g. all Japanese speculative fiction writers). Listings that appear on other Wikipedia pages are used for this purpose as well but not exclusively, which makes the detection of SEs much more complex. From the inspection of a sample of Wikipedia listings, we estimate that approximately 85% of enumerations and 67% of tables are usable for our approach. Especially enumerations are often used to simply structure content (e.g. to list the individual episodes in a biography). But even if listings are used to describe entities, they may not be usable due to various reasons: • Entity description without explicit mention (example in Figure 2a) • Description of the properties of a single entity (example in Figure 2b) • Listing items contain groups of entities (example in Figure 2c) 4 https://en.wikipedia.org/wiki/List_of_Japanese_speculative_fiction_writers 5 These numbers exclude very small listings with less than three items, which we do not consider. (a) Listing containing no explicit mention of the entities (Source: https://en.wikipedia.org/wiki/Sunrisers_ Hyderabad_in_2018) (b) Listing describing the properties of an entity (Source: https://en.wikipedia.org/wiki/Dynamic_HTML) (c) Listing containing groups of entities (Source: https://en.wikipedia.org/wiki/Ibiza_(Vino_de_la_Tierra)) Figure 2: Examples of Wikipedia page listings with layout or content that is challenging for SE detection. Especially the first point renders a big portion of tables useless for our approach as an entity is implicitly described through entities and literals mentioned in multiple table columns (e.g. a sports match is described through date, player, opponent, and result). 3.2. Distantly-Supervised Training Data Generation for List Pages In our experiments, we will use the training data generation strategy that we introduced in previous work [3]; to make this paper self-contained, we will give an overview of this strategy here. The strategy is based on the observation that DBpedia classes, Wikipedia categories, and Wikipedia list pages can be transformed into an immense taxonomy through linguistic and statistical methods. For example, the taxonomy contains the hierarchy Person > Writer > Speculative fiction writer > Japanese speculative fiction writer. The first two elements originate from DBpedia classes, the third from a category, and the last from a list page. As a consequence, we can use this hierarchy to infer the DBpedia classes of SEs for many list pages. To label the list page List of Japanese speculative fiction writers, we assign every entity with the DBpedia class Writer a positive label and every entity with a class that is disjoint with Writer a negative label. Then we include all listing items into our training set that either have an entity with a positive label or only entities with negative labels. Other listing items are ignored as we cannot be certain that they may contain SEs which we could not identify due to the incompleteness of DBpedia. The knowledge graph CaLiGraph [9, 3] uses this extended taxonomy of DBpedia classes, categories, and list pages as a type hierarchy, and enriches the original DBpedia instances with additional, more fine-grained types. Furthermore, CaLiGraph contains a higher number of instances than DBpedia as it additionally contains the extracted SEs from list pages. 3.3. Transformers for Token Classification Pre-trained transformer networks [18] like BERT [19] or DistilBERT [24] produced new state-of- the-art results for various NLP tasks including NER and question answering. To a large extent, their ubiquitous application is due to the fact that only a comparably small amount of fine-tuning is necessary to fit them to various tasks. BERT, for instance, consists of 12 multi-head attention layers followed by a simple linear layer as classification head. To apply a transformer model to a token classification problem, it is oftentimes sufficient to fine-tune the final classification head. The input for a transformer model can consist of plain text and needs to be tokenized before it can be processed. Every word in the input sequence is transformed into one or more tokens (if the word is not contained in the vocabulary, multiple word-piece tokens are used). Further, the input sequence has to contain special tokens that indicate, for example, the start and the end of the sequence. Using BERT for token classification, the input sequence has a fixed length of 512 tokens, has to start with a [CLS] token and end with a [SEP] token. Additional special tokens may be introduced to provide more context information to the model. 4. Subject Entity Detection with Transformers To detect SEs in listings, we phrase the problem as a token classification problem where we, similar to the work of Broscheit [21], produce a label for every token of the input sequence. In a subsequent step, we aggregate the token labels to predictions of SE mentions. We use 13 different token labels, such as Person or Organisation, to identify SEs and additionally make a prediction of their types (refer to Table 5 for the full list of labels). In Section 4.1 we explain how to create input sequences that preserve the context and the structure of a listing. In Section 4.2 we show our choice of labels for SE prediction, and in Section 4.3 we introduce a mechanism to generate negative samples of listings. Finally, Section 4.4 explains how to use noisy SE labels on page listings for further fine-tuning of our models. 4.1. Token-level Subject Entity Detection To pass a listing for SE detection to the transformer model, we use multiple special tokens in order to encode context information (page, section, potential table header) and structural information (entries, rows, columns) of the listing into the input sequence. Every sequence consists of the listing context, followed by the special token indicating the end of context [CXE], and one or more listing items: [CLS] [CXE] [SEP] We use the special token [CXS] to separate context elements. Within listing items, table rows and columns are indicated with [ROW] and [COL], respectively. For enumerations, we use the tokens [E1] to [En] to indicate the start of an entry with the indentation level 1 to n. Ignoring that some words may be split into multiple tokens, the input for the first listing item of Listing 1 in Figure 1 looks as follows: [CLS] Gilby Clarke [CXS] Discography [CXS] Albums with Guns N’ Roses [CXE] [E1] The Spaghetti Incident? (1993) [SEP] We want the model to take dependencies between listing entities into account. For example, if the SE in the first listing item is mentioned right in the beginning, it is very likely that this is the case for the remaining listing items as well. Instead of only providing one listing item per input sequence, we can provide as many as the input sequence length permits. Through the attention layers within the Transformer architecture, the model is able to take these dependencies within the input sequence into account. Hence, we put Listing 1 into one input sequence: [CLS] Gilby Clarke [CXS] Discography [CXS] Albums with Guns N’ Roses [CXE] [E1] The Spaghetti Incident? (1993) [E1] Greatest Hits (1999) [SEP] Likewise, we encode Listing 3 as one input sequence: [CLS] Gilby Clarke [CXS] Discography [CXS] Solo albums [CXS] [ROW] Name [COL] Year [CXE] [ROW] Rubber [COL] 1998 [ROW] Swag [COL] 2001 [SEP] If the listing is too long to fit into one input sequence, we split the listing items into chunks and process them one after another. Each chunk is augmented with the same context information and a different set of listing items. Depending on the length of listing items, it is possible to fit 20 or more items into one input sequence. In our ablation study in Section 5.5 we show that this item chunking strategy has a strongly positive effect on the recall of the model. But apart from that we immensely reduce the run time of the model for training and prediction. The number of processed input sequences is reduced by a factor that is roughly equivalent to the median number of items per listing.6 6 We deliberately use the median and not the average of items per listing as large listings will be split into multiple input sequences due to the size limitation. 4.2. Coarse-grained Entity Type Prediction The most common notation to tag tokens in NER is the BIO notation (Begin, Inside, and Outside of an entity) together with an entity type (e.g. Person or Organisation). We decided not to use the BIO notation as, per definition, there is at most one SE per listing item. Instead of making the task even simpler and getting rid of the entity type prediction in favor of a simple binary SE prediction task as well, we decided to stick with the coarse-grained entity type prediction. This has the advantage that the entity types can be used as additional information in downstream tasks - most importantly in a subsequent entity disambiguation step. In addition to that, we show in our ablation study in Section 5.5 that the more difficult task of entity type prediction even slightly increases the precision of the model. Context and special tokens are annotated with the IGNORE label to indicate the model that we need no prediction for these tokens. SEs are annotated with the respective entity type, everything else is annotated with NONE. Again ignoring word-piece tokenization, the labels for Listing 1 of Figure 1 look as follows: IGNORE IGNORE IGNORE IGNORE IGNORE IGNORE IGNORE IGNORE IGNORE IGNORE IGNORE IGNORE IGNORE WORK_OF_ART WORK_OF_ART WORK_OF_ART NONE IGNORE WORK_OF_ART WORK_OF_ART NONE IGNORE 4.3. Negative Sampling through Shuffled Listings It is difficult to find negative examples of complete listings if the training data is generated heuristically and with distant supervision as described in Section 3.2. Positives can be found easily (i.e., there is an entity in the listing item that has the correct type), but the inverse does not always hold. If we do not find a positive, this may mean that the listing item does not contain one, but it is as well possible that the annotation is missing. From a logical standpoint, it is even unlikely that some items in a listing contain SEs while others do not. To mitigate this problem, we equipped our approach with a sampling mechanism for negatives that randomly assembles them from the contexts and items of all positives in the training set. If the context and items are assembled randomly, the differences between the individual items (and the difference in the context) should be higher than in a real listing. The intention of this mechanism is that the model learns to identify the coherence between SEs of listing items as well as between items and the context. For enumeration listings, the mechanism is simple as we pick the context from one listing and a random number of items (between three and the maximum number of items per chunk) from other listings. For table listings, we have to take care that the number of columns of an assembled listing is consistent. Hence, the positives from the training set are divided into groups of the same column size and listings are only assembled from within a single group. A negative example produced from four different listings could look as follows: [CLS] Gilby Clarke [CXS] Discography [CXS] Albums with Guns N’ Roses [CXE] [E1] James Stewart as Billy Jim Hawkins [E1] Curzon Mill Company, part of Ashton syndicate. [E1] Brepholoxa Van Duzee, 1904 [SEP] The mechanism has exactly one hyper-parameter which is the proportion of negative listings to generate. We experiment with values between 0.0 (no negative samples at all) and 1.0 (as many negatives as we have positives). 4.4. Fine-Tuning on Noisy Page Labels The training data generation strategy described in Section 3.2 lets us create labels for listings of list pages that we use for the initial training of our models. To train a model that works well on listings of any pages, additional training data of listings that are not on list pages may be beneficial (the differences in listings have been described in Section 3.1). We gather this data by first training a model using the heuristically labelled list pages. We apply the model to listings of all pages for noisy labels of SEs. We then filter them by discarding any listings where multiple types of SEs have been predicted (e.g., if the first SE of a listing is labelled as PERSON and the second is labelled as WORK_OF_ART ). 5. Experiments The goal of our experiments is to compare the performance of our approach against previous work on SE detection in list pages (Section 5.3) and evaluate its performance in the more general setting of Wikipedia page listings (Section 5.4). Further, we analyze some of our design choices in an ablation study (Section 5.5). Finally, we apply our best model to the complete Wikipedia corpus and report our extraction results (Section 5.6). 5.1. Metrics For the evaluation of our SE detection models, we stick to the common metrics for NER introduced in SemEval-2013 [25]. We report precision, recall, and F1-scores of the following scenarios: • Partial: Prediction matches the boundary of the true entity at least partially. • Exact: Prediction exactly matches the boundary of the true entity. • Ent-Type: At least partial boundary match and entity type matches. • Strict: Predicted boundary and type exactly match with the true entity. 5.2. Datasets In the experiments, we primarily focus on Wikipedia as a data corpus due to its encyclopedic structure and the convenient mapping of entities to DBpedia and CaLiGraph. From the main dataset 𝐷 which consists of all Wikipedia pages that contain listings, we create the subsets D-LPtrain and D-LPtest (from list pages) as well as D-Ptrain and D-Ptest (from any pages with listings). The statistics of the datasets are shown in Table 1. For the experiments, we use a dump of the English Wikipedia from October 2020 to be compatible with the latest release of CaLiGraph. Table 1 Statistics of the datasets used for the experiments. The complete corpus 𝐷 contains all Wikipedia pages that have listings. D-LPtrain and D-LPtest are extracted from all Wikipedia list pages and are labelled through distant supervision; D-Ptrain contains listings from arbitrary pages and contains noisy labels from a model trained on list pages while D-Ptest is annotated manually. Dataset #Pages #Listings Items per Listing (Avg.) Items per Listing (Med.) Enums Tables Enums Tables Enums Tables 𝐷 1,980,021 3,463,053 1,352,848 10.57 14.43 6 8 D-LPtrain 68,494 289,666 116,715 18.06 31.26 8 12 D-LPtest 17,123 75,063 28,688 18.17 31.32 8 12 D-Ptrain 546,667 663,455 306,399 18.72 24.53 12 13 D-Ptest 502 763 265 8.42 11.25 6 7 Table 2 Evaluation results for SE detection on Wikipedia list pages (evaluating on D-LPtest ). Precision, recall and F1-score (in %) are given for the Exact scenario. 𝑂𝑢𝑟𝑠𝐿𝑃 is the best configuration for D-LPtest while 𝑂𝑢𝑟𝑠𝑃 is the best configuration for D-Ptest using D-LPtrain as training data. Approach Enums Tables Overall P R F1 P R F1 P R F1 Heist and Paulheim [3] 91 82 86 90 55 68 90 67 77 𝑂𝑢𝑟𝑠𝐿𝑃 93 94 94 89 87 88 92 91 92 𝑂𝑢𝑟𝑠𝑃 92 93 93 88 86 87 91 90 91 The datasets D-LPtrain and D-LPtest are created as explained in Section 3.2. For the experi- ments, we use a part of D-LPtrain for validation so that we have a distribution of 60% training, 20% validation, and 20% test set (similar to [3]). The datasets D-Ptrain and D-Ptest consist of listings from arbitrary Wikipedia pages. Hence, no type information is available to infer the SE labels through distant supervision. For D-Ptrain , we retrieved the labels as described in Section 4.4. For D-Ptest , we provided the type information by manually annotating the roughly 1K listings with coarse-grained entity types (e.g. Person or Organisation). We mapped these types to their DBpedia counterparts and used this information to infer the SE labels via distant supervision. This substantially reduced the annotation effort from labelling roughly 10K listing items with concrete SE labels to labelling 1K listings with coarse-grained types. This implies that this dataset is also, in part, heuristically created and the results have to be taken with a grain of salt. 5.3. Evaluation on Wikipedia List Pages The evaluation results for experiments on the dataset D-LPtest are given in Table 2. We compare the approach Heist and Paulheim [3] with our model in the two configurations 𝑂𝑢𝑟𝑠𝐿𝑃 7 and 7 Configuration: Model roberta-base trained for 3 epochs with batch size 64, learning rate 5e-5, no warmup or weight decay, negative sample size 0.5 Table 3 Precision, recall and F1-score (in %) for SE detection on Wikipedia page listings (evaluating on D-Ptest ) using our best model configuration 𝑂𝑢𝑟𝑠𝑓 𝑖𝑛𝑎𝑙 . Metric Enums Tables Overall P R F1 P R F1 P R F1 Partial 76 78 77 68 82 74 73 79 76 Exact 73 76 75 67 81 73 71 77 74 Ent-Type 76 78 77 65 78 71 73 78 75 Strict 71 74 73 64 77 70 69 75 72 Baseline 23 85 36 21 90 34 23 86 36 𝑂𝑢𝑟𝑠𝑃 .8 Both configurations are trained with training part of D-LPtrain and tuned using the evaluation part. The former model configuration is the best one w.r.t. performance on D-LPtest . The latter model configuration is the best one w.r.t. performance on D-Ptest . To train our models, we use the Huggingface transformer library [26]. Both of our model configurations significantly outperform the existing approach Heist and Paulheim [3], especially in terms of recall for both enumerations and tables, showing that our model can identify substantially more entities while keeping a high level of precision. For enumerations, the precision increased slightly and the recall is over ten percent higher. While precision is kept almost constant for tables, the recall increased by more than 30 percent. 5.4. Evaluation on Wikipedia Page Listings The evaluation results for the model 𝑂𝑢𝑟𝑠𝑓 𝑖𝑛𝑎𝑙 9 on D-Ptest is given in Table 3. Comparing the Exact scenario with the results on Wikipedia list pages, it becomes clear that the performance on arbitrary listings is worse. The losses in performance for tables are slightly higher than those for enumerations. This aligns with the observation that a lower portion of tables is usable for our approach. For tables, we have the advantage that mention boundaries are often indicated through column separators but this does not reflect in the results. In general, we notice that training the models for more than two to three epochs on D-LPtrain leads to overfitting on list page data and hence reduced performance on D-Ptest . Unfortunately, it is not possible to apply the approach Heist and Paulheim [3] to this dataset as it contains several features that are specific to list pages. As an alternative, we implemented the pick-first-entity baseline which has already proven as a strong baseline in prior work [3]. In this baseline, we simply label the first mentioned entity in an item as SE. In Table 3 we see that this baseline has a very high recall (as most SEs are mentioned in the beginning) while the precision is far lower than the one of 𝑂𝑢𝑟𝑠𝑓 𝑖𝑛𝑎𝑙 . This shows that the model is able to sort out many false positives (tripling precision) by sacrificing only some correct SEs. In cases where coverage is not the only important criterion (as is usually the case in knowledge graph completion), our model should be preferred. The more important point, however, is that our 8 Configuration: Model roberta-base trained for 2 epochs with batch size 64, learning rate 5e-5, no warmup or weight decay, negative sample size 0.3 9 Configuration: Similar to 𝑂𝑢𝑟𝑠𝑃 with an additional fine-tuning step of one epoch on D-𝑃𝑡𝑟𝑎𝑖𝑛 . Table 4 Evaluation results for SE detection on Wikipedia page listings (evaluated on D-Ptest ) for variations of our best model configuration 𝑂𝑢𝑟𝑠𝑓 𝑖𝑛𝑎𝑙 . Precision, recall and F1-score (in %) are given for the Exact scenario. Approach Enums Tables Overall P R F1 P R F1 P R F1 𝑂𝑢𝑟𝑠𝑓 𝑖𝑛𝑎𝑙 73 76 75 67 81 73 71 77 74 .. without item chunks 70 35 47 63 40 49 68 37 48 .. without type prediction 69 78 73 54 84 66 64 79 71 .. without negative sampling 71 74 73 66 81 73 70 76 73 .. without fine-tuning on pages 65 48 55 67 64 66 66 52 58 model does not depend on mention boundaries as input (which might also account for some loss in performance). 5.5. Ablation Study To verify some assumptions that we made during the design of the SE detection approach, we perform an ablation study using the page listings dataset D-Ptest . Firstly, we investigate how much chunking of items in input sequences influences the performance of the model. The results in Table 4 show that it has a slightly positive effect on precision (3% for enumerations, 4% for tables) and a roughly doubling effect on recall. The results confirm our assumption that the model is able to improve its predictions by considering the dependencies between the listing items. Further, we investigate whether the additional prediction of entity types has an influence on the performance (as opposed to a binary prediction of SEs). The results show that there is a positive effect on precision and a slightly negative effect on recall. As the F1 measure increases slightly and as the predicted types provide additional information for downstream tasks, we stick with type prediction instead of binary SE prediction. Additionally, we see from Table 4 that our negative sampling mechanism slightly increases the precision and recall of our final model. Consequently, the model seems to be able to learn whether there is some consistency between the listing items in the input sequence. Finally, the fine-tuning on pages has a very strong effect on recall as it comes with an increase of 25% and the precision of the model is also increased by 5%. This result confirms that additional fine-tuning on noisy labels still yields a huge benefit. 5.6. Subject Entity Extraction over Wikipedia Applying the model 𝑂𝑢𝑟𝑠𝑓 𝑖𝑛𝑎𝑙 to the complete dataset of Wikipedia listings 𝐷 took 13 hours on a single NVIDIA RTX A6000 GPU with 48GB of RAM. We extracted a total of 40 million entity mentions from 2.7M enumerations and 1M tables on 1.7 million pages. Of the 40 million entity mentions, 19.5 million can be traced back to 3.8 million known entities (i.e., the predicted mention boundary overlapped with an existing link in Wikipedia, and hence, CaLiGraph), which means that each known entity has on average 5.1 mentions. If we use that same factor of 5.1 to Table 5 Number of extracted mentions of subject entities for the whole Wikipedia dataset of listings 𝐷 aggregated by entity type. Entity Type #Mentions Entity Type #Mentions Entity Type #Mentions PERSON 13,622,704 GPE 1,519,747 NORP 230,707 OTHER 9,398,003 PRODUCT 1,000,117 LANGUAGE 86,354 WORK_OF_ART 7,148,235 SPECIES 964,922 LAW 11,490 ORG 2,916,528 FAC 893,226 LOC 1,531,452 EVENT 370,440 Total 39,693,925 estimate the number of entities for the remaining 20.5 million entity mentions, they describe 4 million additional unknown entities that could be added to the knowledge graph. In Table 5 we display the number of extracted entity mentions aggregated by entity type. Unsurprisingly, the most frequently extracted entities are of the types Person, Work of Art, Organisation, and Location. Apart from that, the mention type distribution roughly resembles the distribution of entities in DBpedia [27]. 6. Conclusion In this work, we have presented a Transformer-based SE detection approach that overcomes several limitations of prior work to make it applicable to more general settings and, at the same time, improve extraction performance. An evaluation of listings of Wikipedia pages shows that the performance for such a more general setting is considerably worse than for the scenario of Wikipedia list pages. While the inferior results can partly be attributed to conceptual limitations of SE detection in arbitrary listings (c.f. Section 3.1), further improvement is necessary so that the results can be consumed by downstream applications without extensive post-filtering. We are developing a post-filtering mechanism that takes the differences within an extracted group of SEs into account. For example, we can discard a group of extracted SEs if their predicted entity types show a high degree of diversion. In the extraction framework of CaLiGraph, we will integrate a subsequent entity disambigua- tion step, which matches the identified SE mentions with existing entities or creates new entities in the knowledge graph. The main challenge will be to match SEs with existing entities and at the same time match SEs with one another (as the same entity may be occurring in multiple listings). Complementary to the disambiguation step, we plan to further enhance CaLiGraph by using the defining axioms extracted from the listing context. The disambiguation step can be supported by the information extracted from the axioms, and similarly, the disambiguated entities can help to refine the axiom extraction. References [1] A. Sakor, K. Singh, A. Patel, M.-E. Vidal, Falcon 2.0: An entity and relation linking tool over Wikidata, in: 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 3141–3148. [2] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. Van Kleef, S. Auer, et al., Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia, Semantic web 6 (2015) 167–195. [3] N. Heist, H. Paulheim, Entity extraction from Wikipedia list pages, in: European Semantic Web Conference, Springer, 2020, pp. 327–342. [4] M. Van Erp, P. Mendes, H. Paulheim, F. Ilievski, J. Plu, G. Rizzo, J. Waitelonis, Evaluating entity linking: An analysis of current benchmark datasets and a roadmap for doing a better job, in: Tenth International Conference on Language Resources and Evaluation (LREC’16), 2016, pp. 4373–4379. [5] M. Färber, A. Rettinger, B. El Asmar, On emerging entity detection, in: European Knowl- edge Acquisition Workshop, Springer, 2016, pp. 223–238. [6] G. Liu, X. Li, J. Wang, M. Sun, P. Li, Extracting knowledge from web text with monte carlo tree search, in: The Web Conference 2020, 2020, pp. 2585–2591. [7] G. Stanovsky, J. Michael, L. Zettlemoyer, I. Dagan, Supervised open information extraction, in: 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 885–895. [8] F. M. Suchanek, G. Kasneci, G. Weikum, YAGO: a core of semantic knowledge, in: The World Wide Web Conference, 2007, pp. 697–706. [9] N. Heist, H. Paulheim, Uncovering the semantics of Wikipedia categories, in: International Semantic Web Conference, Springer, 2019, pp. 219–236. [10] B. Xu, C. Xie, Y. Zhang, Y. Xiao, H. Wang, W. Wang, Learning defining features for categories, in: 25th International Joint Conference on Artificial Intelligence, 2016, pp. 3924–3930. [11] S. Zhang, K. Balog, Web table extraction, retrieval, and augmentation: A survey, ACM Transactions on Intelligent Systems and Technology (TIST) 11 (2020) 1–35. [12] N. Heist, H. Paulheim, Information extraction from co-occurring similar entities, in: The Web Conference 2021, 2021, pp. 3999–4009. [13] X. Ling, S. Singh, D. S. Weld, Design challenges for entity linking, Transactions of the ACL 3 (2015) 315–328. [14] D. Milne, I. H. Witten, Learning to link with Wikipedia, in: 17th ACM conference on Information and knowledge management, 2008, pp. 509–518. [15] D. Nadeau, S. Sekine, A survey of named entity recognition and classification, Lingvisticae Investigationes 30 (2007) 3–26. [16] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch, Journal of machine learning research 12 (2011) 2493–2537. [17] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural architectures for named entity recognition, in: 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 260–270. [18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polo- sukhin, Attention is all you need, Advances in neural information processing systems 30 (2017). [19] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186. [20] C. Liang, Y. Yu, H. Jiang, S. Er, R. Wang, T. Zhao, C. Zhang, BOND: BERT-assisted open-domain named entity recognition with distant supervision, in: 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1054–1064. [21] S. Broscheit, Investigating entity knowledge in BERT with simple neural end-to-end entity linking, in: 23rd Conference on Computational Natural Language Learning (CoNLL), 2019, pp. 677–685. [22] I. Ermilov, A.-C. N. Ngomo, TAIPAN: automatic property mapping for tabular data, in: European Knowledge Acquisition Workshop, Springer, 2016, pp. 163–179. [23] L. Zhao, L. Li, X. Zheng, J. Zhang, A BERT based sentiment analysis and key entity detection approach for online financial texts, in: 2021 IEEE 24th International Conference on Computer Supported Cooperative Work in Design (CSCWD), IEEE, 2021, pp. 1233–1238. [24] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019). [25] I. Segura-Bedmar, P. Martínez Fernández, M. Herrero Zazo, SemEval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (ddiextraction 2013), ACL, 2013. [26] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., Transformers: State-of-the-art natural language processing, in: 2020 conference on empirical methods in natural language processing: system demonstrations, 2020, pp. 38–45. [27] N. Heist, S. Hertling, D. Ringler, H. Paulheim, Knowledge graphs on the web–an overview, Knowledge Graphs for eXplainable Artificial Intelligence: Foundations, Applications and Challenges (2020) 3–22.