<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">WikiMed-DE: Constructing a Silver-Standard Dataset for German Biomedical Entity Linking using Wikipedia and Wikidata</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Yi</forename><surname>Wang</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Stuttgart</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Corina</forename><surname>Dima</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Stuttgart</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Steffen</forename><surname>Staab</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Stuttgart</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Southampton</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">WikiMed-DE: Constructing a Silver-Standard Dataset for German Biomedical Entity Linking using Wikipedia and Wikidata</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">22839CACE575B50CB7264DA41FCA8D34</idno>
					<idno type="DOI">10.5281/zenodo.8188966</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:20+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper introduces WikiMed-DE, a silver-standard, automatically annotated biomedical entity linking dataset for the German language. WikiMed-DE encompasses a substantial collection of 53,981 articles from the German Wikipedia annotated with 1,951,081 mentions corresponding to 317,010 unique mention URLs. The hyperlinks of Wikipedia articles are used to connect concept mentions to Wikidata and transitively to three biomedical concept IDs: the Concept Unique Identifier from the Unified Medical Language System, the MeSH ID from the Medical Subject Headings hierarchy, and the DOID from the Disease Ontology. A curated subset, WikiMed-DE-BEL, is released as a ready-to-use benchmark for biomedical entity linking in German. It features the same number of articles as WikiMed-DE, but only the highest-quality information is retained: 413,913 mentions corresponding to 35,012 unique concepts.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Biomedical entity linking (BEL) is an important task for automatically processing text from the medical domain. It enables the disambiguation of entities in text to unique identifiers in ontologies like the Unified Medical Language System (UMLS) <ref type="bibr" target="#b1">[1]</ref>. In the example "MRI technology has revolutionized medical imaging", the term MRI could stand for Magnetic Resonance Imaging (C0024485), Multidrug Resistance Induction (C1513738), or Most Recent Inpatient (C1546460). The goal of biomedical entity linking is to identify that the text span MRI in the example should be mapped to the concept Magnetic Resonance Imaging, which has the identifier C0024485 in the UMLS. In this paper, we refer to the text span MRI as a mention of the concept Magnetic Resonance Imaging, and to the sentence in which this text span appears as the context of the mention.</p><p>Many datasets have been created in order to foster the development of reliable BEL systems, e.g. the NCBI Disease dataset <ref type="bibr" target="#b2">[2]</ref>, which annotates mentions of diseases in PubMed abstracts; MedMentions <ref type="bibr" target="#b3">[3]</ref>, a collection of PubMed abstracts annotated with concepts from the UMLS; BC5CDR <ref type="bibr" target="#b4">[4]</ref>, a dataset that focuses on extracting and linking chemical compounds and diseases from PubMed articles, where every entity was manually annotated by a team of Medical Subject Headings (MeSH)<ref type="foot" target="#foot_0">3</ref> indexers; COMETA <ref type="bibr">[5]</ref>, a dataset consisting of 20,000 biomedical entity mentions from Reddit, annotated by experts with links to SNOMED CT <ref type="foot" target="#foot_1">4</ref>; and RegEl <ref type="bibr" target="#b6">[6]</ref>, which maps manually annotated regulatory DNA elements in PubMed abstracts to various ontologies, e.g. tissue entities are mapped to the Brenda Tissue Ontology (BTO) <ref type="bibr" target="#b7">[7]</ref>, while diseases are mapped to MONDO <ref type="bibr" target="#b8">[8]</ref>.</p><p>Wikidata'23: Wikidata workshop at ISWC 2023. yi.wang@ipvs.uni-stuttgart.de (Y. Wang); corina.dima@ipvs.uni-stuttgart.de (C. Dima); steffen.staab@ipvs.uni-stuttgart.de; s.r.staab@soton.ac.uk (S. Staab)</p><p>These datasets were manually annotated by domain experts and focus on entities of interest from various domains, e.g. diseases, genes, tissues, and chemical compounds. While many of these datasets are not very large, they provide invaluable information for training machine learning models for the automatic disambiguation of biomedical entities.</p><p>However, the vast majority of biomedical datasets provide annotations for English texts. For other languages, like German, datasets annotated with biomedical concepts are extremely scarce due to the resource-intensive nature of the manual creation process, which requires trained professionals to perform the annotation.</p><p>This paper addresses this issue by introducing WikiMed-DE, a silver-standard dataset for biomedical entity linking for the German language. 
The automatic annotation process makes use of the links connecting the text of the German Wikipedia<ref type="foot" target="#foot_2">5</ref> pages with the structured information available in the Wikidata <ref type="bibr" target="#b9">[9]</ref> knowledge base and in three knowledge sources from the biomedical domain: the Unified Medical Language System (UMLS) <ref type="bibr" target="#b1">[1]</ref>, the Medical Subject Headings (MeSH) hierarchy <ref type="bibr" target="#b10">[10]</ref> and the Disease Ontology (DO) <ref type="bibr" target="#b11">[11]</ref>.</p><p>Our contributions are the following:</p><p>1. We build upon and extend the procedure introduced by Vashishth et al. <ref type="bibr" target="#b12">[12]</ref> for creating the English WikiMed dataset and construct a German dataset for biomedical entity linking called WikiMed-DE, which we make publicly available <ref type="foot" target="#foot_3">6</ref>; WikiMed-DE is annotated with a wide range of concepts from the UMLS; a subset of this dataset, WikiMed-DE-BEL, can be readily used for training broad-coverage biomedical entity linking systems for German.</p><p>2. We provide an extensive description of the steps and resources required to create the dataset, as well as a public code repository <ref type="foot" target="#foot_4">7</ref>; this makes it straightforward to apply the procedure for creating similar datasets for other languages, or for updating the dataset once new information is available in Wikipedia and Wikidata.</p><p>The remainder of this article is organized as follows: Section 2 provides an overview of the related work, Section 3 introduces the knowledge sources used to create WikiMed-DE, Section 4 describes the methodology used to construct WikiMed-DE, Section 5 presents the statistics of the dataset and assesses its quality, and Section 6 discusses the limitations of WikiMed-DE and concludes the paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Datasets in the BEL domain typically focus on annotating biomedical entities from existing ontologies. A case in point is the BC6BioID <ref type="bibr" target="#b13">[13]</ref> dataset published for the BioCreative challenges, which focuses on identifying genes or chemicals in English text and maps them to ontologies like the Gene Ontology <ref type="foot" target="#foot_5">8</ref>. It consists of 17,883 documents and 133,033 mentions referring to 7,652 unique concepts.</p><p>Medical Concept Normalization (MCN) <ref type="bibr" target="#b14">[14]</ref> is a dataset that focuses primarily on annotating entities of clinical utility, such as disorders, problems, tests, and treatments in discharge summaries written in English. These entities are mapped to widely adopted medical terminologies, such as the UMLS and the International Classification of Diseases (ICD) <ref type="foot" target="#foot_6">9</ref>. MCN consists of 100 discharge summaries and provides normalization for a total of 10,919 concept mentions corresponding to 3,792 unique concepts.</p><p>The NCBI disease corpus <ref type="bibr" target="#b2">[2]</ref> provides annotations of disease mentions along with their corresponding concepts, which are represented using either MeSH or Online Mendelian Inheritance in Man (OMIM) <ref type="foot" target="#foot_7">10</ref> identifiers. The corpus contains 6,892 disease mentions mapped to 790 unique concepts for a collection of 793 PubMed abstracts written in English.</p><p>MedMentions <ref type="bibr" target="#b3">[3]</ref> is a biomedical dataset containing 4,392 PubMed abstracts annotated with 203,282 mentions. Mentions are linked to UMLS concepts as well as to UMLS semantic types.</p><p>Vashishth et al. <ref type="bibr" target="#b12">[12]</ref> introduce the WikiMed and PubMedDS datasets to facilitate research in biomedical natural language processing. 
In constructing the WikiMed dataset, they select the English Wikipedia as a comprehensive source of articles, while Wikidata<ref type="foot" target="#foot_8">11</ref> <ref type="bibr" target="#b9">[9]</ref> and Freebase <ref type="bibr" target="#b15">[15]</ref> are used to establish mappings between the Wikipedia articles and UMLS concepts.</p><p>The BRONCO <ref type="bibr" target="#b16">[16]</ref> dataset is a valuable German-language resource for BEL and healthcare research. It consists of 200 manually de-identified discharge summaries of cancer patients. The discharge summaries were meticulously annotated with various medical terminologies, including diagnoses, treatments, and medications, and further mapped to the German Modification of the International Classification of Diseases (ICD-10-GM) <ref type="foot" target="#foot_9">12</ref>. Because of its limited size, the BRONCO dataset is mainly used for the evaluation of biomedical named entity recognition/entity linking models (e.g. <ref type="bibr" target="#b17">[17]</ref>).</p><p>In an effort to support biomedical entity linking in languages other than English, Liu et al. <ref type="bibr" target="#b18">[18]</ref> propose a cross-lingual biomedical entity linking evaluation benchmark, XL-BEL, for evaluating BEL in 10 typologically diverse languages, including German. However, while they make use of the WikiMed process introduced by Vashishth et al. <ref type="bibr" target="#b12">[12]</ref> for creating this resource, the benchmark contains only 1,000 annotated sentences for each language. While this number of annotations is adequate for evaluation purposes, it does not suffice for training a good-quality biomedical entity linker.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Structured Knowledge Repositories</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Wikidata</head><p>Wikidata <ref type="bibr" target="#b9">[9]</ref> is a collaborative knowledge base that serves as a data source for numerous projects in the Wikimedia sphere <ref type="bibr" target="#b19">[19]</ref>, such as Wikipedia. The main objective of Wikidata is to ensure consistent and high-quality data across the multiple language versions of Wikipedia. At the time of writing, Wikidata contains more than 100M items 13. Wikidata has garnered significant attention from researchers across diverse fields of study <ref type="bibr" target="#b20">[20]</ref>. For example, in the biomedical field, Mitraka et al. <ref type="bibr" target="#b21">[21]</ref> used Wikidata as a knowledge base to collect biomedical concepts such as NCBI Gene and map them to Wikipedia articles.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Unified Medical Language System (UMLS)</head><p>The Unified Medical Language System (UMLS) <ref type="bibr" target="#b1">[1]</ref> is a repository of biomedical vocabularies developed by the US National Library of Medicine. The main component of UMLS is the Metathesaurus, which integrates more than 100 vocabularies from different subdomains, some in several languages. These include the NCBI taxonomy <ref type="bibr" target="#b22">[22]</ref>, the MeSH hierarchy <ref type="bibr" target="#b10">[10]</ref> in multiple languages, the Gene Ontology, the International Classification of Diseases (ICD) 9 and 10 in multiple languages, DrugBank, the Logical Observation Identifiers Names and Codes (LOINC) in several languages, the Medical Dictionary for Regulatory Activities (MedDRA) and the SNOMED-CT terminology, to name a few. When a concept is added to the Metathesaurus, it receives a unique identifier called the Concept Unique Identifier (CUI). The CUI is used to connect all the concepts from different source vocabularies that refer to the same meaning. For example, the entry Carcinoma of breast from the SNOMEDCT_US 14 terminology and the entry Carcinomas, Breast from the MeSH vocabulary are associated with the same UMLS CUI, C0678222. The UMLS Metathesaurus is released twice a year. The current release, 2023AA 15, contains approximately 3.31 million concepts and 15.7 million unique concept names from 185 source vocabularies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Medical Subject Headings (MeSH)</head><p>Medical Subject Headings (MeSH) <ref type="bibr" target="#b10">[10]</ref> is a comprehensive controlled vocabulary used for indexing, cataloging, and searching for biomedical and health-related information and documents. MeSH consists of terms or descriptors representing various aspects of biomedical concepts, such as diseases, anatomy, drugs, and medical procedures. Each MeSH term is assigned a unique identifier called a MeSH ID or MeSH Heading (MH). Here is an example of an entry from MeSH listing a term, its MeSH ID and its description:</p><p>• MeSH Term: Hypertension</p><p>• MeSH ID: D006973</p><p>• Description: A condition characterized by elevated blood pressure persistently exceeding 140 mm Hg systolic or 90 mm Hg diastolic.</p><p>13 Wikidata statistics: https://www.wikidata.org/wiki/Wikidata:Statistics</p><p>14 SNOMED-CT, US edition: https://www.nlm.nih.gov/healthit/snomedct/us_edition.html</p><p>15 UMLS release: https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/notes.html</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Disease Ontology (DO)</head><p>The Disease Ontology (DO) <ref type="bibr" target="#b11">[11]</ref> is a publicly accessible ontological representation of human diseases, designed to establish unambiguous disease definitions based on etiological classifications. Its primary objective is to ensure standardized utilization and incorporation of disease information in biomedical data annotation. The latest version of the Disease Ontology was released in June 2023 and contains 11,349 disease terms <ref type="foot" target="#foot_10">16</ref>. Each disease in the Disease Ontology is assigned a unique identifier called a Disease Ontology ID (DOID). For example, DOID:1612 indicates the disease Breast Cancer. The Disease Ontology undergoes biannual updates to its vocabulary mappings by extracting CUIs from the UMLS MRCONSO.RRF file. In the current release, 7,075 of the DO terms are mapped to corresponding UMLS CUIs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Constructing the WikiMed-DE Dataset</head><p>The WikiMed-DE dataset consists of a selection of articles from the German Wikipedia. The core of the construction process is the fact that each Wikipedia article has a unique Wikipedia page ID and that the majority of the Wikipedia page IDs have a corresponding Wikidata ID, the QID. The QID is a unique identifier of a data item in Wikidata, consisting of the letter Q followed by a sequence of digits. The QID connects Wikipedia pages written in various languages to the language-independent Wikidata items and transitively to all the structured knowledge already linked to each Wikidata item.</p><p>In the case of WikiMed-DE the QID is used to map each German Wikipedia article to three types of biomedical concept IDs: the Concept Unique Identifier (CUI) from the UMLS, the MeSH ID from the MeSH hierarchy, and the Disease Ontology ID (DOID) from the Disease Ontology.</p><p>The UMLS provides extensive coverage for biomedical concepts and encompasses several medical terminologies from different biomedical vocabularies. Annotating mentions with UMLS CUIs ensures that WikiMed-DE covers a diverse range of medical concepts. The MeSH hierarchy is used for the semantic indexing of PubMed. Because the MeSH hierarchy is integrated into the UMLS, each MeSH ID can be mapped to UMLS CUIs, thereby increasing the number of mentions in WikiMed-DE. The DOID is widely employed in biomedical research to annotate and analyze disease-related data. Similar to the MeSH ID, DOIDs can also be mapped to UMLS CUIs, thus enabling the integration of more disease-specific information into WikiMed-DE. 
By incorporating mentions linked to the UMLS, MeSH, and Disease Ontology, WikiMed-DE gains access to an extensive array of interconnected biomedical concepts, resulting in a rich and diverse collection of biomedical-specific information.</p><p>Following WikiMed <ref type="bibr" target="#b12">[12]</ref>, the WikiMed-DE mentions annotated with UMLS CUIs are further mapped to UMLS semantic types. The UMLS semantic types serve as broad categories or classes that group medical concepts within UMLS. Each semantic type is identified by a unique identifier called the Type Unique Identifier (TUI), which is composed of the letter T followed by three digits. Semantic types in the UMLS encompass various categories such as sign or symptom (T184), cell component (T026), immunologic factor (T129), and others. There are a total of 127 semantic types connected via 54 relations in the Semantic Network of the UMLS.</p></div>
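<div xmlns="http://www.tei-c.org/ns/1.0"><p>The identifier formats described above lend themselves to a simple sanity check. The following is a minimal sketch and not part of the released code: the function name id_kind is hypothetical, and the CUI pattern (the letter C followed by seven digits) is an assumption inferred from examples such as C0024485.</p>
<p>
```python
import re

# QID and TUI shapes follow the description in the text.
# The CUI shape (C + 7 digits) is an assumption based on the examples given.
PATTERNS = (
    ("QID", re.compile(r"Q[0-9]+")),
    ("TUI", re.compile(r"T[0-9]{3}")),
    ("CUI", re.compile(r"C[0-9]{7}")),
)


def id_kind(identifier):
    """Return 'QID', 'TUI' or 'CUI' for a well-formed identifier, else None."""
    for kind, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return kind
    return None


kinds = [id_kind(x) for x in ("Q734916", "T184", "C0024485", "T18")]
```
</p></div>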
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Obtaining German Wikipedia Articles</head><p>The WikiMed-DE dataset is based on a database dump of the German Wikipedia from 20 June 2023 <ref type="foot" target="#foot_11">17</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Mapping Wikipedia Articles to Wikidata</head><p>We map the Wikipedia articles and their mentions to Wikidata QIDs using another file from the German Wikipedia database dump, namely dewiki-20230620-page_props.sql.gz. This file relates a page ID from the German Wikipedia to the corresponding Wikidata QID. For instance, the first entry (1, 'wikibase_item', 'Q734916', NULL) in this file indicates that the Wikidata QID associated with German Wikipedia page ID 1 is Q734916.</p><p>The result of this mapping step is a CSV file containing the QID and the Wikipedia page ID for each entry. Among the 4,579,135 Wikipedia articles from the previous step, a total of 1,754,551 page IDs lack a corresponding QID. This is due to the fact that not all the page IDs in the archive dewiki-20230620-pages-articles-multistream.xml.bz2 appear in the archive dewiki-20230620-page_props.sql.gz. In the next steps we focus on the 2,824,584 articles that have a corresponding QID.</p></div>
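<div xmlns="http://www.tei-c.org/ns/1.0"><p>The page ID to QID extraction can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code used for WikiMed-DE: it assumes that the decompressed SQL file lists tuples in the form shown above, the function name page_to_qid is hypothetical, and all sample rows other than the first one are made up.</p>
<p>
```python
import re

# Matches tuples such as (1,'wikibase_item','Q734916',NULL) and captures
# the page ID and the QID. Tuples for other properties are ignored.
ROW_RE = re.compile(r"\((\d+),\s*'wikibase_item',\s*'(Q\d+)'")


def page_to_qid(sql_text):
    """Map page IDs to QIDs for every wikibase_item tuple in the SQL text."""
    return {int(page_id): qid for page_id, qid in ROW_RE.findall(sql_text)}


# Only the first tuple is taken from the text; the other two are made up.
sample = (
    "INSERT INTO `page_props` VALUES "
    "(1,'wikibase_item','Q734916',NULL),"
    "(2,'defaultsort','Beispiel',NULL),"
    "(3,'wikibase_item','Q42',NULL);"
)
mapping = page_to_qid(sample)
```
</p>
<p>Page IDs that never appear in such a mapping correspond to the articles without a QID that are set aside in this step.</p></div>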
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Mapping Wikidata QIDs to Biomedical Concept IDs</head><p>The official Wikidata SPARQL endpoint <ref type="foot" target="#foot_13">19</ref> was used to generate a mapping from QIDs to biomedical concept IDs. Three properties from Wikidata were targeted:</p><p>• P2892:UMLS CUI <ref type="foot" target="#foot_14">20</ref>, which maps a Wikidata item to its UMLS CUI, if one is available;</p><p>• P486:MeSH descriptor ID <ref type="foot" target="#foot_15">21</ref>, which maps a Wikidata item to the Medical Subject Headings identifier, if it has one; and</p><p>• P699:Disease Ontology ID <ref type="foot" target="#foot_16">22</ref>, which connects a Wikidata item to its ID in the Disease Ontology, if such a mapping exists.</p><p>All three properties feature a single-value constraint <ref type="foot" target="#foot_17">23</ref> in Wikidata, which states that this property generally contains a single value per item. However, as we will show in Section 4.6, these constraints are not enforced in Wikidata, making it possible for a Wikidata item to have, for example, multiple CUIs associated with it. We obtained a mapping from QIDs to 763,859 UMLS CUIs, 38,607 MeSH IDs and 10,609 DOIDs.</p></div>
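<div xmlns="http://www.tei-c.org/ns/1.0"><p>A query for each of the three properties could look like the sketch below. The exact queries used for WikiMed-DE are not given in the text, so the query shape, the function name build_query and the optional LIMIT clause are assumptions; only the property IDs come from the description above. The wdt: prefix is predefined on the Wikidata Query Service, so no PREFIX declaration is included.</p>
<p>
```python
# Property IDs as listed in the text; the dictionary keys are illustrative.
PROPERTIES = {"cui": "P2892", "mesh": "P486", "doid": "P699"}


def build_query(property_id, limit=None):
    """Build a SPARQL query listing (item, value) pairs for one property."""
    query = "SELECT ?item ?value WHERE { ?item wdt:%s ?value . }" % property_id
    if limit is not None:
        query += " LIMIT %d" % limit
    return query


queries = {name: build_query(pid) for name, pid in PROPERTIES.items()}
```
</p>
<p>Because the single-value constraint is not enforced, the result of such a query should be collected as a mapping from each QID to a list of values rather than to a single value.</p></div>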
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Filtering Wikipedia Articles</head><p>WikiMed-DE is meant to serve as training material for BEL models. It is therefore important to filter the German Wikipedia articles and retain only those related to the biomedical domain. The articles are filtered based on the mapping of QIDs to the three biomedical concepts of interest described in Section 4.3. We retain only those articles where the QID is associated with at least one of these three biomedical IDs, resulting in 54,514 articles. However, in some cases, an article will have a title, a QID and a valid mapping, but no text; such articles are also filtered out. At the end of the filtering step there are 53,981 German Wikipedia articles left.</p></div>
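<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two filtering criteria can be expressed compactly. The sketch below is illustrative only: the record layout (dictionaries with qid and text keys), the function name filter_articles and the placeholder identifiers in the sample data are assumptions, not the structures used in the released code.</p>
<p>
```python
def filter_articles(articles, qid_to_cui, qid_to_mesh, qid_to_doid):
    """Keep articles whose QID has a biomedical ID and whose text is non-empty."""
    biomedical_qids = set(qid_to_cui) | set(qid_to_mesh) | set(qid_to_doid)
    kept = []
    for article in articles:
        has_mapping = article.get("qid") in biomedical_qids
        has_text = bool((article.get("text") or "").strip())
        if has_mapping and has_text:
            kept.append(article)
    return kept


# Placeholder records: the QIDs and concept IDs below are not real mappings.
articles = [
    {"qid": "Q1", "text": "Artikeltext"},   # mapped and has text: kept
    {"qid": "Q2", "text": "   "},           # mapped but empty: dropped
    {"qid": "Q3", "text": "Artikeltext"},   # no biomedical ID: dropped
]
kept = filter_articles(articles, {"Q1": ["C0000001"]}, {"Q2": ["D000001"]}, {})
```
</p></div>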
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.">Mapping Mentions to Wikidata</head><p>In order to fulfill its goal as a training resource for BEL, the hyperlinked words or phrases in the WikiMed-DE articles need to be identified and annotated with biomedical concepts. These are extracted using regular expressions, by identifying HTML-encoded tags in the article text and decoding them to generate URLs and clean text.</p><p>In this step we store, for each article, its title, text, URL, and a list of mentions corresponding to the hyperlinked words or phrases in the text. For each mention we record the text of the mention and the corresponding URL. We then save a list containing 317,010 unique mention URLs, which we need to map to Wikipedia page IDs. However, the URLs that are originally associated with the mentions are different from the URLs needed to obtain the page ID for each mention. For example, for the page with the title Getreide we need to map from the original mention URL <ref type="foot" target="#foot_18">24</ref> to the corresponding Wikipedia page information URL <ref type="foot" target="#foot_19">25</ref>.</p><p>To obtain this mapping, we send a request for each Wikipedia page information URL and retrieve the corresponding Wikipedia page ID. In this way, 259,250 mention URLs are successfully matched to a corresponding page ID. The rest of the mention URLs (57,760 links) yield no page information because some links do not exist in the German Wikipedia. For example, in Figure <ref type="figure" target="#fig_3">2</ref>, the URL corresponding to the red mention Aktin-bindenden Proteinen lacks a corresponding page ID because it is a link to a page that does not yet exist in the German Wikipedia. Such cases are relatively frequent, as Wikipedia editors routinely add links to pages that will be created only in a subsequent step.</p><p>The next step is to map the page IDs corresponding to each mention to QIDs. 
We use the previously generated CSV file (from Section 4.2), which contains a mapping from Wikipedia page IDs to QIDs. Leveraging this resource, 206,549 page IDs out of the 259,250 are uniquely mapped to a QID. The mapping is, however, incomplete, with 52,701 page IDs lacking corresponding QIDs. We examined a small sample of these URLs and discovered two issues: (i) some hyperlinks point to sections within the same article, and therefore cannot be mapped to a separate QID, and (ii) some of the URLs extracted from the hyperlink tags are redirects. For example, the page Hydrophil<ref type="foot" target="#foot_20">26</ref> redirects to Hydrophilie<ref type="foot" target="#foot_21">27</ref>. The page ID information is not typically stored on the redirect page, but only on the target URL.</p><p>After the two previous steps, 110,461 mention URLs are still not successfully mapped to their corresponding QIDs: 57,760 mention URLs lack a page ID and 52,701 mention URLs have a page ID but no QID. To address the redirection problem, we used the wikipedia<ref type="foot" target="#foot_22">28</ref> Python package to obtain a page ID for the URLs that were still missing a QID. This package allows one to look for a Wikipedia page given the title of the page and the language code of the wiki ('de' in our case). By specifying the flag redirect=True one can also find the page IDs for redirects. We applied wikipedia's page function to these 110,461 mention URLs and obtained an extra 46,541 correct mappings to QIDs.</p></div>
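<div xmlns="http://www.tei-c.org/ns/1.0"><p>The redirect handling described above can be sketched as follows. This is a hedged illustration, not the released implementation: the helper names mention_title and resolve_page_id are hypothetical, the example URLs assume the usual de.wikipedia.org/wiki/ address form, and disabling auto_suggest is our own choice for the sketch. The second function needs the third-party wikipedia package and network access, so it is defined but not called here.</p>
<p>
```python
from urllib.parse import unquote, urlsplit


def mention_title(url):
    """Split an article URL into (page title, section name or None)."""
    parts = urlsplit(url)
    last_segment = parts.path.rsplit("/", 1)[-1]
    title = unquote(last_segment).replace("_", " ")
    section = unquote(parts.fragment) or None
    return title, section


def resolve_page_id(title):
    """Look up the page ID for a title, following redirects (needs network)."""
    import wikipedia  # third-party package named in the text

    wikipedia.set_lang("de")
    page = wikipedia.page(title, auto_suggest=False, redirect=True)
    return int(page.pageid)


plain = mention_title("https://de.wikipedia.org/wiki/Getreide")
# The section name in this URL is made up for illustration.
with_section = mention_title("https://de.wikipedia.org/wiki/Hydrophilie#Beispiel")
```
</p>
<p>A mention whose URL carries a section name points into an existing article, which corresponds to issue (i) above; titles that are redirects, such as Hydrophil, correspond to issue (ii) and are the case that the redirect flag addresses.</p></div>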
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6.">Integrating the Mappings to Biomedical Concept IDs</head><p>The methodology described in Section 4.3 is used to map the QIDs to three biomedical concepts: the UMLS CUI, the MeSH ID, and the DOID. When a particular Wikidata item contains statements about biomedical IDs, it will, most frequently, contain a statement about the associated UMLS CUI of that item. However, in some cases, the Wikidata items contain statements about their MeSH IDs or their DOIDs, but do not include statements recording their UMLS CUIs. Consequently, to increase the number of linked biomedical concepts, we extract UMLS CUIs, MeSH IDs and DOIDs for each article. In the WikiMed-DE dataset, we use the tags wikidata_cui, mesh and doid to label this information. Figure <ref type="figure" target="#fig_3">2</ref> shows a sample of WikiMed-DE: an article together with its meta-information and its annotations.</p><p>The MeSH hierarchy is integrated into the UMLS. Therefore we can use the file MRCONSO.RRF from the UMLS release to map the MeSH IDs to UMLS CUIs, resulting in a mapping that is not always unique. In WikiMed-DE, the CUIs mapped from MeSH IDs are saved under the tag mesh_cui. A similar mapping is also performed for the DOIDs, using the file doid.json from Disease Ontology's current release. In WikiMed-DE the CUIs mapped from the DOIDs are saved as doid_cui.</p><p>Finally, we consolidate the CUI information saved under the tags wikidata_cui, mesh_cui and doid_cui. If, under all these tags, there is only a single CUI for a mention or an article, this unique CUI is saved under the tag cui in WikiMed-DE. This unique CUI is further mapped to one or more TUIs using the file MRSTY.RRF from the UMLS release. The list of TUIs and the corresponding semantic type labels are saved under tui and semantic_type, respectively.</p><p>Some of the mentions will not have a single CUI, but several possible CUIs. 
This can have several reasons: the Wikidata item already maps to multiple CUIs, despite the fact that the property P2892 has a single-value constraint; the mapping from MeSH ID to CUIs resulted in several CUIs; or the consolidation step led to a list of CUIs rather than a single CUI.</p><p>In any case, we have no automatic method to choose a single correct CUI among the given ones, so we will typically just record all this information.</p></div>
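<div xmlns="http://www.tei-c.org/ns/1.0"><p>The consolidation rule can be illustrated with a short sketch. It reflects our reading of the description above, namely that a unique CUI is assigned only when the union of the three CUI tags contains exactly one value; the function name consolidate_cui and the assumption that each tag holds a list of CUIs are hypothetical. The second sample record uses a made-up placeholder CUI.</p>
<p>
```python
CUI_TAGS = ("wikidata_cui", "mesh_cui", "doid_cui")


def consolidate_cui(record):
    """Return a copy of the record with a 'cui' tag set only if it is unique."""
    candidates = set()
    for tag in CUI_TAGS:
        candidates.update(record.get(tag) or [])
    result = dict(record)
    # Ambiguous or missing CUIs are left unresolved; the source tags are kept.
    result["cui"] = next(iter(candidates)) if len(candidates) == 1 else None
    return result


unique = consolidate_cui({"wikidata_cui": ["C0678222"], "mesh_cui": ["C0678222"]})
# C9999999 is a placeholder used only to create an ambiguous case.
ambiguous = consolidate_cui({"wikidata_cui": ["C0678222"], "doid_cui": ["C9999999"]})
```
</p>
<p>Keeping the original tags alongside the consolidated one preserves the ambiguous cases for later inspection instead of discarding them.</p></div>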
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.7.">Combining All the Information to Create WikiMed-DE</head><p>To obtain the final version of WikiMed-DE, we reprocess each of the articles and add the information we extracted in the previous steps for each article. The text within each Wikipedia page is decoded, removing any HTML tags and producing clean text and mention URLs. The start and end indices for each mention are recorded, thus enabling precise identification of the mention's position. In some cases, the extracted start and end positions will not overlap with natural token boundaries (see Figure <ref type="figure" target="#fig_3">2</ref> for an example). Each mention is associated with its corresponding QID and all the extracted biomedical information. At the end of the extraction process, WikiMed-DE consists of a list of German Wikipedia articles. For each article we save the article's title, text, URL, QID, the biomedical indices (CUI, TUIs, semantic type labels, Wikidata CUI, MeSH ID, MeSH-derived CUI, DOID and DOID-derived CUI) and mention list, where the mentions correspond to the hyperlinked words or phrases in the text. For each mention we record the start and end indices, URL, QID and the biomedical indices (same as for the article).</p></div>
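<div xmlns="http://www.tei-c.org/ns/1.0"><p>A consumer of the dataset may want to verify that the recorded indices actually select the mention text. The sketch below is illustrative: the key names text, mentions, start and end, the half-open index convention and the function name check_offsets are assumptions about the record layout, and the sample article is constructed for the example rather than taken from WikiMed-DE.</p>
<p>
```python
def check_offsets(article):
    """Return the mentions whose recorded slice does not match their text."""
    text = article["text"]
    return [
        mention
        for mention in article["mentions"]
        if text[mention["start"]:mention["end"]] != mention["text"]
    ]


# Constructed example; identifiers are left empty instead of being invented.
article = {
    "title": "Beispiel",
    "text": "Aktin bindet an Myosin.",
    "mentions": [
        {"text": "Myosin", "start": 16, "end": 22, "qid": None, "cui": None},
        {"text": "Aktin", "start": 1, "end": 6, "qid": None, "cui": None},
    ],
}
mismatches = check_offsets(article)
```
</p>
<p>In this example the second mention is reported, because its indices are shifted by one character. Such a check only validates the indices; it does not detect spans that cut across token boundaries.</p></div>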
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">WikiMed-DE Dataset Statistics</head><p>WikiMed-DE is a dataset consisting of 53,981 German Wikipedia articles, each containing multiple mentions. The filtering step described in Section 4.4 ensures that all the articles in the dataset describe biomedical concepts mentioned either in the UMLS, in the MeSH hierarchy or in the Disease Ontology. The WikiMed-DE articles and the mentions therein are mapped to QIDs and to biomedical concept IDs. We can therefore analyze the dataset at two levels: at the article level and at the mention level.</p><p>[...] of unique CUIs is much smaller than the number of QIDs. However, we believe that a large number of the items that have a QID in this dataset will, at some point, be connected to the UMLS and assigned a CUI. This is because many of the linked entities are still biomedical entities that are just not yet marked as such in Wikidata, or are marked using other biomedical IDs (e.g. P351: Entrez Gene ID). By including the QID information in the dataset, we give the research community the possibility to customize the dataset annotations to include other relevant biomedical information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Statistics at the Article Level</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">WikiMed-DE-BEL Quality</head><p>To assess the data quality of WikiMed-DE-BEL, a sample of 50 mentions annotated with a single CUI was randomly selected from the dataset. One of the authors checked the automatically annotated CUI for each mention, comparing it to the information available in the UMLS, using the UMLS Metathesaurus Browser<ref type="foot" target="#foot_23">29</ref>. The information was also compared to the context available in the Wikipedia article. All 50 mentions were found to link to the correct concept and to match their context accurately. This suggests that the strict filtering of problematic instances (e.g. mentions with multiple CUIs, or with missing links or QIDs) led to the creation of a high-quality dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Limitations and Conclusion</head><p>Accuracy of Information: WikiMed-DE's annotation process relies heavily on the accuracy and completeness of the information available in Wikipedia and Wikidata. However, these sources are not immune to errors, inconsistencies, or vandalism. As a result, inaccurate or outdated information in the source material can propagate into the annotations in WikiMed-DE.</p><p>Noise and Ambiguity: Automated annotation processes can introduce noise and ambiguity into the dataset. The automated methods used to match Wikipedia articles with biomedical concepts may still fail to fully disambiguate mentions. For example, we cannot systematically choose a single CUI for an entity if multiple CUIs are annotated in Wikidata. We try to limit the noise and ambiguity by enforcing stricter constraints, e.g., we only consider mentions with a unique CUI mapping as valid annotations. Nevertheless, WikiMed-DE remains a silver-standard dataset, since not all the proposed annotations were manually verified by domain experts.</p><p>Link coverage: Not all entities mentioned in a Wikipedia page are marked with hyperlinks, meaning that many possible mentions will not be annotated. Furthermore, because we focus on the quality of annotations, we also discard a portion of the marked hyperlinks. This leads to a dataset that has a lower mention coverage than a typical biomedical dataset. WikiMed-DE is therefore less suited for biomedical named entity recognition tasks. However, we believe that it is a useful resource for training BEL systems, and it has great potential to be further developed as new information is added to Wikidata.</p><p>This paper presented a new resource for disambiguating biomedical entities in German, WikiMed-DE, and a benchmark dataset for biomedical entity linking in German, WikiMed-DE-BEL, thus supporting biomedical entity linking research focusing on the German language.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1 illustrates the process of constructing the WikiMed-DE dataset: Wikidata is used to map German Wikipedia articles to biomedical concept identifiers such as UMLS CUIs. Both the articles and the mentions (i.e. the links) in the articles are mapped to biomedical concepts.</figDesc><graphic coords="6,67.40,134.08,161.66,92.25" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>WikiExtractor<ref type="foot" target="#foot_12">18</ref> was used to extract the article title, page ID, URL and text from the archive dewiki-20230620-pages-articles-multistream.xml.bz2 and save them in JSON format. The text of each article retains its hyperlinks as HTML-encoded tags, which are needed for the annotation step. WikiExtractor outputs roughly 10,000 files from this archive, each containing on average 450 Wikipedia articles. These files are combined in a post-processing step into a single JSON file containing 4,579,135 Wikipedia articles.</figDesc></figure>
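<div xmlns="http://www.tei-c.org/ns/1.0"><p>The post-processing step that merges the extracted files could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes that each extracted file holds one JSON object per line, that the files are named wiki_* inside subdirectories of the output folder, and that the keys are called id, url, title and text. The function names parse_json_lines and combine_extracted_files are hypothetical.</p>
<p>
```python
import json
from pathlib import Path

# Fields named in the text; the exact key names are an assumption.
FIELDS = ("id", "url", "title", "text")


def parse_json_lines(lines):
    """Parse an iterable of JSON-lines strings, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Keep only the fields of interest; missing keys become None.
        records.append({key: record.get(key) for key in FIELDS})
    return records


def combine_extracted_files(input_dir, output_file, pattern="wiki_*"):
    """Merge all extracted files under input_dir into one JSON array.

    Returns the number of articles written.
    """
    articles = []
    # Sorting makes the order of the merged articles reproducible.
    for path in sorted(Path(input_dir).rglob(pattern)):
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                articles.extend(parse_json_lines(handle))
    with open(output_file, "w", encoding="utf-8") as handle:
        json.dump(articles, handle, ensure_ascii=False)
    return len(articles)
```
</p>
<p>Note that this sketch holds all articles in memory before writing. For an archive with several million articles, writing the records incrementally may be preferable.</p></div>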
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Filamine</head><label></label><figDesc>(FLN) sind Proteine bei Eukaryoten und gehören zu den Aktin-bindenden Proteinen (ABP). Sie sind an der Quervernetzung von Aktinfilamenten, einem Hauptbestandteil des Zytoskeletts, sowie der Vernetzung von Aktinfilamenten mit Proteinen in der Zellmembran beteiligt.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: A sample of WikiMed-DE. In this snippet, there are five kinds of mentions. Mentions in blue have a valid URL, a QID, and a unique CUI. The mention in red links to a German Wikipedia article that does not yet exist, so it has no QID. The mention in green has a valid URL and QID, but no CUI. The mention in orange has multiple CUIs. For the mention in purple, the HTML tags do not coincide with a full token of the underlying text (Zytoskelett vs. Zytoskeletts).</figDesc><graphic coords="9,235.99,205.54,265.64,203.69" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>The distribution of mentions and unique entities across biomedical concepts</figDesc><table><row><cell></cell><cell>qid</cell><cell>cui</cell><cell>wikidata_cui</cell><cell>mesh</cell><cell>mesh_cui</cell><cell>doid</cell><cell>doid_cui</cell></row><row><cell>Number of Mentions</cell><cell>1,867,554</cell><cell>577,547</cell><cell>746,022</cell><cell>691,383</cell><cell>691,084</cell><cell>44,557</cell><cell>41,595</cell></row><row><cell>Percentage</cell><cell>95.79%</cell><cell>29.59%</cell><cell>38.30%</cell><cell>35.46%</cell><cell>35.44%</cell><cell>2.28%</cell><cell>2.13%</cell></row><row><cell>Number of Unique Mention URLs</cell><cell>253,090</cell><cell>47,380</cell><cell>57,751</cell><cell>31,901</cell><cell>31,866</cell><cell>4,039</cell><cell>3,534</cell></row><row><cell>Percentage</cell><cell>79.82%</cell><cell>14.94%</cell><cell>18.21%</cell><cell>10.06%</cell><cell>10.05%</cell><cell>1.27%</cell><cell>1.11%</cell></row><row><cell>Total Mentions</cell><cell></cell><cell></cell><cell cols="2">1,951,081</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Total Unique Mention URLs</cell><cell></cell><cell></cell><cell></cell><cell>317,010</cell><cell></cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Table 2 displays the percentage of WikiMed-DE articles associated with various identifiers. All articles have an associated QID and are mapped to at least one biomedical concept ID. 88.39% of the articles have a unique CUI associated with them. The articles are annotated with the three biomedical concept IDs of interest to varying degrees: 98.34% are annotated with one or more CUIs based on Wikidata information, 29.15% with a MeSH ID and 4.40% with a DOID. WikiMed-DE contains a total of 198,356 unique QIDs, 66,955 unique UMLS CUIs (including the MeSH-derived and DOID-derived CUIs), 15,915 MeSH IDs, 2,400 DOIDs, and 125 TUIs.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>The distribution of biomedical concepts connected to the WikiMed-DE articles. For the convenience of researchers interested only in a biomedical entity linking benchmark, we provide a curated subset, called WikiMed-DE-BEL, which focuses exclusively on the mentions that have a unique UMLS CUI associated with them. WikiMed-DE-BEL contains 413,913 mentions corresponding to 35,012 unique mention URLs. All mentions are automatically annotated with a single CUI; mentions annotated with multiple CUIs are discarded. We also discard any mention whose start or end index does not coincide with the start or end of a token. The dataset was divided into train, test and development splits using an 80-10-10 ratio, leading to 43,184 train articles, 5,399 test articles and 5,398 dev articles. The dataset portions contain 330,233 (train), 41,120 (test) and 42,560 (dev) mentions, annotated with 222,247 (train), 9,123 (test) and 9,149 (dev) unique UMLS CUIs, respectively. 833 (9.12%) of the concepts in the test set do not occur in training.</figDesc><table><row><cell></cell><cell>qid</cell><cell>cui</cell><cell>wikidata_cui</cell><cell>mesh</cell><cell>mesh_cui</cell><cell>doid</cell><cell>doid_cui</cell></row><row><cell>Number of articles</cell><cell>53,981</cell><cell>47,729</cell><cell>53,085</cell><cell>15,727</cell><cell>15,696</cell><cell>2,377</cell><cell>1,914</cell></row><row><cell>Percentage</cell><cell>100%</cell><cell>88.39%</cell><cell>98.34%</cell><cell>29.15%</cell><cell>29.09%</cell><cell>4.40%</cell><cell>3.55%</cell></row><row><cell>Total Articles</cell><cell></cell><cell></cell><cell></cell><cell>53,981</cell><cell></cell><cell></cell><cell></cell></row></table></figure>
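<div xmlns="http://www.tei-c.org/ns/1.0"><p>The mention filter and the 80-10-10 article split described for WikiMed-DE-BEL could be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the mention keys start, end and cuis are hypothetical, token alignment is approximated by checking for alphanumeric neighbours instead of running a tokenizer, and the rounding rule in split_sizes is simply one choice that reproduces the reported article counts. The shuffling seed is likewise an assumption.</p>
<p>
```python
import random


def is_token_aligned(text, start, end):
    """True if the span [start, end) does not cut through a word."""
    left_ok = start == 0 or not text[start - 1].isalnum()
    right_ok = end == len(text) or not text[end].isalnum()
    return left_ok and right_ok


def keep_for_bel(text, mention):
    """Keep a mention only if it has exactly one CUI and is token-aligned.

    `mention` is a dict with hypothetical keys: start, end, cuis.
    """
    return len(mention["cuis"]) == 1 and is_token_aligned(
        text, mention["start"], mention["end"]
    )


def split_sizes(n_articles):
    """Article counts (train, test, dev) for an 80-10-10 split."""
    n_train = n_articles * 8 // 10
    n_dev = (n_articles - n_train) // 2
    n_test = n_articles - n_train - n_dev
    return n_train, n_test, n_dev


def split_articles(articles, seed=0):
    """Shuffle the articles and cut them into train, test and dev lists."""
    shuffled = list(articles)
    random.Random(seed).shuffle(shuffled)
    n_train, n_test, _ = split_sizes(len(shuffled))
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    dev = shuffled[n_train + n_test:]
    return train, test, dev
```
</p>
<p>With this rounding rule, split_sizes(53981) yields 43,184 train, 5,399 test and 5,398 dev articles, matching the counts reported above. Splitting at the article level keeps all mentions of one article in the same portion.</p></div>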
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">MeSH: https://www.ncbi.nlm.nih.gov/mesh/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">SNOMED-CT: https://www.snomed.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">German Wikipedia: https://de.wikipedia.org/wiki/Wikipedia:Hauptseite</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3">WikiMed-DE dataset: https://doi.org/10.5281/zenodo.8188966</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_4">WikiMed-DE code repo: https://github.com/AI4MedCode/wikimed-de</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_5">GeneOntology: http://geneontology.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_6">ICD-10-CM: https://www.cdc.gov/nchs/icd/icd-10-cm.htm</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_7">OMIM: https://www.omim.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_8">Wikidata: https://www.wikidata.org/wiki/Wikidata:Main_Page</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_9">ICD-10-GM: www.dimdi.de/dynamic/de/klassifikationen/icd/icd-10-gm/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="16" xml:id="foot_10">Disease Ontology release: https://github.com/DiseaseOntology/HumanDiseaseOntology/releases</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="17" xml:id="foot_11">German Wikipedia dump, 20.06.2023: https://dumps.wikimedia.org/dewiki/20230620/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="18" xml:id="foot_12">WikiExtractor: https://github.com/attardi/wikiextractor</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="19" xml:id="foot_13">Official Wikidata SPARQL endpoint: https://query.wikidata.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="20" xml:id="foot_14">P2892:UMLS CUI: https://www.wikidata.org/wiki/Property:P2892</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="21" xml:id="foot_15">P486:MeSH descriptor ID: https://www.wikidata.org/wiki/Property:P486</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="22" xml:id="foot_16">P699:Disease Ontology ID: https://www.wikidata.org/wiki/Property:P699</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="23" xml:id="foot_17">Wikidata item for single-value constraint: https://www.wikidata.org/wiki/Q19474404</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="24" xml:id="foot_18">Mention URL for Getreide: https://de.wikipedia.org/wiki/Getreide</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="25" xml:id="foot_19">Wikipedia page info URL for Getreide: https://de.wikipedia.org/w/index.php?title=Getreide&amp;action=info</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="26" xml:id="foot_20">Hydrophil: https://de.wikipedia.org/wiki/Hydrophil</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="27" xml:id="foot_21">Hydrophilie: https://de.wikipedia.org/wiki/Hydrophilie</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="28" xml:id="foot_22">wikipedia Python package: https://pypi.org/project/wikipedia/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="29" xml:id="foot_23">UMLS Metathesaurus Browser: https://uts.nlm.nih.gov/uts/umls/home</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>Yi Wang and Corina Dima were supported by the Ministry for Economics, Labour and Tourism from Baden-Württemberg, Germany via grant agreement number BW1_1456 (AI4MedCode).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Statistics at the Mention Level</title>
		<imprint>
			<date/>
		</imprint>
	</monogr>
	<note>WikiMed-DE contains a total of 1,951,081 mentions corresponding to 317,010 unique mention URLs. As shown in Table 1, 95.79% of these mentions have an assigned QID, with the rest having either missing links or missing QID information. CUI information was assigned for 29.59% of the mentions. Note that more CUIs were obtained both directly through Wikidata and through the MeSH ID (38.30% and 35.46%, respectively). However, some of these mentions have multiple CUIs assigned, and are therefore ambiguous from the point of view of biomedical entity linking. The disambiguation step is non-trivial and cannot be done without the help of experts, so we decided to keep all the information in the dataset. Disease Ontology information is available only for a small percentage of the mentions. Of the 317,010 unique mention URLs, 79.82% have an assigned QID and 14.94% have an assigned CUI. WikiMed-DE therefore contains links to 47,380 unique biomedical concepts.</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">The unified medical language system (UMLS): integrating biomedical terminology</title>
		<author>
			<persName><forename type="first">O</forename><surname>Bodenreider</surname></persName>
		</author>
		<idno type="DOI">10.1093/nar/gkh061</idno>
		<ptr target="https://doi.org/10.1093/nar/gkh061" />
	</analytic>
	<monogr>
		<title level="j">Nucleic Acids Res</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="267" to="270" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">NCBI disease corpus: A resource for disease name recognition and concept normalization</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">I</forename><surname>Dogan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Leaman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lu</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.jbi.2013.12.006</idno>
		<ptr target="https://doi.org/10.1016/j.jbi.2013.12.006" />
	</analytic>
	<monogr>
		<title level="j">J. Biomed. Informatics</title>
		<imprint>
			<biblScope unit="volume">47</biblScope>
			<biblScope unit="page" from="1" to="10" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">MedMentions: A large biomedical corpus annotated with UMLS concepts</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mohan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<idno type="DOI">10.24432/C5G59C</idno>
		<ptr target="https://doi.org/10.24432/C5G59C" />
	</analytic>
	<monogr>
		<title level="m">1st Conference on Automated Knowledge Base Construction, AKBC 2019</title>
				<meeting><address><addrLine>Amherst, MA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">May 20-22, 2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">BioCreative V CDR task corpus: a resource for chemical disease relation extraction</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Sciaky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Leaman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Davis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Mattingly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">C</forename><surname>Wiegers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lu</surname></persName>
		</author>
		<idno type="DOI">10.1093/database/baw068</idno>
		<ptr target="https://doi.org/10.1093/database/baw068" />
	</analytic>
	<monogr>
		<title level="j">Database J. Biol. Databases Curation</title>
		<imprint>
			<biblScope unit="volume">2016</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">COMETA: A corpus for medical entity linking in the social media</title>
		<author>
			<persName><forename type="first">M</forename><surname>Basaldella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Shareghi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Collier</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.emnlp-main.253</idno>
		<ptr target="https://doi.org/10.18653/v1/2020.emnlp-main.253" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020</title>
				<editor>
			<persName><forename type="first">B</forename><surname>Webber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Cohn</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</editor>
		<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020</meeting>
		<imprint>
			<date type="published" when="2020">November 16-20, 2020</date>
			<biblScope unit="page" from="3122" to="3137" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">RegEl corpus: identifying DNA regulatory elements in the scientific literature</title>
		<author>
			<persName><forename type="first">S</forename><surname>Garda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Lenihan-Geels</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Proft</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hochmuth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schuelke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Seelow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Leser</surname></persName>
		</author>
		<idno type="DOI">10.1093/database/baac043</idno>
		<ptr target="https://doi.org/10.1093/database/baac043" />
	</analytic>
	<monogr>
		<title level="j">Database J. Biol. Databases Curation</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">The BRENDA tissue ontology (BTO): the first all-integrating ontology of all organisms for enzyme sources</title>
		<author>
			<persName><forename type="first">M</forename><surname>Gremse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Schomburg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Grote</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Scheer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ebeling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schomburg</surname></persName>
		</author>
		<idno type="DOI">10.1093/nar/gkq968</idno>
		<ptr target="https://doi.org/10.1093/nar/gkq968" />
	</analytic>
	<monogr>
		<title level="j">Nucleic Acids Res</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="page" from="507" to="513" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Mondo disease ontology: Harmonizing disease concepts across the world (short paper)</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Vasilevsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Essaid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Matentzoglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">L</forename><surname>Harris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Haendel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">N</forename><surname>Robinson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Mungall</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/Vol-2807/abstractY.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 11th International Conference on Biomedical Ontologies (ICBO) joint with the 10th Workshop on Ontologies and Data in Life Sciences (ODLS) and part of the Bolzano Summer of Knowledge (BoSK 2020), Virtual conference hosted in</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<editor>
			<persName><forename type="first">J</forename><surname>Hastings</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Loebe</surname></persName>
		</editor>
		<meeting>the 11th International Conference on Biomedical Ontologies (ICBO) joint with the 10th Workshop on Ontologies and Data in Life Sciences (ODLS) and part of the Bolzano Summer of Knowledge (BoSK 2020), Virtual conference hosted in<address><addrLine>Bolzano, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020-09-17">September 17, 2020</date>
			<biblScope unit="volume">2807</biblScope>
			<biblScope unit="page" from="1" to="2" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Wikidata: a free collaborative knowledgebase</title>
		<author>
			<persName><forename type="first">D</forename><surname>Vrandecic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Krötzsch</surname></persName>
		</author>
		<idno type="DOI">10.1145/2629489</idno>
		<ptr target="https://doi.org/10.1145/2629489" />
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="page" from="78" to="85" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Medical Subject Headings (MeSH)</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">E</forename><surname>Lipscomb</surname></persName>
		</author>
		<idno type="PMID">10928714</idno>
		<ptr target="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC35238/" />
	</analytic>
	<monogr>
		<title level="j">Bulletin of the Medical Library Association</title>
		<imprint>
			<date type="published" when="2000">2000</date>
			<biblScope unit="volume">88</biblScope>
			<biblScope unit="page" from="265" to="266" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Disease ontology: a backbone for disease semantic integration</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">M</forename><surname>Schriml</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Arze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Nadendla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mazaitis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Felix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">A</forename><surname>Kibbe</surname></persName>
		</author>
		<idno type="DOI">10.1093/nar/gkr972</idno>
		<ptr target="https://doi.org/10.1093/nar/gkr972" />
	</analytic>
	<monogr>
		<title level="j">Nucleic Acids Res</title>
		<imprint>
			<biblScope unit="volume">40</biblScope>
			<biblScope unit="page" from="940" to="946" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets</title>
		<author>
			<persName><forename type="first">S</forename><surname>Vashishth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Newman-Griffis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Dutt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">P</forename><surname>Rosé</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.jbi.2021.103880</idno>
		<ptr target="https://doi.org/10.1016/j.jbi.2021.103880" />
	</analytic>
	<monogr>
		<title level="j">J. Biomed. Informatics</title>
		<imprint>
			<biblScope unit="volume">121</biblScope>
			<biblScope unit="page">103880</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Bio-ID Track Overview</title>
		<author>
			<persName><forename type="first">C</forename><surname>Arighi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hirschman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lemberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bayer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Liechti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Comeau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wu</surname></persName>
		</author>
		<ptr target="https://biocreative.bioinformatics.udel.edu/media/store/files/2018/BC6_track1_1.pdf" />
	</analytic>
	<monogr>
		<title level="j">BioCreative Workshop</title>
		<imprint>
			<biblScope unit="volume">482</biblScope>
			<biblScope unit="page">376</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">MCN: A comprehensive corpus for medical concept normalization</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rumshisky</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.jbi.2019.103132</idno>
		<ptr target="https://doi.org/10.1016/j.jbi.2019.103132" />
	</analytic>
	<monogr>
		<title level="j">J. Biomed. Informatics</title>
		<imprint>
			<biblScope unit="volume">92</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Freebase: a collaboratively created graph database for structuring human knowledge</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">D</forename><surname>Bollacker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Evans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">K</forename><surname>Paritosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Sturge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Taylor</surname></persName>
		</author>
		<idno type="DOI">10.1145/1376616.1376746</idno>
		<ptr target="https://doi.org/10.1145/1376616.1376746" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008</title>
				<editor>
			<persName><forename type="first">J</forename><forename type="middle">T</forename><surname>Wang</surname></persName>
		</editor>
		<meeting>the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008<address><addrLine>Vancouver, BC, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2008">June 10-12, 2008</date>
			<biblScope unit="page" from="1247" to="1250" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Annotation and initial evaluation of a large annotated German oncological corpus</title>
		<author>
			<persName><forename type="first">M</forename><surname>Kittner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lamping</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">T</forename><surname>Rieke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Götze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bajwa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Jelas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Rüter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hautow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sänger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Habibi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zettwitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>De Bortoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ostermann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ševa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Starlinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Kohlbacher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">P</forename><surname>Malek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Keilholz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Leser</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">JAMIA Open</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page">25</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Cross-domain German medical named entity recognition using a pre-trained language model and unified medical semantic types</title>
		<author>
			<persName><forename type="first">S</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hartmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Sonntag</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2023.clinicalnlp-1.31" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 5th Clinical Natural Language Processing Workshop</title>
				<meeting>the 5th Clinical Natural Language Processing Workshop<address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="259" to="271" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Learning domain-specialised representations for cross-lingual biomedical entity linking</title>
		<author>
			<persName><forename type="first">F</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Vulić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Korhonen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Collier</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.acl-short.72</idno>
		<ptr target="https://doi.org/10.18653/v1/2021.acl-short.72" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Zong</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Navigli</surname></persName>
		</editor>
		<meeting>the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2021">August 1-6, 2021</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="565" to="574" />
		</imprint>
	</monogr>
	<note>Volume 2: Short Papers, Virtual Event</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Peer-production system or collaborative ontology engineering effort: what is Wikidata?</title>
		<author>
			<persName><forename type="first">C</forename><surname>Müller-Birn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Karran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lehmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Luczak-Rösch</surname></persName>
		</author>
		<idno type="DOI">10.1145/2788993.2789836</idno>
		<ptr target="https://doi.org/10.1145/2788993.2789836" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 11th International Symposium on Open Collaboration</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Riehle</surname></persName>
		</editor>
		<meeting>the 11th International Symposium on Open Collaboration<address><addrLine>San Francisco, CA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2015">August 19-21, 2015</date>
			<biblScope unit="page">10</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Farda-Sarbas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Müller-Birn</surname></persName>
		</author>
		<idno>CoRR abs/1908.11153</idno>
		<ptr target="http://arxiv.org/abs/1908.11153" />
		<title level="m">Wikidata from a research perspective - A systematic mapping study of Wikidata</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Wikidata: A platform for data integration and dissemination for the life sciences and beyond</title>
		<author>
			<persName><forename type="first">E</forename><surname>Mitraka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Waagmeester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Burgstaller-Muehlbacher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">M</forename><surname>Schriml</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">I</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">M</forename><surname>Good</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/Vol-1546/paper_38.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 8th Semantic Web Applications and Tools for Life Sciences International Conference</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<editor>
			<persName><forename type="first">J</forename><surname>Malone</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Stevens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Forsberg</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Splendiani</surname></persName>
		</editor>
		<meeting>the 8th Semantic Web Applications and Tools for Life Sciences International Conference<address><addrLine>Cambridge, UK</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">December 7-10, 2015</date>
			<biblScope unit="volume">1546</biblScope>
			<biblScope unit="page" from="69" to="73" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">The NCBI taxonomy database</title>
		<author>
			<persName><forename type="first">S</forename><surname>Federhen</surname></persName>
		</author>
		<idno type="DOI">10.1093/nar/gkr1178</idno>
		<ptr target="https://doi.org/10.1093/nar/gkr1178" />
	</analytic>
	<monogr>
		<title level="j">Nucleic Acids Res</title>
		<imprint>
			<biblScope unit="volume">40</biblScope>
			<biblScope unit="page" from="136" to="143" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
