<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>WikiMed-DE: Constructing a Silver-Standard Dataset for German Biomedical Entity Linking using Wikipedia and Wikidata</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yi Wang</string-name>
          <email>yi.wang@ipvs.uni-stuttgart.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Corina Dima</string-name>
          <email>corina.dima@ipvs.uni-stuttgart.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Steffen Staab</string-name>
          <email>s.r.staab@soton.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Southampton</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Stuttgart</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper introduces WikiMed-DE, a silver-standard, automatically annotated biomedical entity linking dataset for the German language. WikiMed-DE comprises 53,981 articles from the German Wikipedia annotated with 1,951,081 mentions corresponding to 317,010 unique mention URLs. The hyperlinks of Wikipedia articles are used to connect concept mentions to Wikidata and transitively to three biomedical concept IDs: the Concept Unique Identifier (CUI) from the Unified Medical Language System (UMLS), the MeSH ID from the Medical Subject Headings (MeSH), and the Disease Ontology ID (DOID) from the Disease Ontology (DO). A curated subset, WikiMed-DE-BEL, is released as a ready-to-use benchmark for biomedical entity linking in German. It features the same number of articles as WikiMed-DE, but only the highest-quality information is retained: 413,913 mentions corresponding to 35,012 unique concepts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Biomedical entity linking (BEL) is an important task for automatically processing text from the
medical domain. It disambiguates entities in text by mapping them to unique identifiers in
ontologies such as the Unified Medical Language System (UMLS)<sup>1</sup> [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In the example sentence <italic>MRI technology has
revolutionized medical imaging</italic>, the term <italic>MRI</italic> could stand for Magnetic Resonance Imaging
(C0024485), Multidrug Resistance Induction (C1513738), or Most Recent Inpatient
(C1546460). The goal of biomedical entity linking is to identify that the text span <italic>MRI</italic> in
the example should be mapped to the concept Magnetic Resonance Imaging, which has the
identifier C0024485 in the UMLS. In this paper we refer to the text span <italic>MRI</italic> as a mention of the
concept Magnetic Resonance Imaging, and to the sentence in which this text span appears
as the context of the mention.
      </p>
      <p>
        Many datasets have been created in order to foster the development of reliable BEL systems,
e.g. the NCBI Disease dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which annotates mentions of diseases in PubMed<sup>2</sup> abstracts;
MedMentions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a collection of PubMed abstracts annotated with concepts from the UMLS;
BC5CDR [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a dataset that focuses on extracting and linking chemical compounds and diseases
from PubMed articles, where every entity was manually annotated by a team of Medical Subject
Headings (MeSH)<sup>3</sup> indexers; COMETA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a dataset consisting of 20,000 biomedical entity
mentions from Reddit, annotated by experts with links to SNOMED CT<sup>4</sup>; or RegEl [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which
maps the manual annotation of regulatory DNA elements within PubMed abstracts to various
ontologies, e.g. tissue entities are mapped to the Brenda Tissue Ontology (BTO) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], while
diseases are mapped to MONDO [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>These datasets were manually annotated by domain experts and focus on entities of interest
from various domains, e.g. diseases, genes, tissues and chemical compounds. Although many
of these datasets are not very large, they provide invaluable information for training machine
learning models for the automatic disambiguation of biomedical entities.</p>
      <p>However, the vast majority of biomedical datasets provide annotations for English texts. For
other languages, like German, datasets annotated with biomedical concepts are extremely scarce
because manual annotation is resource-intensive and requires trained
professionals.</p>
      <p>
        This paper addresses this issue by introducing WikiMed-DE, a silver-standard dataset for
biomedical entity linking for the German language. The automatic annotation process makes
use of the links connecting the text of the German Wikipedia<sup>5</sup> pages with the structured
information available in the Wikidata [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] knowledge base and in three knowledge sources from
the biomedical domain: the Unified Medical Language System (UMLS) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the Medical Subject
Headings (MeSH) hierarchy [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and the Disease Ontology (DO) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Our contributions are the following:
        <list list-type="order">
          <list-item>
            <p>We build upon and extend the procedure introduced by Vashishth et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for creating
the English WikiMed dataset and construct a German dataset for biomedical entity linking
called WikiMed-DE, which we make publicly available<sup>6</sup>. WikiMed-DE is annotated with a
wide range of concepts from the UMLS; a subset of this dataset, WikiMed-DE-BEL, can be
readily used for training broad-coverage biomedical entity linking systems for German.</p>
          </list-item>
          <list-item>
            <p>We provide an extensive description of the steps and resources required to create the
dataset, as well as a public code repository<sup>7</sup>. This makes it straightforward to apply the
procedure for creating similar datasets for other languages, or for updating the dataset
once new information is available in Wikipedia and Wikidata.</p>
          </list-item>
        </list>
      </p>
      <p>The remainder of this article is organized as follows: Section 2 provides an overview of the
related work, Section 3 introduces the knowledge sources used to create WikiMed-DE, Section 4
describes the methodology used to construct WikiMed-DE, Section 5 presents the statistics
of the dataset and assesses its quality, and Section 6 discusses the limitations of WikiMed-DE and
concludes the paper.</p>
      <p>3 MeSH: https://www.ncbi.nlm.nih.gov/mesh/
4 SNOMED-CT: https://www.snomed.org/
5 German Wikipedia: https://de.wikipedia.org/wiki/Wikipedia:Hauptseite
6 WikiMed-DE dataset: https://doi.org/10.5281/zenodo.8188966
7 WikiMed-DE code repo: https://github.com/AI4MedCode/wikimed-de</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>
        Datasets in the BEL domain typically focus on annotating biomedical entities from existing
ontologies. A case in point is the BC6BioID [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] dataset published for the BioCreative challenges,
which focuses on identifying genes or chemicals in English text and maps them to ontologies
like the Gene Ontology<sup>8</sup>. It consists of 17,883 documents and 133,033 mentions referring to
7,652 unique concepts.
      </p>
      <p>
        Medical Concept Normalization (MCN) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is a dataset that focuses primarily on annotating
entities of clinical utility, such as disorders, problems, tests, and treatments in discharge
summaries written in English. These entities are mapped to widely adopted medical terminologies,
such as the UMLS and the International Classification of Diseases (ICD)<sup>9</sup>. MCN consists of
100 discharge summaries and provides normalization for a total of 10,919 concept mentions
corresponding to 3,792 unique concepts.
      </p>
      <p>
        The NCBI disease corpus [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] provides annotations of disease mentions along with their
corresponding concepts, which are represented using either MeSH or Online Mendelian Inheritance
in Man (OMIM)<sup>10</sup> identifiers. The corpus contains 6,892 disease mentions mapped to 790 unique
concepts for a collection of 793 PubMed abstracts written in English.
      </p>
      <p>
        MedMentions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a biomedical dataset containing 4,392 PubMed abstracts annotated with
203,282 mentions. Mentions are linked to UMLS concepts as well as to UMLS semantic types.
      </p>
      <p>
        Vashishth et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] introduce the WikiMed and PubMedDS datasets to facilitate research in
biomedical natural language processing. In constructing the WikiMed dataset, they select the
English Wikipedia as a comprehensive source of articles, while Wikidata<sup>11</sup> [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Freebase
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] are used to establish mappings between the Wikipedia articles and UMLS concepts.
      </p>
      <p>
        The BRONCO [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] dataset is a valuable German-language resource for BEL and healthcare
research. It consists of 200 manually de-identified discharge summaries of cancer patients. The
discharge summaries were meticulously annotated with various medical terminologies,
including diagnoses, treatments, and medications, and further mapped to the German Modification
of the International Classification of Diseases (ICD-10-GM)<sup>12</sup>. Because of its limited size, the
BRONCO dataset is mainly used for the evaluation of biomedical named entity recognition/entity
linking models (e.g. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]).
      </p>
      <p>
        In an effort to support biomedical entity linking in languages other than English, Liu
et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] propose a cross-lingual biomedical entity linking evaluation benchmark, XL-BEL,
for evaluating BEL in 10 typologically diverse languages, including German. However, while
they make use of the WikiMed process introduced by Vashishth et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for creating this
resource, the benchmark contains only 1,000 annotated sentences for each language. While this
number of annotations is adequate for evaluation purposes, it does not suffice for training a
good-quality biomedical entity linker.
      </p>
      <p>8 Gene Ontology: http://geneontology.org/
9 ICD-10-CM: https://www.cdc.gov/nchs/icd/icd-10-cm.htm
10 OMIM: https://www.omim.org/
11 Wikidata: https://www.wikidata.org/wiki/Wikidata:Main_Page
12 ICD-10-GM: www.dimdi.de/dynamic/de/klassifikationen/icd/icd-10-gm/</p>
    </sec>
    <sec id="sec-4">
      <title>3. Structured Knowledge Repositories</title>
      <sec id="sec-4-1">
        <title>3.1. Wikidata</title>
        <p>
          Wikidata [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is a collaborative knowledge base that serves as a data source for numerous
projects in the Wikimedia sphere [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], such as Wikipedia. The main objective of Wikidata is to
ensure consistent and high-quality data across the multiple language versions of Wikipedia.
At the time of writing, Wikidata contains more than 100M items<sup>13</sup>. Wikidata has garnered
significant attention from researchers across diverse fields of study [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. In the
biomedical field, for example, Mitraka et al. [21] used Wikidata as a knowledge base to collect biomedical
concepts such as NCBI Gene and map them to Wikipedia articles.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Unified Medical Language System (UMLS)</title>
        <p>
          The Unified Medical Language System (UMLS) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] is a repository of biomedical vocabularies
developed by the US National Library of Medicine. The main component of UMLS is the
Metathesaurus, which integrates more than 100 vocabularies from different subdomains, some in
several languages. These include the NCBI taxonomy [22], the MeSH hierarchy [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] in multiple
languages, the Gene Ontology, the International Classification of Diseases (ICD) 9 and 10 in
multiple languages, DrugBank, the Logical Observation Identifiers Names and Codes (LOINC) in
several languages, the Medical Dictionary for Regulatory Activities (MedDRA) and the
SNOMED CT terminology, to name a few. When a concept is added to the Metathesaurus, it receives a
unique identifier called the Concept Unique Identifier (CUI). The CUI is used to connect all the
concepts from different source vocabularies that have the same meaning. For example, the
entry Carcinoma of breast from the SNOMEDCT_US<sup>14</sup> terminology and the entry Carcinomas,
Breast from the MeSH vocabulary are associated with the same UMLS CUI, C0678222. The
UMLS Metathesaurus is released twice a year. The current release, 2023AA<sup>15</sup>, contains about 3.31
million concepts and 15.7 million unique concept names from 185 source vocabularies.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Medical Subject Headings (MeSH)</title>
        <p>
          Medical Subject Headings (MeSH) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] is a comprehensive controlled vocabulary used for
indexing, cataloging, and searching for biomedical and health-related information and documents.
MeSH consists of terms or descriptors representing various aspects of biomedical concepts, such
as diseases, anatomy, drugs, and medical procedures. Each MeSH term is assigned a unique
identifier called a MeSH ID or MeSH Heading (MH). Here is an example of an entry from MeSH
listing a term, its MeSH ID and its description:
          <list list-type="bullet">
            <list-item><p>MeSH Term: Hypertension</p></list-item>
            <list-item><p>MeSH ID: D006973</p></list-item>
            <list-item><p>Description: A condition characterized by elevated blood pressure persistently exceeding
140 mm Hg systolic or 90 mm Hg diastolic.</p></list-item>
          </list>
        </p>
        <p>13 Wikidata statistics: https://www.wikidata.org/wiki/Wikidata:Statistics
14 SNOMED-CT, US edition: https://www.nlm.nih.gov/healthit/snomedct/us_edition.html
15 UMLS release: https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/notes.html
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Disease Ontology (DO)</title>
        <p>
          The Disease Ontology (DO) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] is a publicly accessible ontological representation of human
diseases, designed to establish unambiguous disease definitions based on etiological
classifications. Its primary objective is to ensure standardized utilization and incorporation of disease
information in biomedical data annotation. The latest version of the Disease Ontology was
released in June 2023 and contains 11,349 disease terms<sup>16</sup>. Each disease in the Disease Ontology
is assigned a unique identifier called a Disease Ontology ID (DOID). For example, DOID:162
indicates the disease Breast Cancer. The Disease Ontology undergoes biannual updates to its
vocabulary mappings by extracting CUIs from the UMLS MRCONSO.RRF file. In the current
release, 7,075 of the DO terms are mapped to corresponding UMLS CUIs.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Constructing the WikiMed-DE Dataset</title>
      <p>16 Disease Ontology release: https://github.com/DiseaseOntology/HumanDiseaseOntology/releases</p>
      <p>[Figure: diagram with the labels Wikipedia, Entity, UMLS and UMLS Concept [C5200801]]</p>
      <sec id="sec-5-1">
        <title>4.1. Obtaining German Wikipedia Articles</title>
        <p>The WikiMed-DE dataset is based on a recent database dump of the German Wikipedia from
20.06.2023<sup>17</sup>. WikiExtractor<sup>18</sup> was used to extract the article title, page ID, URL and text from
the archive dewiki-20230620-pages-articles-multistream.xml.bz2 and save them in JSON format.
The text of each article contains HTML-encoded hyperlink tags, which
are retained for the annotation step. WikiExtractor outputs roughly 10,000 files from
this archive, each containing on average 450 Wikipedia articles. These files are combined in a
post-processing step into a single JSON file containing 4,579,135 Wikipedia articles.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Mapping Wikipedia Articles to Wikidata</title>
        <p>We map the Wikipedia articles and their mentions to Wikidata QIDs using another file from the
German Wikipedia database dump, namely dewiki-20230620-page_props.sql.gz. This file relates
a page ID from the German Wikipedia to the corresponding Wikidata QID. For instance, the
first entry (1, 'wikibase_item', 'Q734916', NULL) in this file indicates that the Wikidata
QID associated with German Wikipedia page ID 1 is Q734916.</p>
        <p>17 German Wikipedia dump, 20.06.2023: https://dumps.wikimedia.org/dewiki/20230620/
18 WikiExtractor: https://github.com/attardi/wikiextractor</p>
        <p>The result of this mapping step is a CSV file containing the QID and the Wikipedia page
ID for each entry. Among the 4,579,135 Wikipedia articles from the previous step, a total
of 1,754,551 page IDs lack a corresponding QID. This is due to the fact that not all the page
IDs in the archive dewiki-20230620-pages-articles-multistream.xml.bz2 appear in the archive
dewiki-20230620-page_props.sql.gz. In the next steps we focus on the 2,824,584 articles that have
a corresponding QID.</p>
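        <p>The page ID to QID mapping can be illustrated with a short sketch. It is not the implementation from the code repository: the function name parse_page_props and the regular expression are assumptions, and the sample statement is synthetic apart from the first tuple, which is the entry quoted above.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re

# Matches one (page_id, 'wikibase_item', 'Q...', ...) tuple of an INSERT statement.
# Assumption: tuples follow the layout of the example entry quoted in the text.
ROW_PATTERN = re.compile(r"\((\d+),\s*'wikibase_item',\s*'(Q\d+)',[^)]*\)")


def parse_page_props(sql_text):
    """Return a dict mapping Wikipedia page IDs (int) to Wikidata QIDs (str)."""
    mapping = {}
    for page_id, qid in ROW_PATTERN.findall(sql_text):
        mapping[int(page_id)] = qid
    return mapping


# Synthetic sample: only the first tuple is taken from the text, the rest is made up.
sample = (
    "INSERT INTO `page_props` VALUES "
    "(1,'wikibase_item','Q734916',NULL),"
    "(1,'page_image_free','Example.jpg',NULL),"
    "(3,'wikibase_item','Q42',NULL);"
)
page_to_qid = parse_page_props(sample)
print(page_to_qid)
```
]]></preformat>
        <p>Rows with other property names are skipped, so pages without a wikibase_item entry simply do not appear in the resulting mapping.</p>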
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Mapping Wikidata QIDs to Biomedical Concept IDs</title>
        <p>The official Wikidata SPARQL endpoint<sup>19</sup> was used to generate a mapping from QIDs to
biomedical concept IDs. Three properties from Wikidata were targeted:
          <list list-type="bullet">
            <list-item><p>P2892: UMLS CUI<sup>20</sup>, which maps a Wikidata item to its UMLS CUI, if one is available;</p></list-item>
            <list-item><p>P486: MeSH descriptor ID<sup>21</sup>, which maps a Wikidata item to its Medical Subject
Headings identifier, if it has one; and</p></list-item>
            <list-item><p>P699: Disease Ontology ID<sup>22</sup>, which connects a Wikidata item to its ID in the Disease
Ontology, if such a mapping exists.</p></list-item>
          </list>
        </p>
        <p>All three properties feature a single-value constraint<sup>23</sup> in Wikidata, which states that the
property generally contains a single value per item. However, as we will show in Section 4.6,
these constraints are not enforced in Wikidata, making it possible for a Wikidata item to have,
for example, multiple CUIs associated with it. We obtained a mapping from QIDs to 763,859 UMLS
CUIs, 38,607 MeSH IDs and 10,609 DOIDs.</p>
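        <p>As a rough illustration of this step, the sketch below builds one SPARQL query per property and collects the values per QID from a response in the standard SPARQL JSON results layout. The helper names build_query and parse_bindings, the variable names ?item and ?value, and the sample response with placeholder identifiers are assumptions; the actual queries used for WikiMed-DE may differ, and sending the query to the endpoint is deliberately left out.</p>
        <preformat preformat-type="code"><![CDATA[
```python
WIKIDATA_ENTITY_PREFIX = "http://www.wikidata.org/entity/"

# Property IDs named in the text: UMLS CUI, MeSH descriptor ID, Disease Ontology ID.
PROPERTIES = {"cui": "P2892", "mesh": "P486", "doid": "P699"}


def build_query(property_id):
    """Build a SPARQL query returning every item that has a value for the property."""
    return (
        "SELECT ?item ?value WHERE { "
        f"?item wdt:{property_id} ?value . "
        "}"
    )


def parse_bindings(result_json):
    """Collect QID -> list of values; a list is kept because items may carry several values."""
    mapping = {}
    for binding in result_json["results"]["bindings"]:
        qid = binding["item"]["value"].replace(WIKIDATA_ENTITY_PREFIX, "")
        mapping.setdefault(qid, []).append(binding["value"]["value"])
    return mapping


# Hand-written response with placeholder QIDs and CUIs (not real Wikidata content).
fake_response = {
    "results": {
        "bindings": [
            {"item": {"value": WIKIDATA_ENTITY_PREFIX + "Q1"}, "value": {"value": "C0000001"}},
            {"item": {"value": WIKIDATA_ENTITY_PREFIX + "Q1"}, "value": {"value": "C0000002"}},
            {"item": {"value": WIKIDATA_ENTITY_PREFIX + "Q2"}, "value": {"value": "C0000003"}},
        ]
    }
}
qid_to_cuis = parse_bindings(fake_response)
print(build_query(PROPERTIES["cui"]))
print(qid_to_cuis)
```
]]></preformat>
        <p>Storing a list of values per QID keeps items that violate the single-value constraint visible, so that they can be handled in the consolidation step.</p>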
      </sec>
      <sec id="sec-5-4">
        <title>4.4. Filtering Wikipedia Articles</title>
        <p>WikiMed-DE is meant to serve as training material for BEL models. It is therefore important
to filter the German Wikipedia articles and retain only those related to the biomedical domain.
The articles are filtered based on the mapping of QIDs to the three biomedical concept IDs
described in Section 4.3. We retain only those articles whose QID is associated with
at least one of these three biomedical IDs, resulting in 54,514 articles. However, in some cases,
an article has a title, a QID and a valid mapping, but no text; such articles are also filtered
out. At the end of the filtering step there are 53,981 German Wikipedia articles left.</p>
      </sec>
      <sec id="sec-5-5">
        <title>4.5. Mapping Mentions to Wikidata</title>
        <p>For WikiMed-DE to fulfill its goal as a training resource for BEL, the hyperlinked words or phrases in
the WikiMed-DE articles need to be identified and annotated with biomedical concepts. These
are extracted using regular expressions, by identifying HTML-encoded tags in the article text
and decoding them to generate URLs and clean text.</p>
        <p>19 Official Wikidata SPARQL endpoint: https://query.wikidata.org/
20 P2892: UMLS CUI: https://www.wikidata.org/wiki/Property:P2892
21 P486: MeSH descriptor ID: https://www.wikidata.org/wiki/Property:P486
22 P699: Disease Ontology ID: https://www.wikidata.org/wiki/Property:P699
23 Wikidata item for single-value constraint: https://www.wikidata.org/wiki/Q19474404</p>
        <p>In this step we store, for each article, its title, text, URL, and a list of mentions corresponding
to the hyperlinked words or phrases in the text. For each mention we record the text of the
mention and the corresponding URL. We then save a list containing 317,010 unique mention
URLs, which we need to map to Wikipedia page IDs. However, the URLs that are originally
associated with the mentions are different from the URLs needed to obtain the page ID for each
mention. For example, for the page with the title Getreide we need to map from the original
mention URL<sup>24</sup> to the corresponding Wikipedia page information URL<sup>25</sup>.</p>
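        <p>The URL rewriting for the Getreide example can be sketched with the Python standard library. The function name page_info_url is hypothetical, and the sketch assumes that every mention URL has the form https://de.wikipedia.org/wiki/ followed by the page title.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from urllib.parse import unquote, urlencode, urlsplit

WIKI_PATH_PREFIX = "/wiki/"


def page_info_url(mention_url):
    """Turn a /wiki/<title> mention URL into the matching page information URL."""
    parts = urlsplit(mention_url)
    if not parts.path.startswith(WIKI_PATH_PREFIX):
        raise ValueError(f"not a /wiki/ URL: {mention_url}")
    # Decode the title first so that urlencode does not escape it twice.
    title = unquote(parts.path[len(WIKI_PATH_PREFIX):])
    query = urlencode({"title": title, "action": "info"})
    return f"{parts.scheme}://{parts.netloc}/w/index.php?{query}"


info_url = page_info_url("https://de.wikipedia.org/wiki/Getreide")
print(info_url)
```
]]></preformat>
        <p>The page ID itself still has to be read from the page returned for this URL, which is the request described in the next paragraph.</p>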
        <p>To obtain this mapping we send a request for each Wikipedia page information URL and retrieve
the corresponding Wikipedia page ID. In this way, 259,250 mention URLs are successfully matched to a
corresponding page ID. The remaining 57,760 mention URLs yield no page
information because the linked pages do not exist in the German Wikipedia. For example, in Figure
2, the URL corresponding to the red mention Aktin-bindenden Proteinen lacks a corresponding
page ID because it is a link to a page that does not yet exist in the German Wikipedia. Such
cases are relatively frequent, as Wikipedia editors routinely add links to pages that will be
created only at a later point.</p>
        <p>The next step is to map the page IDs corresponding to each mention to QIDs. We use the
previously generated CSV file (from Section 4.2), which contains a mapping from Wikipedia page
IDs to QIDs. Using this resource, 206,549 of the 259,250 page IDs are uniquely mapped
to a QID. The mapping is, however, incomplete, with 52,701 page IDs lacking corresponding
QIDs. We examined a small sample of these URLs and discovered two issues: (i) some hyperlinks
point to sections within the same article, and therefore cannot be mapped to a separate QID,
and (ii) some of the URLs extracted from the hyperlink tags are redirects. For example, the page
Hydrophil<sup>26</sup> redirects to Hydrophilie<sup>27</sup>. The page ID information is not typically stored on the
redirect page, but only on the target URL.</p>
        <p>After the two previous steps, 110,461 mention URLs are still not mapped to their
corresponding QIDs: 57,760 mention URLs lack a page ID and 52,701 mention URLs have a page
ID but no QID. To address the redirection problem, we used the wikipedia<sup>28</sup> Python package
to obtain a page ID for the URLs that were still missing a QID. This package allows one to look
up a Wikipedia page given the title of the page and the language code of the wiki ('de' in our
case). By specifying the flag redirect=True one can also find the page IDs for redirects. We
applied wikipedia’s page function to these 110,461 mention URLs and obtained an extra 46,541
correct mappings to QIDs.</p>
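        <p>A possible shape of this redirect-aware lookup is sketched below. The wikipedia package is imported lazily and only used by the default lookup; the wrapper resolve_page_ids, the choice of auto_suggest=False and the stand-in lookup table with placeholder page IDs are assumptions made for this illustration, not details taken from the WikiMed-DE code.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def lookup_with_wikipedia_package(title):
    """Look up a page ID with the third-party wikipedia package, following redirects."""
    import wikipedia  # imported lazily so the offline demonstration runs without it

    wikipedia.set_lang("de")
    try:
        # auto_suggest=False is an assumption: it avoids silent title rewriting.
        page = wikipedia.page(title, auto_suggest=False, redirect=True)
        return int(page.pageid)
    except wikipedia.exceptions.WikipediaException:
        return None


def resolve_page_ids(titles, lookup=lookup_with_wikipedia_package):
    """Split titles into those with a page ID and those that stay unresolved."""
    resolved, unresolved = {}, []
    for title in titles:
        page_id = lookup(title)
        if page_id is None:
            unresolved.append(title)
        else:
            resolved[title] = page_id
    return resolved, unresolved


# Offline demonstration with a stand-in lookup table (page IDs are placeholders).
fake_index = {"Hydrophil": 101, "Hydrophilie": 101}
resolved, unresolved = resolve_page_ids(
    ["Hydrophil", "Aktin-bindendes Protein"], lookup=fake_index.get
)
print(resolved, unresolved)
```
]]></preformat>
        <p>Passing the lookup as a parameter keeps the bookkeeping testable without network access; in a real run the default lookup would query the German Wikipedia.</p>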
      </sec>
      <sec id="sec-5-6">
        <title>4.6. Integrating the Mappings to Biomedical Concept IDs</title>
        <p>The methodology described in Section 4.3 is used to map the QIDs to three biomedical concept IDs:
the UMLS CUI, the MeSH ID, and the DOID. When a particular Wikidata item contains
statements about biomedical IDs, it will, most frequently, contain a statement about the associated
UMLS CUI of that item. However, in some cases, the Wikidata items contain statements about
their MeSH IDs or their DOIDs, but do not include statements recording their UMLS CUIs.
Consequently, to increase the number of linked biomedical concepts, we extract UMLS CUIs, MeSH
IDs and DOIDs for each article. In the WikiMed-DE dataset, we use the tags wikidata_cui,
mesh and doid to label this information. Figure 2 shows a sample of WikiMed-DE: an article
together with its meta-information and its annotations.</p>
        <p>24 Mention URL for Getreide: https://de.wikipedia.org/wiki/Getreide
25 Wikipedia page info URL for Getreide: https://de.wikipedia.org/w/index.php?title=Getreide&amp;action=info
26 Hydrophil: https://de.wikipedia.org/wiki/Hydrophil
27 Hydrophilie: https://de.wikipedia.org/wiki/Hydrophilie
28 wikipedia Python package: https://pypi.org/project/wikipedia/</p>
        <p>[Figure 2: a sample of WikiMed-DE. Article text shown in the figure: Filamine (FLN) sind Proteine bei Eukaryoten und gehören zu den Aktin-bindenden Proteinen (ABP). Sie sind an der Quervernetzung von Aktinfilamenten, einem Hauptbestandteil des Zytoskeletts, sowie der Vernetzung von Aktinfilamenten mit Proteinen in der Zellmembran beteiligt.]</p>
        <p>The MeSH hierarchy is integrated into the UMLS. Therefore we can use the file MRCONSO.RRF
from the UMLS release to map the MeSH IDs to UMLS CUIs, resulting in a mapping that is
not always unique. In WikiMed-DE, the CUIs mapped from MeSH IDs are saved under the
tag mesh_cui. A similar mapping is also performed for the DOIDs, using the file doid.json
from Disease Ontology’s current release. In WikiMed-DE the CUIs mapped from the DOIDs are
saved as doid_cui.</p>
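        <p>The MeSH-to-CUI lookup can be sketched as a single pass over MRCONSO.RRF, a pipe-delimited file in which the CUI, the source abbreviation (SAB) and the source code (CODE) are the 1st, 12th and 14th fields. The function names and the sample rows are assumptions: the rows are synthetic, use placeholder identifiers and leave all unused columns empty.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict

# Zero-based column positions in MRCONSO.RRF: CUI, source abbreviation, source code.
CUI_COL, SAB_COL, CODE_COL = 0, 11, 13
STR_COL = 14  # the term string, only used to make the synthetic rows readable


def mesh_to_cuis(lines):
    """Map each MeSH ID to the sorted list of CUIs it occurs with in MRCONSO rows."""
    mapping = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) > CODE_COL and fields[SAB_COL] == "MSH":
            mapping[fields[CODE_COL]].add(fields[CUI_COL])
    return {mesh_id: sorted(cuis) for mesh_id, cuis in mapping.items()}


def make_row(cui, sab, code, name):
    """Build a synthetic MRCONSO-style row; unused columns are left empty."""
    fields = [""] * 18
    fields[CUI_COL], fields[SAB_COL], fields[CODE_COL], fields[STR_COL] = cui, sab, code, name
    return "|".join(fields) + "|"


# Placeholder identifiers, chosen so that one MeSH ID maps to two CUIs.
rows = [
    make_row("C0000001", "MSH", "D000001", "placeholder term A"),
    make_row("C0000002", "MSH", "D000001", "placeholder term B"),
    make_row("C0000003", "SNOMEDCT_US", "12345", "placeholder term C"),
]
mesh_cui = mesh_to_cuis(rows)
print(mesh_cui)
```
]]></preformat>
        <p>The example deliberately contains a MeSH ID with two CUIs to show why the resulting mapping is not always unique.</p>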
        <p>Finally, we consolidate the CUI information saved under the tags wikidata_cui, mesh_cui
and doid_cui. If, under all these tags, there is only a single CUI for a mention or an article,
this unique CUI is saved under the tag cui in WikiMed-DE. This unique CUI is further mapped
to one or more semantic type identifiers (TUIs) using the file MRSTY.RRF from the UMLS release. The list of TUIs and the
corresponding semantic type labels are saved under tui and semantic_type, respectively.</p>
        <p>Some mentions will not have a single CUI, but several possible CUIs. There are
several possible reasons for this: either the Wikidata item already maps to multiple CUIs, despite the fact
that the property P2892 has a single-value constraint; or the mapping from MeSH ID to CUIs
resulted in several CUIs; or the consolidation step led to a list of CUIs rather than a single CUI.
In any case, we have no automatic method to choose a single correct CUI among the given ones,
so we record all of them.</p>
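        <p>One way to read the consolidation rule is sketched below: the CUIs found under the three tags are pooled, and a unique CUI is recorded only if exactly one distinct value remains. The function name consolidate_cui and the placeholder CUIs are assumptions, and the pooling interpretation itself is inferred from the description above.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def consolidate_cui(wikidata_cui, mesh_cui, doid_cui):
    """Return the single CUI shared by all sources, or None if it is missing or ambiguous."""
    candidates = set(wikidata_cui) | set(mesh_cui) | set(doid_cui)
    if len(candidates) == 1:
        return next(iter(candidates))
    return None


# Placeholder CUIs: the sources agree, the sources disagree, and no source has a CUI.
agreeing = consolidate_cui(["C0000001"], ["C0000001"], [])
conflicting = consolidate_cui(["C0000001"], ["C0000002"], [])
missing = consolidate_cui([], [], [])
print(agreeing, conflicting, missing)
```
]]></preformat>
        <p>Under this reading, the per-source lists remain available even when no unique CUI can be assigned.</p>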
      </sec>
      <sec id="sec-5-7">
        <title>4.7. Combining All the Information to Create WikiMed-DE</title>
        <p>To obtain the final version of WikiMed-DE, we reprocess each of the articles and add the
information we extracted in the previous steps for each article. The text within each Wikipedia
page is decoded, removing any HTML tags and producing clean text and mention URLs. The
start and end indices for each mention are recorded, thus enabling precise identification of the
mention’s position. In some cases, the extracted start and end positions will not overlap with
natural token boundaries (see Figure 2 for an example). Each mention is associated with its
corresponding QID and all the extracted biomedical information.</p>
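        <p>The decoding of the article text and the computation of mention offsets can be sketched as follows. The sketch assumes that hyperlinks appear as HTML-escaped anchor tags with a percent-encoded target in the href attribute; the function name extract_mentions, the base URL constant and the sample sentence with its link targets are illustrative assumptions.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import html
import re
from urllib.parse import unquote

LINK_PATTERN = re.compile(r'<a href="([^"]*)">(.*?)</a>')
BASE_URL = "https://de.wikipedia.org/wiki/"


def extract_mentions(raw_text):
    """Return (clean_text, mentions) with start/end offsets into the clean text."""
    decoded = html.unescape(raw_text)
    clean_parts, mentions = [], []
    position = 0      # read position in the decoded text
    clean_length = 0  # number of characters written to the clean text so far
    for match in LINK_PATTERN.finditer(decoded):
        before = decoded[position:match.start()]
        clean_parts.append(before)
        clean_length += len(before)
        anchor = match.group(2)
        mentions.append({
            "text": anchor,
            "start": clean_length,
            "end": clean_length + len(anchor),
            "url": BASE_URL + unquote(match.group(1)).replace(" ", "_"),
        })
        clean_parts.append(anchor)
        clean_length += len(anchor)
        position = match.end()
    clean_parts.append(decoded[position:])
    return "".join(clean_parts), mentions


# Illustrative input; the link targets are made up for this example.
raw = (
    'Filamine sind &lt;a href="Protein"&gt;Proteine&lt;/a&gt; bei '
    '&lt;a href="Eukaryoten"&gt;Eukaryoten&lt;/a&gt;.'
)
clean_text, mentions = extract_mentions(raw)
print(clean_text)
print(mentions)
```
]]></preformat>
        <p>Because the offsets are counted while the clean text is assembled, slicing the clean text with a mention's start and end indices returns the mention text.</p>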
        <p>At the end of the extraction process, WikiMed-DE consists of a list of German Wikipedia
articles. For each article we save the article’s title, text, URL, QID, the biomedical indices (CUI,
TUIs, semantic type labels, Wikidata CUI, MeSH ID, MeSH-derived CUI, DOID and
DOID-derived CUI) and a mention list, where the mentions correspond to the hyperlinked words or
phrases in the text. For each mention we record the start and end indices, URL, QID and the
biomedical indices (same as for the article).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. WikiMed-DE Dataset Statistics</title>
      <p>WikiMed-DE is a dataset consisting of 53,981 German Wikipedia articles, each containing
multiple mentions. The filtering step described in Section 4.4 ensures that all the articles in the
dataset describe biomedical concepts mentioned either in the UMLS, in the MeSH hierarchy or
in the Disease Ontology. The WikiMed-DE articles and the mentions therein are mapped to
QIDs and to biomedical concept IDs. We can therefore analyze the dataset at two levels: at the
article level and at the mention level.</p>
      <sec id="sec-6-1">
        <title>5.1. Statistics at the Mention Level</title>
        <p>WikiMed-DE contains a total of 1,951,081 mentions corresponding to 317,010 unique mention
URLs. As shown in Table 1, 95.79% of these mentions have an assigned QID; the rest
have either missing links or missing QID information. CUI information was assigned for
29.59% of the mentions. Both directly through Wikidata and through the MeSH ID we
obtained a larger share of assigned CUIs (38.30% and 35.46%, respectively). However, some
of these mentions have multiple CUIs assigned and are therefore ambiguous from the point
of view of biomedical entity linking. The disambiguation step is non-trivial and cannot be
done without the help of experts, so we decided to keep all the information in the dataset.
Disease Ontology information is available only for a small percentage of the mentions.</p>
        <p>Of the 317,010 unique mention URLs, 79.82% have an assigned QID and 14.94% have an assigned
CUI. WikiMed-DE therefore contains links to 47,380 unique biomedical concepts. The number
of unique CUIs is much smaller than the number of QIDs. However, we believe that a large
number of the items that have a QID in this dataset will, at some point, be connected to the
UMLS and assigned a CUI. This is because many of the linked entities are biomedical
entities that are not yet marked as such in Wikidata, or are marked using other biomedical
IDs (e.g. P351: Entrez Gene ID). By including the QID information in the dataset, we give
the research community the possibility to customize the dataset annotations to include other
relevant biomedical information.</p>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Statistics at the Article Level</title>
        <p>For the convenience of the researchers interested only in a biomedical entity linking benchmark
we provide a curated subset, called WikiMed-DE-BEL, which focuses exclusively on the mentions
that have an unique UMLS CUI associated to them. WikiMed-DE-BEL contains 413,913 mentions
corresponding to 35,012 unique mention URLs. All mentions are automatically annotated with a
single CUI. Mentions annotated with multiple CUIs are discarded. We also discard any mentions
where the mention start or end index does not coincide with the start or end of a token. The
dataset was divided into train, test and development splits using an 80-10-10 ratio — leading
to 43,184 train articles, 5,399 test articles and 5,398 dev articles. The dataset portions contain
330,233 (train), 41,120 (test) and 42,560 (dev) mentions, annotated with 222,247 (train), 9,123
(test) and 9,149 (dev) unique UMLS CUIs, respectively. 833 (9.12%) of the concepts in the test set
do not occur in training.</p>
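        <p>The filtering rules and the unseen-concept statistic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used to build the dataset: the mention layout (a dict with start, end and cuis), the function names and the CUIs in the toy example are all hypothetical, and token alignment is interpreted as "the mention starts at a token start and ends at a token end".</p>

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def is_token_aligned(start: int, end: int, token_spans: List[Span]) -> bool:
    """Check that a mention starts at a token start and ends at a token end."""
    token_starts = {s for s, _ in token_spans}
    token_ends = {e for _, e in token_spans}
    return start in token_starts and end in token_ends


def keep_mention(mention: Dict, token_spans: List[Span]) -> bool:
    """Keep a mention only if it has exactly one CUI and is token-aligned."""
    has_single_cui = len(set(mention["cuis"])) == 1
    return has_single_cui and is_token_aligned(
        mention["start"], mention["end"], token_spans
    )


def unseen_share(train_cuis: Set[str], test_cuis: Set[str]) -> float:
    """Fraction of test-set concepts that never occur in the training set."""
    if not test_cuis:
        return 0.0
    return len(test_cuis - train_cuis) / len(test_cuis)


# Toy example with made-up CUIs; whitespace tokens of "Die Lungenentzündung ist".
tokens = [(0, 3), (4, 20), (21, 24)]
mentions = [
    {"start": 4, "end": 20, "cuis": ["C0000001"]},              # kept
    {"start": 4, "end": 10, "cuis": ["C0000001"]},              # ends inside a token
    {"start": 4, "end": 20, "cuis": ["C0000001", "C0000002"]},  # ambiguous
]
kept = [m for m in mentions if keep_mention(m, tokens)]
print(len(kept))
print(unseen_share({"C0000001", "C0000002"}, {"C0000002", "C0000003"}))
```

        <p>Applied to the counts reported above, 833 unseen concepts out of 9,123 test concepts corresponds to a share of roughly 9.1%.</p>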
      </sec>
      <sec id="sec-6-3">
        <title>5.3. WikiMed-DE-BEL quality</title>
        <p>To assess the data quality of WikiMed-DE-BEL, a sample of 50 mentions annotated with a single
CUI was randomly selected from the dataset. One of the authors checked the automatically
annotated CUI for each mention, comparing it to the information available in the UMLS, using
the UMLS Metathesaurus Browser<sup>29</sup>. The information was also compared to the context available
in the Wikipedia article. All 50 mentions were found to link to the correct concept and
to match their context accurately. This indicates that the strict filtering of problematic instances
(e.g. mentions with multiple CUIs, or with missing links or QIDs) led to the creation of a
high-quality dataset.</p>
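        <p>A spot check of this kind can be made reproducible, and its result put into perspective, with a few lines of code. The sketch below is an illustration only: the function names and the seed are assumptions, and the bound is a standard binomial argument rather than an analysis reported by the authors. It draws a fixed-seed random sample and computes the largest error rate that would still be compatible, at a given confidence level, with observing no errors in n checked items.</p>

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def draw_sample(items: Sequence[T], k: int, seed: int) -> List[T]:
    """Draw k distinct items with a fixed seed so the sample can be re-created."""
    rng = random.Random(seed)
    return rng.sample(items, k)


def max_error_rate_if_no_errors(n: int, confidence: float = 0.95) -> float:
    """Upper bound on the error rate p after n error-free checks.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


# Hypothetical mention identifiers standing in for the real dataset.
mention_ids = range(413913)
sample = draw_sample(mention_ids, 50, seed=0)
print(len(sample))
print(round(max_error_rate_if_no_errors(50), 3))
```

        <p>For n = 50 the bound is about 5.8% at the 95% level, so an error-free sample of this size is consistent with a high-quality dataset but does not by itself rule out a small share of incorrect links.</p>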
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Limitations and Conclusion</title>
      <p>Accuracy of Information: WikiMed-DE’s annotation process relies heavily on the accuracy
and completeness of the information available in Wikipedia and Wikidata. However, these sources
are not immune to errors, inconsistencies, or vandalism. As a result, inaccuracies or outdated
information present in the source material can propagate into the annotations in WikiMed-DE.</p>
      <p>Noise and Ambiguity: Automated annotation processes can introduce noise and ambiguity into
the dataset. The automated methods used to match Wikipedia articles with biomedical concepts
may still fail to fully disambiguate mentions: for example, we cannot
systematically choose a single CUI for an entity if multiple CUIs are annotated in Wikidata. We
try to limit the noise and ambiguity by enforcing stricter constraints, e.g., we only consider as
valid annotations the mentions with a unique CUI mapping. Even so, the dataset remains
a silver-standard dataset, since not all the proposed annotations were manually verified by
domain experts.</p>
      <p>Link Coverage: Not all entities mentioned on a Wikipedia page are marked with
hyperlinks, meaning that many possible mentions will not be annotated. Furthermore, because
we focus on the quality of annotations, we also end up discarding a portion of the marked
hyperlinks. This leads to a dataset that has a lower mention coverage than a typical biomedical
dataset. WikiMed-DE is therefore less suited for biomedical named entity recognition tasks.
However, we believe that it is a useful resource for training BEL systems, and it has great
potential to be further developed as new information is added to Wikidata.</p>
      <p>This paper presented a new resource for disambiguating biomedical entities in German,
WikiMed-DE, and a benchmark dataset for biomedical entity linking in German,
WikiMed-DE-BEL, thus supporting biomedical entity linking research focusing on the German language.</p>
      <p><sup>29</sup> UMLS Metathesaurus Browser: https://uts.nlm.nih.gov/uts/umls/home</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>Yi Wang and Corina Dima were supported by the Ministry for Economics, Labour and Tourism
from Baden-Württemberg, Germany via grant agreement number BW1_1456 (AI4MedCode).</p>
    </sec>
  </body>
  <back>
  </back>
</article>