<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anastasia Zhukova∗</string-name>
          <email>zhukova@uni-wuppertal.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felix Hamborg†</string-name>
          <email>felix.hamborg@uni-konstanz.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bela Gipp∗</string-name>
          <email>gipp@uni-wuppertal.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Konstanz</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Wuppertal</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>5</fpage>
      <lpage>14</lpage>
      <abstract>
        <p>Named entity recognition (NER) is an important task that aims to resolve universal categories of named entities, e.g., persons, locations, organizations, and times. Despite its widespread use, NER is barely applicable in domains where these general categories are suboptimal, such as engineering or medicine. To facilitate NER of domain-specific types, we propose ANEA, an automated (named) entity annotator that assists human annotators in creating domain-specific NER corpora for German text collections when given a set of domain-specific texts. In our evaluation, we find that ANEA automatically identifies terms that best represent the texts' content, identifies groups of coherent terms, and extracts and assigns descriptive labels to these groups, i.e., annotates text datasets into the domain (named) entities.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Named entity recognition (NER), a common preprocessing step
in natural language processing (NLP) for various tasks, such as
information extraction, summarization, question answering,
and text understanding, is often criticized for being able to represent
only datasets with a few general categories, e.g., person, location,
organization, and time (including their subcategories) [
        <xref ref-type="bibr" rid="ref30 ref5">5, 30</xref>
        ]. While
the original NER task contains only a few categories, a rapidly
increasing number of NER applications shows a high demand for
datasets with domain-specific named entities [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        To (semi-)automatically create large general-purpose NER
corpora, recent research projects extensively use structured domain
sources, such as dictionaries, knowledge graphs, and Wikipedia or
other knowledge bases [
        <xref ref-type="bibr" rid="ref17 ref24 ref31">17, 24, 31</xref>
        ].
      </p>
      <p>In this paper, we propose ANEA, an unsupervised
Wiktionary-based approach that automatically derives domain entities from
German texts, i.e., a lower-resource language, by (1) extracting terms from
topically related domain texts, (2) identifying the most
domain-representative, i.e., semantically distinct, terms of the analyzed
texts, and (3) automatically annotating the terms, i.e., ANEA
extracts labels from Wiktionary and assigns them to the identified
groups of terms. Not all of the domain categories may be named,
e.g., machinery or process.</p>
      <p>By automating the most labor-intensive parts, the proposed
unsupervised approach minimizes the cost of the expensive and
laborious annotations required for the creation of domain-specific NER
datasets. Typical manual annotation tasks include (1) reading
the domain text multiple times, (2) deriving entities based on the
text content, and (3) manually selecting terms that match the
derived categories. ANEA substitutes the most time-consuming task
of deriving a coding book and automatically defines categories
and annotates the most representative terms (nouns) into these
categories. We evaluate the approach with user studies on multiple
domain datasets against multiple silver datasets and discuss a
default input configuration for ANEA to annotate other domain NER
datasets (source code: https://github.com/anastasia-zhukova/ANEA).</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        NER datasets usually contain standard types, e.g., person, location,
organization, and are manually annotated [
        <xref ref-type="bibr" rid="ref1 ref21">1, 21</xref>
        ] or automatically
extracted [
        <xref ref-type="bibr" rid="ref17 ref24 ref31">17, 24, 31</xref>
        ]. Domain-specific NER typically needs to
introduce domain-specific (sub-)categories of the established named
entity (NE) categories or entirely new categories. This is because
domain-specific texts contain NE categories that are (1) detailed
variants of the standard NE categories, e.g., “Person” is replaced
with the domain-specific sub-categories “Players” and “Coaches”
[
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], (2) standard NE categories extended with a small number of
new categories, e.g., “Trigger of a traffic jam” [
        <xref ref-type="bibr" rid="ref11 ref19 ref22">11, 19, 22</xref>
        ], and (3)
domain-derived NE categories, e.g., “Proteins” in biology or
“Reactions” in chemistry domains [
        <xref ref-type="bibr" rid="ref18 ref25 ref30 ref9">9, 18, 25, 30</xref>
        ]. Most domain-derived NE
categories originate from structured classifications or dictionaries
[
        <xref ref-type="bibr" rid="ref12 ref25 ref9">9, 12, 25</xref>
        ] or are derived by manually unifying multiple of them [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
In sum, creating domain-specific datasets for NER requires expert
knowledge and is time-consuming.
      </p>
      <p>
        To minimize such efforts, some NER approaches use seed-NEs,
i.e., a small number of manually provided terms and their
NE categories [
        <xref ref-type="bibr" rid="ref16 ref30 ref7">7, 16, 30</xref>
        ]. Such approaches use the seed-NEs as
examples to extract patterns of NE definitions and apply them to the
full text suggested for annotation. These NER approaches suffer
from the slow updates of the underlying domain knowledge bases
(KBs) [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] and perform worse on lower-resource languages than
on English [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. An alternative to domain KBs are community KBs,
such as Wikipedia and Wiktionary, which are constantly updated
by their communities. They have been shown to contain a sufficient amount of
domain information [
        <xref ref-type="bibr" rid="ref14 ref29">14, 29</xref>
        ].
Unlike the existing supervised approaches for annotating
domain-specific named entities [
        <xref ref-type="bibr" rid="ref13 ref6">6, 13</xref>
        ], in this paper, we explore ANEA, an
unsupervised method to support researchers and users during the
creation of a coding book. Given a set of domain- or use-case-specific
documents, ANEA automatically derives domain-specific categories
and exemplary terms within them. This way, ANEA automates the most
time-intensive, previously manual tasks. As a consequence, users
only need to revise these terms, e.g., by renaming the categories or
re-annotating terms that do not match their categories.
      </p>
    </sec>
    <sec id="sec-3">
      <title>METHODOLOGY</title>
      <p>
        We propose an unsupervised approach for the annotation of
domain-specific (named) entities (ANEA) in a lower-resource language. The
goal of ANEA is to fully automatically derive entity categories (later
in the text: categories) by selecting groups of related terms and
extracting and assigning a meaningful label to these terms. To do so, ANEA
first links terms extracted from domain-specific texts to pages in
Wiktionary [
        <xref ref-type="bibr" rid="ref14 ref24">14, 24</xref>
        ]. Second, ANEA automatically identifies groups
of related terms and automatically labels them by performing a
double optimization task that maximizes both the cross-similarity
of terms in a group and the average similarity of these terms to a
candidate label. That is, the approach consists of two main steps: (1)
text preprocessing, i.e., term mapping to Wiktionary pages (WPs)
and construction of a domain graph, and (2) identification of related
terms and label assignment.
      </p>
      <p>3.1 Preprocessing and domain graph. 3.1.1 Preprocessing. The goal of preprocessing is to extract terms
from the set of texts and maximize the number of terms aligned to
the Wiktionary structure, i.e., to map the domain-specific terminology
to the structured knowledge base. Mapping the extracted
terms to the knowledge graph enables using their semantic
information, such as term definitions, areas, hypernyms, and hyponyms
(see Figure 1).</p>
      <p>
        The preprocessing steps include parsing and part-of-speech
(POS) tagging using spaCy [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. We define a term as any unique
noun phrase that does not contain any digits [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. After extraction,
terms are mapped to their respective German WPs, if any, i.e., each
term is assigned the link of a WP.
      </p>
      <p>In German texts, we find that many domain-specific terms are
compound words, i.e., words that consist of more than one noun
component, for example, “Sechszylindermotor” = “sechs” +
“Zylinder” + “Motor” (six-cylinder motor). Typically, such complex domain
compound words are not described in Wiktionary since they are
too rare or specific. In contrast, we observe that compound
words’ heads, i.e., the part of a compound term that bears the
core meaning of the phrase, e.g., “Motor” in the example above, are
highly likely to have a WP in Wiktionary.</p>
      <p>
        To map rare domain-specific terms to WPs, we extract the heads
of the extracted terms with a compound splitter, i.e., a model that
splits terms into two parts, a compound part and a head part [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], and
attempt to map the heads to Wiktionary. If a (multi-token) term
has a corresponding WP, we set the full term as the term’s head. If the
compound splitter outputs a head that is not part of Wiktionary,
we continue to recursively search for a head that can be mapped to a
WP. If no heads have corresponding WPs, then we do not assign a
head to the term.
      </p>
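      <p>The recursive head search can be sketched as follows. This is a minimal illustration rather than ANEA’s actual code; `split_compound` and `has_page` are hypothetical stand-ins for the compound splitter [26] and a Wiktionary page lookup:
```python
def map_head(term, split_compound, has_page):
    """Return the Wiktionary-mappable head of `term`, or None."""
    if has_page(term):
        # A term with its own WP is its own head.
        return term
    parts = split_compound(term)  # (compound-part, head-part) or None
    if parts is None:
        return None
    _, head = parts
    # Recurse on the head until a WP is found or splitting fails.
    return map_head(head, split_compound, has_page)
```
For "Sechszylindermotor", the splitter would yield the head "Motor", which has a WP and therefore becomes the term’s head.
      </p>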
      <p>The preprocessing could be extended to include terms with
digits, but for now, we focus on noun phrases as terms. If a
term or the head of its compound phrase does not have a WP, it is
excluded from annotation because the absence of a link to Wiktionary
makes it impossible to map the term to potential category labels. Later,
such discarded terms can be manually classified by human
annotators or automatically with state-of-the-art NER models trained on
the automatically created domain datasets.</p>
      <p>
        We use fastText to vectorize extracted terms and candidate
labels [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We chose fastText due to its ability to vectorize
out-of-vocabulary words, which occur often in domain-specific
terminology.
3.1.2 Domain graph. The domain graph is a locally stored
knowledge graph whose leaves are the extracted domain terms. Nodes
are all terms obtained from the WPs linked to the leaves and to
each other with hyponymy-hypernymy relations. Figure 2a depicts
the principle of a domain graph. The construction of the domain
graph includes three steps: (1) graph initialization, i.e., extraction
of the WP properties, e.g., definitions, by scraping the WPs
assigned to the domain terms and their heads; (2) determination of
pruning criteria for the Wiktionary graph to scrape only domain-related
pages; (3) expansion of the domain graph, i.e., scraping of the
hypernym pages, to create a pool of candidate labels with which to later
annotate the identified groups of terms. Figure 2b shows the process of
domain graph construction.
3.1.3 Initialization. To initialize the graph, we take the extracted
terms to which we mapped WPs and scrape the mapped WPs to
extract the WPs’ properties, e.g., hypernyms. As a preliminary step of
the graph initialization, we group the extracted terms by their heads.
The head grouping aims at the extraction of the initial
hyponym-hypernym relations for the domain graph. Then, we sort the list of
heads in decreasing order by (1) the number of unique terms with
each head and (2) the frequency of the overall in-text occurrence of
words with that head.
      </p>
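      <p>The head grouping and sorting step can be sketched as follows; the data layout (a mapping from term to its head and in-text frequency) is illustrative, not ANEA’s actual API:
```python
from collections import defaultdict

def sort_heads(terms):
    """terms: term -> (head, in-text frequency).
    Return heads sorted by (1) number of unique terms,
    then (2) total in-text frequency, both decreasing."""
    groups = defaultdict(list)
    for term, (head, freq) in terms.items():
        groups[head].append((term, freq))
    return sorted(groups,
                  key=lambda h: (len(groups[h]), sum(f for _, f in groups[h])),
                  reverse=True)
```
A head with two distinct terms thus ranks above a head with a single, even very frequent, term.
      </p>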
      <p>To maximize the descriptiveness and generalization of the terms
that will be annotated into categories, we initialize the
domain graph with the terms-to-annotate that belong to the top-N
largest head groups, i.e., those containing the most lexically diverse and/or
frequent terms. Section 4 determines optimal values for the number of
terms-to-annotate and the largest head groups through a series of experiments. This
filtering procedure reduces the size of the domain graph to
minimize the execution time and extract the most representative
candidate labels, i.e., the most closely located hypernyms.</p>
      <p>Each term without a hyponym is a leaf of the domain graph;
a node is a head that aggregates more than one term. We scrape the
WPs of all leaf- and node-terms to extract the text and links from
their definitions, hypernyms, and hyponyms (see Figure 1).</p>
      <p>We extract hyponym terms from the corresponding WP’s section.
We extract hypernym terms from two parts of a WP: (1) the hypernym
section and (2) the definition section, by parsing the text of a term’s
definitions and ensuring that the extracted word has its own WP (we
extract the tokens that have one of the following dependency tags:
“ROOT”, “oa” = accusative object, “oa2” = second accusative object,
“app” = apposition, “cj” = conjunct). For
example, in Figure 1, the word “Maschine” will be extracted as
an additional (in-text) hypernym to those listed in the hypernym
section.</p>
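      <p>The in-text hypernym extraction reduces to a filter over the dependency parse of a definition. A minimal sketch, assuming the definition is given as (token, dependency-tag) pairs (ANEA itself works on spaCy’s parse; the example sentence fragment is invented):
```python
# Dependency tags whose tokens are treated as in-text hypernym candidates.
HYPERNYM_DEPS = {"ROOT", "oa", "oa2", "app", "cj"}

def in_text_hypernyms(parsed_definition, has_page):
    """parsed_definition: list of (token, dependency-tag) pairs.
    Keep tokens with a hypernym-indicating tag and their own WP."""
    return [tok for tok, dep in parsed_definition
            if dep in HYPERNYM_DEPS and has_page(tok)]
```
      </p>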
      <p>The extracted properties are assigned to each node. The
hypernyms’ links point at WPs that may later become nodes of the
domain graph. Extraction and assignment of the WPs’ properties
bridge the domain terms and heads to Wiktionary’s knowledge
graph.
3.1.4 A priori pruning. Most terms on WPs have more than
one sense, and some of them may be associated with different
semantic areas, e.g., technology, medicine, sport, law, etc. If a sense
belongs to only one area, the title of the area precedes the definition
explaining the sense, e.g., “Technik” in Figure 1. The a priori
pruning step determines which senses of yet-to-be-added hypernyms need to
comply with the senses of the previously added terms. Hypernyms
become properties of a node in the domain graph if and only if
their areas belong to a predefined list of areas or if they do not list
any domain areas.</p>
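      <p>The pruning rule amounts to a small predicate over a sense’s listed areas. A sketch under one reading of the rule (a sense is kept if it lists no areas at all, or at least one of its areas is a domain area; names are illustrative):
```python
def keep_sense(sense_areas, domain_areas):
    """Keep a scraped sense if it lists no semantic areas,
    or if at least one of its areas is a known domain area."""
    return not sense_areas or any(a in domain_areas for a in sense_areas)
```
      </p>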
      <p>
        To identify which of Wiktionary’s areas determine the graph’s
domain, we select the most frequent and semantically similar areas
extracted from the senses’ definitions of the previously added leaves
and nodes. To find the most semantically similar and frequent areas,
we cluster all area titles using hierarchical clustering and select the
areas from the most representative clusters. As parameters of the
hierarchical clustering, we use Euclidean distance, the average linkage
criterion, and an optimized number of clusters. To represent the areas’
titles in the vector space, we apply the fastText word embedding
model [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>We extract the three most representative clusters by: (1) selecting
the cluster C_sim with the highest average cosine similarity across all
words in a cluster; (2) selecting the cluster C_count with the highest
count of words in a cluster; (3) forming an extra cluster C_freq with the n
most frequent areas (n = 5). To identify the Wiktionary areas
A forming the domain of the graph, we intersect all three
representative clusters: A = {a : a ∈ C_sim ∩ C_count ∩ C_freq}. Finally, we select
the clustering configuration that outputs the best domain-defining
areas as A* = arg max_{6 ≤ k ≤ 12} (sim(A_k) · freq(A_k)), where A_k are the
areas identified at k clusters, sim(A_k) is the average cross-similarity
of the areas in A_k, and freq(A_k) is the sum of the areas’ frequencies.
3.1.5 Graph growing. The goal of ANEA is to assign the most
generalizing yet still representative labels to groups of semantically
related terms, e.g., “Person” for “Trump” and “Einstein.” To ensure
the generalization property of label candidates, we “grow”, i.e.,
expand, the domain graph upward by adding new nodes at the top of the
graph from the scraped hypernym WPs.</p>
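      <p>The selection of domain-defining areas and the scoring of a clustering configuration can be sketched as follows; the similarity and frequency functions are stand-ins for the fastText-based values, and the cluster inputs are illustrative:
```python
def domain_areas(clusters, freq, avg_sim, n=5):
    """Intersect the most coherent cluster, the largest cluster,
    and the n most frequent areas."""
    c_sim = max(clusters, key=avg_sim)        # highest avg. cross-similarity
    c_count = max(clusters, key=len)          # highest word count
    c_freq = sorted(freq, key=freq.get, reverse=True)[:n]  # n most frequent
    return set(c_sim).intersection(c_count).intersection(c_freq)

def config_score(areas, freq, avg_sim):
    """Score one clustering configuration: coherence times coverage."""
    if not areas:
        return 0.0
    return avg_sim(areas) * sum(freq[a] for a in areas)
```
The configuration with 6 to 12 clusters that maximizes `config_score` would be kept.
      </p>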
      <p>To grow the graph, we iterate over the top nodes and create
new nodes for each of the hypernym terms. To obtain a node’s
properties, we scrape the WPs of the hypernym terms and extract term
definitions, hyponyms, and hypernyms. For each new node, we
add hypernym-hyponym edges between this new node and the
matching previously added nodes while also removing any edges
that would create cycles in the domain graph. To avoid over-generalizing
candidate labels, we perform only one or two iterations of graph
growing.</p>
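      <p>Keeping the growing graph acyclic reduces to a reachability check before each edge insertion. A minimal sketch on a plain edge dictionary (child term mapped to its set of hypernyms; not ANEA’s actual data structure):
```python
def reachable(edges, src, dst):
    """Depth-first check whether `dst` is reachable from `src`."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return False

def add_hypernym_edge(edges, term, hypernym):
    """Add a hyponym-to-hypernym edge unless it would close a cycle."""
    if reachable(edges, hypernym, term):
        return False
    edges.setdefault(term, set()).add(hypernym)
    return True
```
      </p>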
    </sec>
    <sec id="sec-4">
      <title>3.2 Automated term grouping and labeling</title>
      <p>The goal of ANEA is to obtain (named) entity categories, i.e., a few
clusters of generally related terms with high cross-similarity, and to
assign descriptive labels to these clusters. To do so, we maximize
two parameters at the same time: the cross-term group similarity and
the similarity between a group of terms and a label.</p>
      <p>ANEA consists of the initial setup of the categories and three
subsequent optimization steps to improve the representativeness
of the terms and the assigned labels in the groups of terms.
3.2.1 Setup. We initialize ANEA by collecting all candidate
categories, i.e., groups of potentially related terms and the labels
assigned to them. Figure 3 depicts the process of candidate label
collection.</p>
      <p>First, we iterate over all term-nodes, i.e., the domain graph’s
leaves and nodes that were created from the extracted terms, not
from the hypernym WPs. For each term-node, we collect candidate
labels extracted from the names of its hypernym-nodes. Each term
obtains a list of candidate labels with various distances, i.e., the
number of edges between a term-node and a label-node. We
recursively traverse the domain graph as long as the distance d between
a term-node and a label-node satisfies d ≤ d_max, with d_max = 5, and there
are hypernyms to the current node in the domain graph. During the
experiments, we noticed that label-nodes at larger distances are
often rather abstract and do not characterize a term well.</p>
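      <p>The bounded upward traversal can be sketched as a breadth-first walk over hypernym edges, recording for each candidate label its shortest distance from the term (data layout illustrative):
```python
def candidate_labels(term, hypernyms, d_max=5):
    """Map each label reachable from `term` within d_max hypernym
    edges to its shortest distance."""
    labels, frontier, d = {}, {term}, 0
    while frontier and d_max > d:
        d += 1
        frontier = {h for n in frontier for h in hypernyms.get(n, ())}
        for h in frontier:
            labels.setdefault(h, d)  # keep the shortest distance only
    return labels
```
      </p>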
      <p>Second, we “transpose” all terms and their candidate labels to
obtain one label assigned to a group of terms. That is, we create
a collection of categories among which we seek to find the most
representative categories of the analyzed domain-specific text
collection.</p>
      <p>The selection of the optimal categories among the
candidates is a double-optimization process with two requirements:
generalization and specification. On the one hand, generalization
aims at covering categories’ broader semantics, e.g., a category
with a more general label “Person” is better than categories such
as “Actor,” “Politician,” etc. On the other hand, specification aims at
selecting the category with narrower semantics, e.g., categories
such as “Country,” “City,” and “State” provide more details about their
terms than a category “Location”.</p>
      <p>We use a quality score q to evaluate each (entity) category EC
in a list of candidates:</p>
      <p>q = sim_t · sim_l · sim · max(log2 |EC|, 1) · dist_avg,
where sim_t is the mean cross-term cosine similarity; sim_l is the mean
label-terms similarity; sim is the overall similarity, i.e., sim = sim_t + sim_l;
|EC| is the size of a category, i.e., the number of terms in the class;
dist_avg is the average of the non-zero distances between the category’s
terms and its label l: dist_avg = (1/|D_l|) Σ_{d ∈ D_l} d, where D_l =
{d_{t,l} | ∀t ∈ EC : d_{t,l} &gt; 0} and D is a distance matrix whose rows
are the names of term-nodes and whose columns are the names of term-nodes
and label-nodes (the columns also contain the term-nodes because some
term-nodes may not be leaves but nodes of the domain graph, see Figure 2).
If |D_l| = 0, then dist_avg = 1.</p>
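      <p>The quality score can be written out directly; this sketch spells out the symbols as function arguments (mean cross-term similarity, mean label-terms similarity, category size, and the list of term-label distances):
```python
from math import log2

def quality(sim_t, sim_l, size, distances):
    """Quality score of one candidate category.
    sim_t: mean cross-term cosine similarity
    sim_l: mean label-terms cosine similarity
    size:  number of terms in the category (at least 1)
    distances: term-to-label distances in the domain graph"""
    nonzero = [d for d in distances if d > 0]
    dist_avg = sum(nonzero) / len(nonzero) if nonzero else 1.0
    sim = sim_t + sim_l  # overall similarity
    return sim_t * sim_l * sim * max(log2(size), 1.0) * dist_avg
```
      </p>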
      <p>
        To calculate cosine similarities, we represent each term and label
in a vector space with fastText word embeddings [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We chose
fastText for its representation of out-of-vocabulary words,
which often occur in domain-specific texts.
      </p>
      <p>Requiring a large mean cross-term cosine similarity sim_t increases
the specificity of a category. Typically, the smaller the number of
related terms, the larger the mean cross-similarity is. A larger mean
label-terms similarity sim_l also increases the specificity, i.e., a large
similarity value is equivalent to a narrow descriptiveness of the
terms by a label.</p>
      <p>The overall sum sim facilitates balancing potentially small values
of either sim_t or sim_l if the other term is still large. A large size of
a category increases its generalizing and descriptive properties,
i.e., one label needs to describe as many terms as possible. Lastly,
the average distance dist_avg acts as an amplifying factor for the
generalization: the higher the label l is in the domain graph, the
more general is its meaning to the terms in this category.</p>
      <p>Before the optimization steps, we filter the
candidate categories to remove low-quality categories from the
candidate list. We remove an EC if: (1) sim_t &lt; 0.2, (2) sim_l &lt; 0.3, (3)
|EC| &gt; 0.15 · |T| ∨ |EC| &lt; 5, where |T| is the number of
terms-to-annotate, i.e., the number of term-nodes in the domain graph.
In other words, we remove too vaguely related, very large, or very small
categories.
3.2.2 Resolution of full overlaps. Figure 4a depicts that if two
categories have the same terms but different labels, we sort the categories
by their quality scores q and keep the category with the highest
q.
3.2.3 Resolution of substantial overlaps. Typically, categories have
overlaps between their terms, although we find that the cross-term and
terms-label combinations of one category are more semantically
coherent than the combinations of another category. We define that
categories have a substantial overlap if they share more than 50% of
their terms.</p>
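      <p>The pre-optimization filter is a small predicate over the thresholds listed above (0.2, 0.3, 15% of the terms-to-annotate, and a minimum of 5 terms); this sketch is our reading of the filter, with the size condition interpreted as an or:
```python
def keep_category(sim_t, sim_l, size, n_terms):
    """Keep a candidate category only if it is sufficiently coherent
    and neither very large (above 15% of all terms-to-annotate)
    nor very small (below 5 terms)."""
    well_related = sim_t >= 0.2 and sim_l >= 0.3
    good_size = size >= 5 and 0.15 * n_terms >= size
    return well_related and good_size
```
      </p>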
      <p>Figure 4b depicts the process of conflict resolution. We construct
a matrix of replacements R, i.e., a matrix indicating the quality
of a category measured by q compared to the categories with which it
substantially overlaps (the values of R are initialized with 0). The matrix is
used to identify whether an EC contains the best terms-label combination
or there is a better EC to replace it. Since ANEA’s goal is to
annotate as many generally related terms as possible yet find
categories that are as specific as possible, we challenge both the size of an EC
and its descriptive properties.</p>
      <p>First, we sort the categories by their q score in decreasing order.
Second, we intersect all categories with each other. If |EC_i ∩ EC_j| ≥
0.5 · |EC_i|, then we consider the overlap substantial and add the
quality score q_j to the matrix of replacements as the value R_{i,j}. Note
that the matrix is square but asymmetric because we calculate
50% of EC_i’s size and not a pairwise function of two categories,
e.g., min(|EC_i|, |EC_j|).</p>
      <p>Finally, for each EC_i, represented by a row i in R, we
select a replacement EC_j:</p>
      <p>r(EC_i) = {EC_j | ∃ j :
arg max R_i = j ∧ arg max R_j = j}.
That is, we call EC_j a replacement for EC_i if EC_j is the best among
all categories comparable to EC_i and EC_j is also the best among all
categories compared to itself. Also, a category can be a replacement
for itself. We keep only the unique categories that are the best
replacements {r(EC_i) : ∀i ∈ R}.
3.2.4 Resolution of conflicting terms. After the resolution of
substantially overlapping terms, some categories contain minor conflicting
terms, i.e., terms that are present in more than one category (Figure 4c).</p>
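      <p>The replacement selection over the matrix R can be sketched compactly; here R is a dict of dicts with R[i][j] holding the quality of category j measured against category i (a representation we chose for illustration):
```python
def best_replacements(R):
    """Keep the unique categories that are the best replacements:
    j replaces i when j is the best entry in row i
    and also the best entry in its own row j."""
    argmax = {i: max(R[i], key=R[i].get) for i in R}
    return {argmax[i] for i in R if argmax[argmax[i]] == argmax[i]}
```
A category can be its own replacement when it is the best entry of its own row.
      </p>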
      <p>To resolve conflicting terms, first, we create a list of “clean”
categories, i.e., from each category we remove all conflicting terms
and record the labels of the categories from which the conflicting terms were
removed. Additionally, we resolve the terms of the categories that
may also be labels of another category, e.g., a term such as “h” in
Figure 3, i.e., we move these label-terms to the categories with the
corresponding labels. We keep all categories even if some categories may
afterward have no terms, i.e., if all their terms conflicted with other
categories.</p>
      <p>Second, to resolve conflicting terms, we estimate the quality of all
“clean” categories. We calculate a quality score q for each category
(see Section 3.2; if |EC| = 0, then q = 0) and sort the categories by
decreasing q. Sorting brings forward the categories that are the most
likely to become final categories.</p>
      <p>Third, we resolve all conflicting terms, beginning with those
that belong to the categories with the highest q. For each conflicting
term t_c and all EC from which the term originated, we calculate a
similarity score s: s(t_c, EC) = sim_t + sim_l, where sim_t is the mean cosine
similarity between the vector representation of t_c and the remaining
terms in a “clean” EC, and sim_l is the cosine similarity between t_c and
the label of EC. Even if a “clean” EC contains no terms, i.e., sim_t = 0,
s will always yield s &gt; 0. We select the best category for a given
term t_c as:</p>
      <p>best(t_c) = {EC | arg max s(t_c, EC)}.</p>
      <p>We add the resolved terms to their best matching category. The
final categories are those where |EC| ≥ 5, i.e., those that represent a
sufficiently large number of extracted terms from the given domain-specific
texts.</p>
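      <p>The conflicting-term resolution can be sketched as follows; `sim` stands in for cosine similarity over fastText vectors, and the category representation (a pair of remaining "clean" terms and a label) is illustrative:
```python
def resolve_term(term, categories, sim):
    """Assign `term` to the category maximizing
    s = mean similarity to the remaining clean terms
      + similarity to the category label."""
    def s(cat):
        terms, label = cat
        sim_t = sum(sim(term, t) for t in terms) / len(terms) if terms else 0.0
        return sim_t + sim(term, label)
    return max(categories, key=s)
```
Even an emptied category can win the term back through its label similarity alone.
      </p>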
    </sec>
    <sec id="sec-5">
      <title>4 EXPERIMENTS</title>
      <p>The evaluation goals are twofold. First, we seek to quantitatively
assess the quality of the automatically extracted and annotated
terms of both ANEA and a baseline using ratings from domain
experts. Second, we seek to identify a recommendation for ANEA’s
default configuration to automatically annotate texts of other
domains by evaluating the automatically annotated and human-assessed
datasets against silver-quality datasets.</p>
      <p>Due to the lack of German datasets for the analysis of
domain-specific NER, we assess the quality of the produced categories
through user studies in which we ask users to rate the quality of the
entities extracted by our system. We test ANEA and compare it to
a baseline on four text datasets with four configurations.</p>
    </sec>
    <sec id="sec-6">
      <title>4.1 User study</title>
      <p>Our user study aimed at a human assessment of the semantic quality
of the categories produced under different configurations. We
collect feedback from human assessors for multiple configurations of
two methods: ANEA and hierarchical clustering (see Section 4.2.1).
First, we use this feedback to automatically construct silver-quality
datasets and evaluate the proposed input configurations against
them. Second, we use these silver datasets to evaluate the obtained
dataset configurations for ANEA to find parameters for a
default configuration with which ANEA could be used to annotate
other domain datasets.
4.1.1 Test datasets. We create four text datasets of comparable size
from three different domains: processing industry (P), computer
science, and traveling (T). To enable both cross- and intra-domain
evaluation, we create two text datasets related to the computer
science domain: databases (D) and software development (S). Table 1
provides an overview of the datasets’ parameters, such as the overall
number of words, the number of unique terms and heads of terms
(see Sec. 3.1), and the number of human assessors for each dataset.
The table shows that the number of unique heads may vary even
given an identical number of unique extracted terms (cf. datasets
S and T).</p>
      <p>To test the applicability of the approach to different domains, we
use publicly available data from Wikipedia and a dataset built on
private text data from a real-world production line in the processing
industry. Specifically, the first three datasets (databases, software
development, and traveling) originate from German Wikipedia articles
dedicated to the respective categories. For each dataset, we searched
for related articles in Wikipedia using a query “incategory:category”,
where “category” is “Datenbanken,” “Programmierung,” or “Reise”.
We iterated over the list of the search results sorted by relevance
and extracted the texts of the articles if the articles had a specific
number of words w: 220 ≤ w ≤ 2500, i.e., articles of medium size.
The last dataset consists of reports about the daily operations of
a company in the processing industry. Such reports include texts
about statuses of the machinery, processes in the production lines,
and problems that occurred throughout the daily routines. The
dataset consists of approximately 200 short texts, each of 20-100
words.
4.1.2 Experiment setup. For the human assessment, we recruited
nine native German-speaking participants (4 f, 5 m, aged between
23 and 60). Each participant is familiar with the domain of the assigned
dataset(s) through their job, education, and/or hobbies.</p>
      <p>
        We assigned 3-4 participants to each dataset, and each
participant evaluated one or two datasets. Although the processing industry
dataset has the smallest number of unique terms, we assigned the
largest number of assessors to it due to the high relevance of
obtaining valid results for such complex expert domains as chemistry and
technology. The vocabulary of these domains is typically strongly
underrepresented in the general text corpora used to train word
embedding models [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>The evaluation included two tasks for the participants: (1) assess
the cross-term relatedness within the identified groups of terms and
(2) assess the relatedness of the labels automatically assigned to the
identified groups of terms. Per dataset, each participant needed to
perform an assessment of eight sheets with automated annotation
results: four identical input configurations per both ANEA and a
baseline. Each participant needed to assign a semantic relatedness
score between 0 and 9, where 0 meant no similarity and 9 the
highest similarity.</p>
      <p>
        The input configuration included four different numbers of
input terms, i.e., terms-to-annotate (TTAs), among which the
algorithms needed to extract the most representative terms that can
form a separate semantic concept, i.e., a category. To vary the size
of TTAs, we selected the 1/k · 100% most frequent heads of phrases:
k ∈ [2, 3, 4, 5] for the datasets with fewer than 1000 unique terms,
else k ∈ [3, 4, 5, 7]. By selecting only terms that share the
most frequent heads, we ensure that these terms are the most
representative of each domain-specific text.
      </p>
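<p>The head-based TTA selection above can be sketched as follows. This is a minimal illustration, not the paper's code; the function name, the rounding rule for the 1/k fraction, and the sample German terms are all assumptions.</p>

```python
from collections import Counter

def select_ttas(term_heads, k):
    """Keep the terms whose phrase head is among the 1/k most frequent heads.

    term_heads: list of (term, head) pairs extracted from the corpus.
    k: divisor controlling the fraction of most frequent heads kept.
    """
    head_counts = Counter(head for _, head in term_heads)
    # keep roughly 1/k of the unique heads, ranked by frequency
    n_keep = max(1, round(len(head_counts) / k))
    top_heads = {h for h, _ in head_counts.most_common(n_keep)}
    return [term for term, head in term_heads if head in top_heads]

pairs = [("Produktionslinie 3", "Linie"), ("Linie A", "Linie"),
         ("Hauptlinie", "Linie"), ("Pumpe P-101", "Pumpe"),
         ("Ventil V2", "Ventil"), ("Linie B", "Linie")]
print(select_ttas(pairs, 3))
```

<p>With k = 3, only the single most frequent head ("Linie") survives, so only its four terms become TTAs.</p>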
      <p>Table 1 reports the results of the user studies and shows that the
relatedness of the groups of terms is rated higher than that of the
assigned labels: the cross-term relatedness scores were biased towards
higher values, whereas the label relatedness scores had a more uniform
distribution. Additionally, the mean and maximum of the relatedness
scores vary across the datasets. We noticed that the relatedness scores
were biased toward the size of the identified categories, i.e., categories
with a smaller number of terms tend to receive higher scores, since it is
easier for a human to assess a smaller number of items. However,
we did not find any correlation between individual datasets and
any of the outlined numeric characteristics.</p>
      <p>To estimate which input configuration and approach yielded the
most coherent categories, we require a silver dataset that averages
the assigned scores, extracts the highly rated combinations of terms
into categories, and assigns the highly rated labels to them.
4.1.3 “Silver” datasets. The goal of a silver dataset is to ensure a
fair and unified evaluation strategy of the approaches for all topics.
We constructed a silver dataset for each topic by aggregating
information from the human assessment sheets following an identical
procedure.</p>
      <p>First, for each dataset, we constructed term-to-term and
label-to-terms score matrices between the vocabulary of each topic and the
extracted and assigned labels (Figure 5). The matrices were
initialized with zeros. We iterated over the relatedness scores across the two
approaches, four input configurations, and two to four human
assessors. For every two terms in a term group, we added the assigned
cross-term relatedness score to the corresponding value in the term-to-term
matrix. This score reflects how two terms are evaluated in various
combinations with other terms across different setups. After the
summation was completed, we normalized each value in the matrix
by the number of times the two terms occurred together. We performed
a similar procedure with the label-to-terms relatedness scores: for
each term in a category and a label assigned to the category, we
added the label-to-terms relatedness score assigned to the category and
then normalized by the number of times the label was applied to the
term.</p>
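<p>The aggregation of cross-term scores into a normalized matrix can be sketched as below; this is a minimal reading of the procedure, and the function name and toy scores are illustrative, not taken from the paper.</p>

```python
import itertools
import numpy as np

def term_score_matrix(terms, assessments):
    """Aggregate cross-term relatedness scores into a normalized matrix.

    assessments: (term_group, score) pairs collected across approaches,
    input configurations, and assessors.
    """
    idx = {t: i for i, t in enumerate(terms)}
    sums = np.zeros((len(terms), len(terms)))
    counts = np.zeros_like(sums)
    for group, score in assessments:
        for a, b in itertools.combinations(group, 2):
            i, j = idx[a], idx[b]
            sums[i, j] += score; sums[j, i] += score
            counts[i, j] += 1; counts[j, i] += 1
    # normalize each pair by how often the two terms co-occurred;
    # pairs that never co-occurred stay at zero
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
```

<p>For example, if "a" and "b" are scored 8 in one group and 6 in another, their matrix entry becomes the normalized value 7.</p>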
      <p>To identify a threshold of relatedness for two terms belonging
to the same category, we built a histogram of all scores used to evaluate
each dataset (see Table 1). In the score range of 0-9, we decided that a
threshold of sufficient relatedness of terms needs to lie above
the mean score and below the maximum value, i.e., between
scores 6-8. Thus, for each dataset, we chose the most frequent score
as a threshold, and if the preceding score was less frequent by only 1,
we used the mean of these two scores as the threshold.</p>
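<p>One reading of this threshold rule is sketched below; the exact tie-handling is an assumption on our wording, and the sample score list is illustrative.</p>

```python
from collections import Counter

def pick_threshold(scores):
    """Pick a relatedness threshold in the 6-8 range: the most frequent
    score there, averaged with the preceding score if that score's
    frequency is lower by exactly 1 (an assumption about the tie rule)."""
    counts = Counter(s for s in scores if 6 <= s <= 8)
    best = max(counts, key=counts.get)
    prev = best - 1
    if prev in counts and counts[best] - counts[prev] == 1:
        return (best + prev) / 2
    return best

print(pick_threshold([6, 7, 7, 7, 8, 8, 6, 9, 3]))
```

<p>Here 7 is the most frequent in-range score and 6 occurs once less often, so the threshold becomes their mean, 6.5.</p>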
      <p>We collected silver groups of terms by choosing a term and
merging it with all other terms whose normalized relatedness score to it
was greater than or equal to the threshold of the dataset. We then expanded
this list of terms with the terms related above the threshold to any of the
already merged terms. Note that not all terms in a group needed to exceed
the relatedness threshold with each other; only the relatedness of
at least two terms needed to exceed the threshold. If a group
contained at least five terms, we formed a silver category. We assigned
a label to a silver category by (1) calculating the mean label-to-terms
scores of all labels applied to at least two identified terms of the group and
(2) selecting the label with the maximum mean score.
</p>
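<p>The single-linkage-style growth of silver groups can be sketched as follows; the function signature and toy relatedness scores are assumptions made for illustration.</p>

```python
def silver_groups(matrix_lookup, terms, threshold, min_size=5):
    """Grow silver term groups: start from a seed term, merge every term
    related to it above the threshold, then keep expanding with terms
    related above the threshold to any already merged term."""
    unused, groups = set(terms), []
    while unused:
        seed = unused.pop()
        group = {seed}
        grew = True
        while grew:
            grew = False
            for t in list(unused):
                if any(matrix_lookup(t, g) >= threshold for g in group):
                    group.add(t); unused.discard(t); grew = True
        # only sufficiently large groups become silver categories
        if len(group) >= min_size:
            groups.append(group)
    return groups
```

<p>Because growth only requires a link to any already merged term, two terms in the same group need not be directly related above the threshold themselves.</p>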
    </sec>
    <sec id="sec-7">
      <title>Evaluation</title>
      <p>
        To evaluate the coherence and semantic quality of the produced
categories at various input configurations, we introduce the
evaluation methodology to evaluate ANEA and a representative baseline
against the silver datasets. By identification the input
configurations that yielded the best results, we sought to propose an optimal
default ANEA’s input configuration for any dataset.
4.2.1 Baseline: hierarchical clustering. We selected hierarchical
clustering (HC) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] as a baseline for ANEA, since it successfully
identifies semantically related terms that refer to identical entities
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Although HC cannot automatically extract and assign
a label to a cluster of terms, we observed
that HC’s clusters can form meaningful categories. Therefore, we
selected HC as a baseline to compare the quality of the produced
groups of terms.
      </p>
      <p>To ensure the best performance of HC for each dataset and
input configuration, we created an optimization procedure for HC that
selects the best clustering results, i.e., outputs clusters that contain the
maximum number of terms with the maximum cross-term similarity. For each
group of terms, we ran HC four times with fixed parameters: cosine
similarity and the average linkage criterion. We chose the linkage method,
distance metric, and hyperparameter optimization for
HC to be as similar as possible to those of ANEA.</p>
      <p>We built clustering configurations by varying the similarity
threshold between 0.5 and 0.8 with a step of 0.1. For each
clustering configuration c_j, we selected only the clusters C_{j,i} with at
least five terms each (|C_{j,i}| ≥ 5), i.e., we imposed the same minimum
size requirement as for ANEA. Then, we calculated a weighted
similarity score of each parameter configuration c_j:</p>
      <p>WSS_j = ( Σ_{i=0}^{N} s_{j,i} · |C_{j,i}| ) / Σ_{i=0}^{N} |C_{j,i}|,
where N is the number of clusters of at least five terms produced at
run j, and s_{j,i} is the cross-term similarity within a cluster. We selected
the best configuration as c* = arg max_j ( WSS_j · Σ_{i=0}^{N} |C_{j,i}| ),
i.e., a configuration that clusters the most terms in the most
semantically coherent way.</p>
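<p>The score and the arg-max selection can be sketched as below; the run names and the (size, mean similarity) values are illustrative, and representing each run by precomputed cluster summaries is an assumption for brevity.</p>

```python
def weighted_similarity(clusters, min_size=5):
    """WSS over clusters of at least min_size terms; clusters is a list
    of (size, mean_cross_term_similarity) pairs for one HC run."""
    kept = [(n, s) for n, s in clusters if n >= min_size]
    total = sum(n for n, _ in kept)
    if total == 0:
        return 0.0, 0
    wss = sum(s * n for n, s in kept) / total
    return wss, total

def best_configuration(runs):
    """Pick the run maximizing WSS * (number of clustered terms)."""
    scored = [(weighted_similarity(clusters), name) for name, clusters in runs]
    return max(scored, key=lambda x: x[0][0] * x[0][1])[1]

runs = [("t=0.5", [(10, 0.55), (8, 0.60)]),
        ("t=0.6", [(7, 0.70), (6, 0.72), (3, 0.90)]),
        ("t=0.7", [(5, 0.85)])]
print(best_configuration(runs))
```

<p>Note how the objective trades coherence for coverage: the t=0.5 run wins despite a lower WSS because it clusters more terms, while the tiny cluster of three terms at t=0.6 is excluded outright.</p>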
      <p>
        Since HC does not have label extraction and assignment
functionality, the human assessors received only one task: assessing
the cross-term relatedness of the clusters produced by HC.
4.2.2 Metrics. To evaluate the quality of the identified categories,
we use six parameters: (1) number of categories: a larger number
indicates diverse and narrowly defined categories, a smaller
number generalizing categories; (2) number of annotated terms (AT):
a larger number indicates that relations were identified among more
terms; (3) the average size of categories: a smaller size indicates more
narrowly defined categories, whereas a larger size indicates more
generally related terms in categories; (4) average cross-term score (TS);
(5) average label-to-terms score (LS); and (6) average score (AS) of TS and
LS: high scores indicate higher relatedness of the extracted
terms and of the extracted and assigned labels. The main goal of our
evaluation is to identify which input configurations lead to the highest
average score between cross-term and label-to-terms relatedness
while annotating more TTAs into more general categories.
4.2.3 Results. To calculate the average relatedness scores, we assigned
the scores from the normalized score matrices to the terms and
labels of the identified entity categories and averaged these scores.
Table 2 reports the evaluation results for four datasets and four input
configurations. The table shows that HC achieves the highest average
relatedness score (AS_HC = 6.5), which almost reaches that of the silver
dataset (AS_silver = 6.7), but at the same time produces categories
smaller than the silver categories (an average size of 6 vs. 16 terms).
While, on average, ANEA annotates the largest number of terms
(AT_ANEA = 175), it also yields the lowest average relatedness
score (AS_ANEA = 5.2), compared to both the silver dataset and
HC. When creating a coding book, multiple human coders first
annotate, and then majority voting decides which excerpts and
labels describe a dataset best. We applied a similar strategy to
improve the performance of ANEA.
4.2.4 Voting strategy and default input configuration. Ensemble
learning is a common approach in machine learning to improve the
results of a classifier by combining the predictions of multiple
classifiers to boost the overall accuracy, i.e., by collecting
“the wisdom of the crowd” [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>We followed this principle to improve the quality of the categories
extracted by ANEA by combining the results of two to four input
configurations for each dataset. Similar to the construction of the silver
datasets, we created a category of at least five terms if these terms
co-occurred in at least two input configurations. We assigned the label
that describes the majority of the terms in the identified group.</p>
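<p>A minimal sketch of this voting strategy follows. The greedy growth of groups from agreeing pairs and the sample category dictionaries are simplifying assumptions, not the paper's implementation.</p>

```python
from collections import Counter
from itertools import combinations

def vote_categories(runs, min_configs=2, min_size=5):
    """Ensemble several ANEA runs: keep term pairs grouped together in at
    least min_configs configurations, grow categories from those pairs,
    and label each category by the majority label over its terms."""
    pair_counts = Counter()
    term_labels = {}
    for categories in runs:  # each run maps label -> list of terms
        for label, terms in categories.items():
            for t in terms:
                term_labels.setdefault(t, []).append(label)
            pair_counts.update(frozenset(p) for p in combinations(sorted(terms), 2))
    # pairs that co-occurred often enough seed the voted categories
    agreeing = [p for p, c in pair_counts.items() if c >= min_configs]
    groups = []
    for pair in agreeing:
        for g in groups:
            if g & pair:   # greedy merge into the first overlapping group
                g |= pair
                break
        else:
            groups.append(set(pair))
    result = {}
    for g in (g for g in groups if len(g) >= min_size):
        label = Counter(l for t in g for l in term_labels[t]).most_common(1)[0][0]
        result[label] = sorted(g)
    return result
```

<p>A term appearing in only one configuration (like a spurious extraction) never forms an agreeing pair and is dropped, which is what raises the relatedness of the voted categories.</p>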
      <p>Table 3 reports that at least one combination of ANEA
with multiple input configurations increases the average relatedness
score compared to ANEA without a voting strategy, on average by
0.7 (AS_voting = 5.9). Although the voting approach does
not exceed the relatedness scores of the silver datasets (it reaches
87.9% of the silver score), it increases the number of annotated terms
per category (an average category size of 11 for ANEA vs. 13 with voting)
and also identifies more generalizing categories (15 categories for ANEA
vs. 9 with voting).</p>
      <p>To identify the best default configuration for the voting strategy
of ANEA, we selected the best-performing voting strategy
configurations for each dataset and deduced a default input configuration
by generalizing them. We took the minimum and
the maximum number of terms-to-annotate (TTAs) from the best
voting configurations and plotted them against the number of unique heads
in each dataset (see Table 1).</p>
      <p>Figure 6 depicts a linear trend between the TTAs and the unique
heads of the datasets. Based on this trend, for any other dataset,
we recommend annotating only the first t = 158 + 0.167 · h TTAs
that belong to the most frequent heads, where h is the number of
unique heads in a dataset. For the voting strategy, we recommend
using the input configurations of t, t − 40, and t + 40
terms that share the most frequent heads.</p>
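<p>The recommended default configuration reduces to a one-line formula; the helper below is an illustrative sketch (the rounding and the sample head count of 600 are our assumptions), built from the linear trend reported in the text.</p>

```python
def default_voting_configs(n_unique_heads):
    """Recommended TTA counts from the reported linear trend
    t = 158 + 0.167 * h, voted over t - 40, t, and t + 40."""
    t = round(158 + 0.167 * n_unique_heads)
    return (t - 40, t, t + 40)

# e.g., a dataset with 600 unique heads (illustrative value)
print(default_voting_configs(600))
```
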
    </sec>
    <sec id="sec-8">
      <title>DISCUSSION AND FUTURE WORK</title>
      <p>Our evaluation shows that ANEA facilitates a faster annotation
process. Specifically, ANEA automatically performs the most
time-consuming tasks of deriving a coding book for the annotation of a
dataset for NER. ANEA imitates the first two stages of a manual
annotation process. First, a small set of articles (≈6000-8000 words)
is used to automatically identify categories relevant to the data
of the current domain, i.e., to identify and extract related terms and
assign a label to each group (we identified 4-12 categories with 44-157
assigned terms). Second, a voting strategy is applied, which aims
to increase the validity of the derived categories by following
the idea of ensemble learning and intercoder agreement (the
relatedness score improved on average from 77.4% of the silver average
relatedness score to 87.9%). To continue with the manual annotation of
a (N)ER dataset, researchers next manually validate their coding
book. If they find that the coding book sufficiently represents the
dataset, they annotate the remaining texts to create a large corpus
for NER.</p>
      <p>Therefore, the primary use cases of ANEA are as follows: first,
the extraction of domain categories from a subset of a large text dataset
and the improvement of their quality with the voting strategy; second, the
manual validation and improvement of the identified categories by moving
terms between the categories and suggesting better labels for them.</p>
      <p>
        The final stage of annotating a NER dataset is to apply a coding
book to a large dataset, i.e., to read the text and assign categories to
text excerpts following the guidelines or examples of the coding book.
Although such manual text annotation is the standard approach to
create “gold”-standard datasets, recent semi-supervised
neural network models (e.g., DART [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) show high potential for
reliably annotating large text collections. We plan to use
models like DART to complete the automation of creating
domain-specific NER datasets.
      </p>
      <p>
        Future work directions include the creation of manually
annotated datasets from scratch for multiple domains to calculate
accuracy metrics, e.g., precision, recall, and F1, to evaluate the
effectiveness of the categories identified by ANEA. Further, to improve the
quality and meaningfulness of the assigned labels, we plan to test
ANEA on other knowledge graphs, e.g., Wikidata or BabelNet. To
test the applicability of ANEA, we also plan to evaluate the approach
in other languages, e.g., English, with an additional module for the
identification of multi-word expressions similar to compound-based
German words [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Further, to improve the semantic quality of
both the categories’ terms and labels in a specific domain, we plan to
use a language model, e.g., BERT, with the quality score as a learning
objective. Lastly, we seek to build a semi-supervised NER model to
complete the automated annotation of NER datasets, i.e., to automatically
annotate large datasets suitable for training neural network models.
We will use the terms and labels from the derived categories as
seed terms and seed labels and perform named entity tagging to
classify more domain terms [
        <xref ref-type="bibr" rid="ref23 ref4">4, 23</xref>
        ].
      </p>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSION</title>
      <p>In this paper, we propose ANEA, an automatic approach to derive
domain entity categories from a subset of domain input texts, i.e.,
create a small dataset to train a NER model. Specifically, ANEA
identifies related domain-representative terms and automatically
extracts and assigns descriptive and generalizing labels to them
based on Wiktionary. In our user assessment and evaluation, ANEA
could not outperform a silver dataset on the relatedness scores
assigned to the groups of terms and labels describing these groups.
However, ANEA produced more generalizing domain categories
compared to a strong baseline. We showed that our voting strategy
of combining terms and labels from the categories identified at
multiple input configurations significantly improved the quality
of the final categories. Additionally, we suggested a default input
configuration that can be applied to derive categories from German
domain text datasets. Finally, we think that the best application of
ANEA is to annotate and use a small dataset in semi-supervised
learning. Moreover, we plan to improve and validate the annotations
with a domain expert, and use this small domain dataset to train
state-of-the-art NER models.</p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGEMENT</title>
      <p>The research for this paper has been conducted in collaboration
with the company eschbach (https://eschbach.com), supported by
the Central Innovation Programme (ZIM) of the German Federal
Ministry for Economic Affairs and Energy.</p>
      <p>We thank all study participants for their significant contribution
to this publication.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Darina</given-names>
            <surname>Benikova</surname>
          </string-name>
          , Chris Biemann, and
          <string-name>
            <given-names>Marc</given-names>
            <surname>Reznicek</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>NoSta-D Named Entity Annotation for German: Guidelines and Dataset</article-title>
          .
          <source>In LREC</source>
          .
          <volume>2524</volume>
          -
          <fpage>2531</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Michael R.</given-names>
            <surname>Berthold</surname>
          </string-name>
          , Christian Borgelt, Frank Höppner, and
          <string-name>
            <given-names>Frank</given-names>
            <surname>Klawonn</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Guide to intelligent data analysis: how to intelligently make sense of real data</article-title>
          . Springer Science &amp; Business Media.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Erik</given-names>
            <surname>Cambria</surname>
          </string-name>
          , Soujanya Poria, Rajiv Bajpai, and
          <string-name>
            <given-names>Björn</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>SenticNet 4: A semantic resource for sentiment analysis based on conceptual primitives</article-title>
          .
          <source>In Proceedings of COLING</source>
          <year>2016</year>
          ,
          <article-title>the 26th international conference on computational linguistics: Technical papers</article-title>
          .
          <fpage>2666</fpage>
          -
          <lpage>2677</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Ernie</given-names>
            <surname>Chang</surname>
          </string-name>
          , Jeriah Caplinger, Alex Marin,
          <string-name>
            <given-names>Xiaoyu</given-names>
            <surname>Shen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Vera</given-names>
            <surname>Demberg</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>DART: A Lightweight Quality-Suggestive Data-to-Text Annotation Tool</article-title>
          .
          <source>In Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations. International Committee on Computational Linguistics (ICCL)</source>
          , Barcelona,
          <source>Spain (Online)</source>
          ,
          <fpage>12</fpage>
          -
          <lpage>17</lpage>
          . https://doi.org/10.18653/v1/
          <year>2020</year>
          .colingdemos.3
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Safaa</given-names>
            <surname>Eltyeb</surname>
          </string-name>
          and
          <string-name>
            <given-names>Naomie</given-names>
            <surname>Salim</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Chemical named entities recognition: a review on approaches and applications</article-title>
          .
          <source>Journal of cheminformatics 6</source>
          ,
          <issue>1</issue>
          (
          <year>2014</year>
          ),
          <fpage>17</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Oren</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Cafarella</surname>
          </string-name>
          , Doug Downey,
          <string-name>
            <given-names>Ana-Maria</given-names>
            <surname>Popescu</surname>
          </string-name>
          , Tal Shaked, Stephen Soderland, Daniel S Weld, and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Yates</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Unsupervised named-entity extraction from the web: An experimental study</article-title>
          .
          <source>Artificial intelligence 165</source>
          ,
          <issue>1</issue>
          (
          <year>2005</year>
          ),
          <fpage>91</fpage>
          -
          <lpage>134</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>John</given-names>
            <surname>Foley</surname>
          </string-name>
          , Sheikh Muhammad Sarwar, and
          <string-name>
            <given-names>James</given-names>
            <surname>Allan</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Named Entity Recognition with Extremely Limited Data</article-title>
          . arXiv preprint arXiv:
          <year>1806</year>
          .
          <volume>04411</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Édouard</given-names>
            <surname>Grave</surname>
          </string-name>
          , Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and
          <string-name>
            <given-names>Tomáš</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Learning Word Vectors for 157 Languages</article-title>
          .
          <source>In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC</source>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Timm</given-names>
            <surname>Heuss</surname>
          </string-name>
          , Bernhard Humm,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Henninger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Rippl</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A comparison of NER tools wrt a domain-specific vocabulary</article-title>
          .
          <source>In Proceedings of the 10th International Conference on Semantic Systems</source>
          .
          <volume>100</volume>
          -
          <fpage>107</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Honnibal</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ines</given-names>
            <surname>Montani</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing</article-title>
          . (
          <year>2017</year>
          ). To appear.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Loster</surname>
          </string-name>
          , Felix Naumann, Jan Ehmueller, and
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Feldmann</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Curex: a system for extracting, curating, and exploring domain-specific knowledge graphs from text</article-title>
          .
          <source>In Proceedings of the 27th ACM International Conference on Information and Knowledge Management</source>
          .
          <year>1883</year>
          -
          <fpage>1886</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Loster</surname>
          </string-name>
          , Zhe Zuo, Felix Naumann, Oliver Maspfuhl, and Dirk Thomas.
          <year>2017</year>
          .
          <article-title>Improving Company Recognition from Unstructured Text by using Dictionaries.</article-title>
          .
          <source>In EDBT</source>
          .
          <volume>610</volume>
          -
          <fpage>619</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Alireza</given-names>
            <surname>Mansouri</surname>
          </string-name>
          , Lilly Suriani Affendey, and
          <string-name>
            <given-names>Ali</given-names>
            <surname>Mamat</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Named entity recognition approaches</article-title>
          .
          <source>International Journal of Computer Science and Network Security</source>
          <volume>8</volume>
          ,
          <issue>2</issue>
          (
          <year>2008</year>
          ),
          <fpage>339</fpage>
          -
          <lpage>344</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Christian M.</given-names>
            <surname>Meyer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>What psycholinguists know about chemistry: Aligning Wiktionary and WordNet for increased domain coverage</article-title>
          .
          <source>In Proceedings of 5th International Joint Conference on Natural Language Processing</source>
          .
          <fpage>883</fpage>
          -
          <lpage>892</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Fionn</given-names>
            <surname>Murtagh</surname>
          </string-name>
          and
          <string-name>
            <given-names>Pedro</given-names>
            <surname>Contreras</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Algorithms for hierarchical clustering: an overview</article-title>
          .
          <source>Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery</source>
          <volume>2</volume>
          ,
          <issue>1</issue>
          (
          <year>2012</year>
          ),
          <fpage>86</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>David</given-names>
            <surname>Nadeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter D.</given-names>
            <surname>Turney</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Stan</given-names>
            <surname>Matwin</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Unsupervised namedentity recognition: Generating gazetteers and resolving ambiguity</article-title>
          .
          <source>In Conference of the Canadian society for computational studies of intelligence</source>
          . Springer,
          <fpage>266</fpage>
          -
          <lpage>277</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Alexander E.</given-names>
            <surname>Richman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Schone</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Mining wiki resources for multilingual named entity recognition</article-title>
          .
          <source>In Proceedings of ACL-08: HLT. 1-9.</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Tim</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Weidlich</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ulf</given-names>
            <surname>Leser</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>ChemSpot: a hybrid system for chemical named entity recognition</article-title>
          .
          <source>Bioinformatics</source>
          <volume>28</volume>
          ,
          <issue>12</issue>
          (
          <year>2012</year>
          ),
          <fpage>1633</fpage>
          -
          <lpage>1640</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Josef</given-names>
            <surname>Ruppenhofer</surname>
          </string-name>
          , Ines Rehbein, and
          <string-name>
            <given-names>Carolina</given-names>
            <surname>Flinz</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Fine-grained Named Entity Annotations for German Biographic Interviews</article-title>
          . (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Bahar</given-names>
            <surname>Salehi</surname>
          </string-name>
          , Paul Cook, and
          <string-name>
            <given-names>Timothy</given-names>
            <surname>Baldwin</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Detecting noncompositional mwe components using wiktionary</article-title>
          .
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          .
          <fpage>1792</fpage>
          -
          <lpage>1797</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Erik</given-names>
            <surname>Tjong Kim Sang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Fien</given-names>
            <surname>De Meulder</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition</article-title>
          .
          <source>In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003</source>
          .
          <fpage>142</fpage>
          -
          <lpage>147</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Schiersch</surname>
          </string-name>
          , Veselina Mironova, Maximilian Schmitt, Philippe Thomas,
          <string-name>
            <given-names>Aleksandra</given-names>
            <surname>Gabryszak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Leonhard</given-names>
            <surname>Hennig</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A German Corpus for Fine-Grained Named Entity Recognition and Relation Extraction of Traffic and Industry Events</article-title>
          .
          <source>In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Jingbo</given-names>
            <surname>Shang</surname>
          </string-name>
          , Liyuan Liu, Xiaotao Gu, Xiang Ren, Teng Ren, and Jiawei Han.
          <year>2018</year>
          .
          <article-title>Learning Named Entity Tagger using Domain-Specific Dictionary</article-title>
          .
          <source>In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          .
          <fpage>2054</fpage>
          -
          <lpage>2064</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Eszter</given-names>
            <surname>Simon</surname>
          </string-name>
          and Dávid Márk Nemeskey.
          <year>2012</year>
          .
          <article-title>Automatically generated NE tagged corpora for English and Hungarian</article-title>
          .
          <source>Association for Computational Linguistics</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Suzushi</given-names>
            <surname>Tomori</surname>
          </string-name>
          , Takashi Ninomiya, and
          <string-name>
            <given-names>Shinsuke</given-names>
            <surname>Mori</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Domain specific named entity recognition referring to the real world by deep neural networks</article-title>
          .
          <source>In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          .
          <fpage>236</fpage>
          -
          <lpage>242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Don</given-names>
            <surname>Tuggener</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Incremental Coreference Resolution for German</article-title>
          .
          <source>Ph.D. Dissertation</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Nicolas</given-names>
            <surname>Weber</surname>
          </string-name>
          and
          <string-name>
            <given-names>Paul</given-names>
            <surname>Buitelaar</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Web-based ontology learning with isolde</article-title>
          .
          <source>In Proc. of the ISWC Workshop on Web Content Mining with Human Language Technologies</source>
          . Citeseer.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Vikas</given-names>
            <surname>Yadav</surname>
          </string-name>
          and
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bethard</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A Survey on Recent Advances in Named Entity Recognition from Deep Learning models</article-title>
          .
          <source>In Proceedings of the 27th International Conference on Computational Linguistics</source>
          .
          <fpage>2145</fpage>
          -
          <lpage>2158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Torsten</given-names>
            <surname>Zesch</surname>
          </string-name>
          , Christof Müller, and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary</article-title>
          .
          <source>In LREC</source>
          , Vol.
          <volume>8</volume>
          .
          <fpage>1646</fpage>
          -
          <lpage>1652</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Shaodian</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Noémie</given-names>
            <surname>Elhadad</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Unsupervised biomedical named entity recognition: Experiments with clinical and biological texts</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>46</volume>
          ,
          <issue>6</issue>
          (
          <year>2013</year>
          ),
          <fpage>1088</fpage>
          -
          <lpage>1098</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Jie</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bi-cheng</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Gang</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Automatically building large-scale named entity recognition corpora from Chinese Wikipedia</article-title>
          .
          <source>Frontiers of Information Technology &amp; Electronic Engineering</source>
          <volume>16</volume>
          ,
          <issue>11</issue>
          (
          <year>2015</year>
          ),
          <fpage>940</fpage>
          -
          <lpage>956</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>