=Paper=
{{Paper
|id=Vol-3004/paper1
|storemode=property
|title=ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts
|pdfUrl=https://ceur-ws.org/Vol-3004/paper1.pdf
|volume=Vol-3004
|authors=Anastasia Zhukova,Felix Hamborg,Bela Gipp
|dblpUrl=https://dblp.org/rec/conf/jcdl/ZhukovaHG21
}}
==ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts==
EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents

Anastasia Zhukova (University of Wuppertal, Germany), zhukova@uni-wuppertal.de
Felix Hamborg (University of Konstanz, Germany), felix.hamborg@uni-konstanz.de
Bela Gipp (University of Wuppertal, Germany), gipp@uni-wuppertal.de
ABSTRACT

Named entity recognition (NER) is an important task that aims to resolve universal categories of named entities, e.g., persons, locations, organizations, and times. Despite its common and viable use in many use cases, NER is barely applicable in domains where general categories are suboptimal, such as engineering or medicine. To facilitate NER of domain-specific types, we propose ANEA, an automated (named) entity annotator that assists human annotators in creating domain-specific NER corpora for German text collections when given a set of domain-specific texts. In our evaluation, we find that ANEA automatically identifies the terms that best represent the texts' content, identifies groups of coherent terms, and extracts and assigns descriptive labels to these groups, i.e., annotates text datasets into domain (named) entities.

CCS CONCEPTS

• Information systems → Information extraction; • Computing methodologies → Information extraction; Language resources; Cluster analysis.

KEYWORDS

information extraction, low-resource languages, named entity recognition, domain-specific texts

1 INTRODUCTION

Named entity recognition (NER), a common preprocessing step in natural language processing (NLP) for various tasks, such as information extraction, summarization, question answering, and text understanding, is often criticized for capably representing only datasets with few general categories, e.g., person, location, organization, and time (including their subcategories) [5, 30]. While the original NER task contains only a few categories, a rapidly increasing number of NER applications show a high demand for datasets with domain-specific named entities [18].

To (semi-)automatically create large general-purpose NER corpora, recent research projects extensively use structured domain sources, such as dictionaries, knowledge graphs, and Wikipedia or other knowledge bases [17, 24, 31].

In this paper, we propose ANEA, an unsupervised Wiktionary-based approach that automatically derives domain entities from German texts, i.e., a low-resource language, by (1) extracting terms from topically related domain texts, (2) identifying the most domain-representative, i.e., semantically distinct, terms of the analyzed texts, and (3) automatically annotating the terms, i.e., ANEA extracts labels from Wiktionary and assigns them to the identified groups of terms. Not all of the domain categories may be named, e.g., machinery or process.

By automating the most labor-intense parts, the proposed unsupervised approach minimizes the cost of the expensive and laborious annotations required for the creation of domain-specific NER datasets. Typical manual annotation tasks include (1) reading the domain text multiple times, (2) deriving entities based on the text content, and (3) manually selecting terms that match the derived categories. ANEA substitutes the most time-consuming task of deriving a coding book and automatically defines categories and annotates the most representative terms (nouns) into these categories. We evaluate the approach with user studies on multiple domain datasets against multiple silver datasets and discuss a default input configuration for ANEA to annotate other domain NER datasets¹.

2 RELATED WORK

NER datasets usually contain standard types, e.g., person, location, organization, and are manually annotated [1, 21] or automatically extracted [17, 24, 31]. Domain-specific NER typically needs to introduce domain-specific (sub-)categories of the established named entity (NE) categories or entirely new categories. This is because domain-specific texts contain NE categories that are (1) detailed variants of the standard NE categories, e.g., "Person" is replaced with the domain-specific sub-categories "Players" and "Coaches" [27], (2) standard NE categories extended with a small number of new categories, e.g., "Trigger of a traffic jam" [11, 19, 22], and (3) domain-derived NE categories, e.g., "Proteins" in biology or "Reactions" in chemistry [9, 18, 25, 30]. Most domain-derived NE categories originate from structured classifications or dictionaries [9, 12, 25] or are derived by manually unifying multiple of them [5]. In sum, creating domain-specific datasets for NER requires expert knowledge and is time-consuming.

To minimize such efforts, some NER approaches use seed NEs, i.e., a small number of manually provided terms and their NE categories [7, 16, 30]. Such approaches use the seed NEs as examples to extract patterns of NE definitions and apply them to the full text suggested for annotation. These NER approaches suffer from the slow updates of the underlying domain knowledge bases (KBs) [28] and perform worse on lower-resource languages than on English [9]. An alternative to domain KBs are community KBs, such as Wikipedia and Wiktionary, which are constantly updated by their communities and have been shown to contain a sufficient amount of domain information [14, 29].

¹ https://github.com/anastasia-zhukova/ANEA

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Unlike the existing supervised approaches for annotating domain-specific named entities [6, 13], in this paper, we explore ANEA, an unsupervised method to support researchers and users during the creation of a coding book. Given a set of domain- or use-case-specific documents, ANEA automatically derives domain-specific categories and exemplary terms within them. This way, ANEA automates the most time-intensive, previously manual tasks. As a consequence, users only need to revise these terms, e.g., by renaming the categories or re-annotating the terms that do not match the categories.

3 METHODOLOGY

We propose an unsupervised approach for the annotation of domain-specific (named) entities (ANEA) for a lower-resource language. The goal of ANEA is to fully automatically derive entity categories (later in the text: categories) by selecting groups of related terms and extracting and assigning a meaningful label to these terms. To do so, ANEA first links terms extracted from domain-specific texts to pages in Wiktionary [14, 24]. Second, ANEA automatically identifies groups of related terms and automatically labels them by performing a double optimization task that maximizes both the cross-similarity of the terms in a group and the average similarity of these terms to a candidate label. That is, the approach consists of two main steps: (1) text preprocessing, i.e., term mapping to Wiktionary pages (WPs) and construction of a domain graph, and (2) identification of related terms and label assignment.

3.1 Preprocessing and domain graph

3.1.1 Preprocessing. The goal of preprocessing is to extract terms from the set of texts and maximize the number of terms aligned to the Wiktionary structure, i.e., map the domain-specific terminology to the structured knowledge base. The mapping of the extracted terms to the knowledge graph enables using their semantic information, such as term definitions, areas, hypernyms, and hyponyms (see Figure 1).

[Figure 1: An example of a Wiktionary page (WP).]

The preprocessing steps include parsing and part-of-speech (POS) tagging using spaCy [10]. We define a term as any unique noun phrase that does not contain any digits [7]. After extraction, terms are mapped to their respective German WPs, if any, i.e., each term gets assigned a link to a WP.

In German texts, we find that many domain-specific terms are compound words, i.e., words that consist of more than one noun component, for example, "Sechszylindermotor" = "sechs" + "Zylinder" + "Motor" (six-cylinder motor). Typically, such complex domain compound words are not described in Wiktionary since they are too rare or specific. On the contrary, we observe that compound words' heads, i.e., the part of a composite term that bears the core meaning of the phrase, e.g., "Motor" in the example above, are highly likely to have a WP in Wiktionary.

To map rare domain-specific terms to WPs, we extract the heads of the extracted terms with a compound splitter, i.e., a model that splits terms into two parts, a compound part and a head part [26], and attempt to map the heads to Wiktionary. If a (multi-token) term has a corresponding WP, we set the full term as the term's head. If the compound splitter outputs a head that is not part of Wiktionary, we continue to recursively search for a head that can be mapped to a WP. If no head has a corresponding WP, then we do not assign a head to the term.

The preprocessing could be changed to include terms with digits, but for now, we focus on noun phrases as terms. If a term or the head of its compound phrase does not have a WP, it is excluded from annotation because the absence of a link to Wiktionary makes it impossible to map the term to potential category labels. Later, such discarded terms can be classified manually by human annotators or automatically with state-of-the-art NER models trained on the automatically created domain datasets.

We use fastText to vectorize the extracted terms and candidate labels [8]. We chose fastText due to its ability to vectorize out-of-vocabulary words, which occur often in domain-specific terminology.
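A minimal sketch of this head-mapping step (illustrative only; `has_wp` and `split_compound` are hypothetical stand-ins for a German Wiktionary lookup and a compound splitter such as the one used in [26]):

```python
def find_head(term, has_wp, split_compound, max_splits=5):
    """Map a (compound) term to a head that has a Wiktionary page (WP).

    has_wp(word) -> bool: True if a German WP exists for `word` (assumed helper).
    split_compound(word) -> (modifier, head) or None: one splitting step of a
    compound splitter, e.g., "Sechszylindermotor" -> ("Sechszylinder", "Motor").
    """
    if has_wp(term):
        return term          # the full term has its own WP: use it as its head
    current = term
    for _ in range(max_splits):
        parts = split_compound(current)
        if parts is None:
            return None      # no further split possible: leave the term unmapped
        _, head = parts
        if has_wp(head):
            return head      # found a head with a corresponding WP
        current = head       # otherwise, recursively split the head again
    return None              # no head with a WP: the term is excluded from annotation
```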
3.1.2 Domain graph. The domain graph is a locally stored knowledge graph whose leaves are the extracted domain terms. Nodes are all terms obtained from the WPs linked to the leaves and to each other with hyponymy-hypernymy relations. Figure 2a depicts the principle of a domain graph. The construction of the domain graph includes three steps: (1) graph initialization, i.e., extraction of the WP properties, e.g., definitions, by scraping the WPs assigned to the domain terms and their heads; (2) determination of pruning criteria for the Wiktionary graph to scrape only domain-related pages; (3) expansion of the domain graph, i.e., scraping of the hypernym pages, to create a pool of candidate labels with which to later annotate the identified groups of terms. Figure 2b shows the process of domain graph construction.

3.1.3 Initialization. To initialize the graph, we use the extracted terms to which we mapped WPs and scrape the mapped WPs to extract the WPs' properties, e.g., hypernyms. As a preliminary step of the graph initialization, we group the extracted terms by their heads. The head grouping aims at the extraction of the initial hyponym-hypernym relations for the domain graph. Then, we sort the list of heads in decreasing order by (1) the number of unique terms with each head and (2) the frequency of the overall in-text occurrence of words with such a head.
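The head grouping and sorting can be sketched as follows (a simplified illustration; the dictionary inputs are assumptions about what the preprocessing step produces):

```python
from collections import Counter, defaultdict

def sort_head_groups(term_heads, term_freq):
    """Sort heads for graph initialization (Section 3.1.3).

    term_heads: dict term -> head, from the compound-splitting step.
    term_freq: dict term -> in-text frequency of the term.
    Sorts heads in decreasing order by (1) the number of unique terms per
    head and (2) the total in-text frequency of the terms sharing that head.
    """
    groups = defaultdict(set)
    head_freq = Counter()
    for term, head in term_heads.items():
        groups[head].add(term)
        head_freq[head] += term_freq.get(term, 0)
    return sorted(groups, key=lambda h: (len(groups[h]), head_freq[h]), reverse=True)
```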
[Figure 2: (a) A domain graph is a combination of Wiktionary and domain terminology extracted from text. (b) Domain graph construction: initialization and expansion of the graph.]

To maximize the descriptiveness and generalization of the terms that will be annotated into categories, we initialize the domain graph with the terms-to-annotate that belong to the top M largest head groups, i.e., those containing the most lexically diverse and/or frequent terms. Section 4 determines an optimal value for the number of terms-to-annotate and the largest head groups through a series of experiments. This filtering procedure reduces the size of the domain graph to minimize the execution time and extracts the most representative candidate labels, i.e., the most closely located hypernyms.

Each term without a hyponym is a leaf of the domain graph; a node is a head that aggregates more than one term. We scrape the WPs of all leaf and node terms to extract the text and links from definitions, hypernyms, and hyponyms (see Figure 1).

We extract hyponym terms from the corresponding WP section. We extract hypernym terms from two parts of a WP: (1) the hypernym section, and (2) the definition section, by parsing the text of the term's definitions and ensuring that the extracted word has its own WP². For example, in Figure 1, the word "Maschine" will be extracted as an additional (in-text) hypernym to those listed in the hypernym section.

² We extract the tokens that have one of the following dependency tags: "ROOT", "oa" = accusative object, "oa2" = second accusative object, "app" = apposition, "cj" = conjunct.

The extracted properties are assigned to each node. The hypernyms' links point at WPs that may later become nodes of the domain graph. Extraction and assignment of the WPs' properties bridge the domain terms and heads to Wiktionary's knowledge graph.

3.1.4 A priori pruning. Most terms of WPs have more than one sense, and some of them may be associated with different semantic areas, e.g., technology, medicine, sport, law, etc. If a sense belongs to only one area, the title of the area precedes the definition explaining the sense, e.g., "Technik" in Figure 1. The a priori pruning step determines which senses of the hypernyms yet to be added comply with the senses of the previously added terms. Hypernyms become properties of a node in the domain graph if and only if their areas belong to a predefined list of areas or if they do not list any domain areas.

To identify which Wiktionary areas determine the graph's domain, we select the most frequent and semantically similar areas extracted from the senses' definitions of the previously added leaves and nodes. To find the most semantically similar and frequent areas, we cluster all area titles using hierarchical clustering and select the areas from the most representative clusters. As parameters of the hierarchical clustering, we use Euclidean distance and the average linkage criterion, and we optimize the number of clusters. To represent the areas' titles in the vector space, we apply the fastText word embedding model [8].

We extract the three most representative clusters by: (1) selecting the cluster with the highest average cosine similarity across all words (C_s); (2) selecting the cluster with the highest count of all words (C_f); (3) forming an extra cluster with the K most frequent areas (C_k, K = 5). To identify the Wiktionary areas A forming the domain of the graph, we intersect all three representative clusters: A = {∪ a : a ∈ C_s ∩ C_f ∩ C_k}. Finally, we select the clustering configuration that outputs the best domain-defining areas as A_best = arg max_{6 ≤ i ≤ 12} (s(A_i) · f(A_i)), where A_i are the areas identified at i clusters, s(A_i) is the average cross-similarity of the areas in A_i, and f(A_i) is the sum of the frequencies of A_i.
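As a rough illustration of this area selection, the following sketch uses SciPy's hierarchical clustering; `embed` (a fastText lookup) and `area_freq` are assumed inputs, and the scoring details are simplified relative to the paper:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def best_domain_areas(areas, area_freq, embed, k=5):
    """Select domain-defining Wiktionary areas A_best (Section 3.1.4)."""
    vecs = np.array([embed(a) for a in areas])
    Z = linkage(vecs, method="average", metric="euclidean")
    top_k = set(sorted(areas, key=lambda a: area_freq[a], reverse=True)[:k])  # C_k

    def mean_sim(cluster):  # average pairwise cosine similarity within a cluster
        v = np.array([embed(a) for a in cluster])
        v = v / np.linalg.norm(v, axis=1, keepdims=True)
        sims = v @ v.T
        return sims[np.triu_indices(len(cluster), 1)].mean() if len(cluster) > 1 else 0.0

    best, best_score = set(), -1.0
    for n_clusters in range(6, 13):  # optimize the number of clusters, 6 <= i <= 12
        labels = fcluster(Z, t=n_clusters, criterion="maxclust")
        clusters = [[a for a, l in zip(areas, labels) if l == c] for c in set(labels)]
        c_s = set(max(clusters, key=mean_sim))                                  # most coherent
        c_f = set(max(clusters, key=lambda cl: sum(area_freq[a] for a in cl)))  # most frequent
        a_i = c_s & c_f & top_k                                                 # A_i = C_s ∩ C_f ∩ C_k
        if a_i:
            score = mean_sim(sorted(a_i)) * sum(area_freq[a] for a in a_i)      # s(A_i) · f(A_i)
            if score > best_score:
                best, best_score = a_i, score
    return best
```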
3.1.5 Graph growing. The goal of ANEA is to assign the most generalizing yet still representative labels to groups of semantically related terms, e.g., "Person" for "Trump" and "Einstein." To ensure the generalization property of the label candidates, we "grow", i.e., expand, the domain graph upward by adding new nodes at the top of the graph from the scraped hypernym WPs.

To grow the graph, we iterate over the top nodes and create new nodes for each of the hypernym terms. To obtain a node's properties, we scrape the WPs of the hypernym terms and extract term definitions, hyponyms, and hypernyms. For each new node, we add hypernym-hyponym edges between this new node and the matching previously added nodes while also removing any edges creating cycles in the domain graph. To avoid over-generalizing candidate labels, we perform only one or two iterations of graph growing.
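A toy sketch of the growing step (using networkx; `scrape_wp` is a hypothetical helper that returns the definitions, hypernyms, and hyponyms of a WP):

```python
import networkx as nx

def grow_graph(graph, scrape_wp, iterations=2):
    """Expand the domain graph upward by hypernym nodes (Section 3.1.5).

    graph: nx.DiGraph with edges directed hypernym -> hyponym.
    iterations: kept at 1-2 to avoid over-generalizing candidate labels.
    """
    for _ in range(iterations):
        tops = [n for n in graph if graph.in_degree(n) == 0]  # current top nodes
        for node in tops:
            for hypernym in scrape_wp(node)["hypernyms"]:
                graph.add_node(hypernym)
                graph.add_edge(hypernym, node)                # hypernym-hyponym edge
                if not nx.is_directed_acyclic_graph(graph):
                    graph.remove_edge(hypernym, node)         # drop edges that create cycles
    return graph
```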
3.2 Automated term grouping and labeling

The goal of ANEA is to obtain (named) entity categories, i.e., a few clusters of generally related terms with high cross-similarity, and to assign descriptive labels to these clusters. To do so, we maximize two parameters at the same time: the cross-term group similarity and the similarity between a group of terms and a label.

ANEA consists of an initial setup of the categories and three subsequent optimization steps that improve the representativeness of the terms and the assigned labels in the groups of terms.

3.2.1 Setup. We initialize ANEA by collecting all candidate categories, i.e., groups of potentially related terms and the labels assigned to them. Figure 3 depicts the process of candidate label collection.

[Figure 3: ANEA setup: a collection of candidate entity categories.]

[Figure 4: (a) Resolution of the full overlaps. (b) Resolution of the substantial overlaps. (c) Resolution of the conflicting terms.]

First, we iterate over all term-nodes, i.e., the domain graph's leaves and the nodes that were created from the extracted terms, not the hypernym WPs. For each term-node, we collect candidate labels extracted from the names of their hypernym-nodes. Each term obtains a list of candidate labels with various distances, i.e., numbers of edges between a term-node and a label-node. We recursively traverse the domain graph as long as the distance between a term-node and a label-node is d ≤ d_max (d_max = 5) and there are still hypernyms to the current node in the domain graph. During the experiments, we noticed that label-nodes with larger distances are often rather abstract and do not characterize a term well.

Second, we "transpose" all terms and their candidate labels to obtain one label assigned to a group of terms. That is, we create a collection of categories among which we seek to find the most representative categories of the analyzed domain-specific text collection.

The selection of the optimal categories among the candidates is a double-optimization process towards two requirements: generalization and specification. On the one hand, generalization aims at covering categories' broader semantics, e.g., a category with a more general label "Person" is better than categories such as "Actor," "Politician," etc. On the other hand, specification aims at selecting the category with more narrow semantics, e.g., categories
such as "Country," "City," and "State" provide more details about their terms than a category "Location".

We use a quality score Q_i to evaluate each (entity) category EC_i in a list of candidates:

Q_i = T_i · L_i · O_i · max(log2 |EC_i|, 1) · d_avg_i

where T_i is the mean cross-term cosine similarity; L_i is the mean label-terms similarity; O_i is the overall similarity, i.e., O_i = T_i + L_i; |EC_i| is the size of a category, i.e., the number of terms in the class; and d_avg_i is the average of the non-zero distances between the category's terms and its label l: d_avg_i = (1 / |D_nn_i|) · Σ_{d ∈ D_nn_i} d, where D_nn_i = {D_{i,l} | ∀i ∈ EC : D_{i,l} > 0} and D is a distance matrix³. If |D_nn_i| = 0, then d_avg_i = 1.

³ The rows are the names of term-nodes and the columns are the names of term-nodes and label-nodes. The columns also contain the term-nodes because some of the term-nodes may not be leaves but nodes of the domain graph (see Figure 2).

To calculate cosine similarities, we represent each term and label in a vector space with fastText word embeddings [8]. We chose fastText for the representation of out-of-vocabulary words, which often occur in domain-specific texts.

Requiring a large mean cross-term cosine similarity T_i increases the specificity of a category. Typically, the smaller the number of related terms, the larger the mean cross-similarity. A larger mean label-terms similarity L_i also increases the specificity, i.e., a large similarity value is equivalent to a narrow descriptiveness of the terms by a label.

The overall sum O_i facilitates balancing a potentially small value of either T_i or L_i if the other one is still large. A large category size increases its generalizing and descriptive properties, i.e., one label needs to describe as many terms as possible. Lastly, the average distance d_avg_i acts as an amplifying factor for generalization: the higher the label l is in the domain graph, the more general is its meaning with respect to the terms in this category.

Before the optimization steps, we filter the candidate categories to remove low-quality categories from the candidate list. We remove an EC_i if: (1) T_i < 0.2, (2) L_i < 0.3, or (3) |EC_i| > 0.15 · |TTA| ∨ |EC_i| < 5, where |TTA| is the number of terms-to-annotate, i.e., the number of term-nodes in the domain graph. In other words, we remove too vaguely related, very large, or very small categories.
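The quality score can be computed directly from the definition above; the sketch below assumes plain NumPy vectors for the terms and the candidate label:

```python
import numpy as np

def quality_score(term_vecs, label_vec, distances):
    """Quality score Q of one candidate category (Section 3.2.1).

    term_vecs: list of fastText vectors of the category's terms.
    label_vec: fastText vector of the candidate label.
    distances: graph distances (edge counts) between term-nodes and the label-node.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    n = len(term_vecs)
    t = np.mean([cos(term_vecs[i], term_vecs[j])           # T: mean cross-term similarity
                 for i in range(n) for j in range(i + 1, n)]) if n > 1 else 0.0
    l = np.mean([cos(v, label_vec) for v in term_vecs])    # L: mean label-terms similarity
    o = t + l                                              # O: overall similarity
    nonzero = [d for d in distances if d > 0]
    d_avg = np.mean(nonzero) if nonzero else 1.0           # d_avg = 1 if no non-zero distances
    return t * l * o * max(np.log2(n), 1.0) * d_avg
```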
probable to become final categories.
3.2.2 Resolution of full overlaps. Figure 4a depicts that if two cate- Third, we resolve all conflicting terms by beginning with those
gories have the same terms but different labels, we sort the classes that belong to the categories with the highest 𝑄. For each conflicting
by their quality scores 𝑄𝑖 and keep the categories with the highest term 𝑡 𝑗 and all 𝐸𝐶𝑖 from which the term originated, we calculate a
𝑄𝑖 . similarity score 𝑆: 𝑆 (𝑡 𝑗 , 𝐸𝐶𝑖 ) = 𝑇 + 𝐿, where 𝑇 is the mean cosine
3.2.3 Resolution of substantial overlaps. Typically, categories have similarity between the vector representation of 𝑡 𝑗 and the remaining
overlaps between their terms albeit we find that cross-terms and terms in a “clean” 𝐸𝐶𝑖 ; 𝐿 is the cosine similarity between 𝑡 𝑗 and
terms-label combinations of a single category are more semantically a label of 𝐸𝐶𝑖 . Even if a “clean” 𝐸𝐶𝑖 contains no terms, i.e., 𝑇 = 0,
coherent than combinations of another category. We define that 𝐿 will always yield 𝑆 > 0. We select the best category for a given
categories have substantial overlap if they share more than 50% of term 𝑡 𝑗 as:
their terms. 𝐸𝐶𝑏𝑒𝑠𝑡 (𝑡 𝑗 ) = {𝐸𝐶𝑖 |∃𝑖 ∈ 𝐴 : arg max 𝑆𝑖 (𝑡 𝑗 )}
Figure 4b depicts the process of conflict resolution. We construct
a matrix of replacements 𝑅, i.e., a matrix indicating the quality We add the resolved terms to their best matching category. The
of a category measured by 𝑄 compared to those categories with final categories are where |𝐸𝐶 | ≥ 5 , i.e., that represent a sufficiently
substantial overlaps (values of 𝑅 initialized with 0). The matrix is large number of extracted terms from the given domain-specific
used to identify if an 𝐸𝐶𝐴 contains the best terms-label combination texts.
or there is a better 𝐸𝐶𝑟𝑒𝑝𝑙 to replace 𝐸𝐶𝐴 . Since ANEA’s goal is to
4 EXPERIMENTS
3 Rows are the names of term-nodes and the columns are the names of term-nodes and
label-nodes. The columns contain also the term-nodes because some of the term-nodes
The evaluation goals are twofold. First, we seek to quantitatively
may not be leaves but nodes of the domain graph (see Figure 2). assess the quality of the automatically extracted and annotated
9
EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents
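A minimal sketch of this resolution step, assuming precomputed fastText vectors (the data layout is an assumption for illustration):

```python
import numpy as np

def resolve_conflicting_term(term_vec, candidates):
    """Assign a conflicting term to its best category (Section 3.2.4).

    candidates: list of (category_id, clean_term_vecs, label_vec) tuples for
    the "clean" categories from which the term originated.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    best_id, best_s = None, float("-inf")
    for cat_id, term_vecs, label_vec in candidates:
        t = np.mean([cos(term_vec, v) for v in term_vecs]) if len(term_vecs) else 0.0
        l = cos(term_vec, label_vec)   # the label similarity keeps S > 0 even if T = 0
        if t + l > best_s:
            best_id, best_s = cat_id, t + l
    return best_id                     # EC_best(t_j) = arg max_i S(t_j, EC_i)
```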
4 EXPERIMENTS

The evaluation goals are twofold. First, we seek to quantitatively assess the quality of the automatically extracted and annotated terms of both ANEA and a baseline using ratings from domain experts. Second, we seek to identify a recommendation for ANEA's default configuration to automatically annotate texts of other domains by evaluating the annotated and human-assessed annotated datasets against silver-quality datasets.

Due to the lack of German datasets for the analysis of domain-specific NER, we assess the quality of the produced categories through user studies where we ask users to rate the quality of the entities extracted by our system. We test ANEA and compare it to a baseline on four text datasets with four configurations.

4.1 User study

Our user study aimed at the human assessment of the semantic quality of the categories produced by different configurations. We collect feedback from human assessors for multiple configurations of two methods: ANEA and hierarchical clustering (see Section 4.2.1). First, we use this feedback to automatically construct silver-quality datasets and evaluate the proposed input configurations against them. Second, we use these silver datasets to evaluate the obtained configurations of ANEA and to find the parameters of a default configuration with which ANEA could be used to annotate other domain datasets.

4.1.1 Test datasets. We create four text datasets of comparable size from three different domains: processing industry (P), computer science, and traveling (T). To enable both cross- and intra-domain evaluation, we create two text datasets related to the computer science domain: databases (D) and software development (S). Table 1 provides an overview of the datasets' parameters, such as the overall number of words, the number of unique terms and heads of terms (see Sec. 3.1), and the number of human assessors per dataset. The table shows that the number of unique heads may vary even given an identical number of unique extracted terms (cf. datasets S and T).

To test the applicability of the approach to different domains, we use publicly available data from Wikipedia and a dataset built on private text data from a real-world production line in the processing industry. Specifically, the first three datasets (databases, software development, and traveling) originate from German Wikipedia articles dedicated to the respective categories. For each dataset, we searched for related articles in Wikipedia using the query "incategory:category", where "category" is "Datenbanken," "Programmierung," or "Reise". We iterated over the list of search results sorted by relevance and extracted the texts of the articles if the articles had a specific number of words W: 220 ≤ W ≤ 2500, i.e., articles of medium size. The last dataset consists of reports about the daily operations of a company in the processing industry. Such reports include texts about the statuses of the machinery, the processes in the production lines, and the problems that occurred throughout the daily routines. The dataset consists of approximately 200 short texts, each of 20-100 words.

4.1.2 Experiment setup. For the human assessment, we recruited nine native German-speaking participants (4 f, 5 m, aged between 23 and 60). Each participant is familiar with the domain of the assigned dataset(s) through their job, education, and/or hobbies.

We assigned 3-4 participants to each dataset, and each participant evaluated one or two datasets. Although the processing industry dataset has the smallest number of unique terms, we assigned the largest number of assessors to it due to the high relevance of obtaining valid results for such complex, expert domains as chemistry and technology. The vocabulary of these domains is typically strongly underrepresented in the general text corpora used to train word embedding models [8].

The evaluation included two tasks for the participants: (1) assess the cross-term relatedness within the identified groups of terms and (2) assess the relatedness of the labels automatically assigned to the identified groups of terms. Per dataset, each participant needed to assess eight sheets with automated annotation results: four identical input configurations for both ANEA and a baseline. Each participant needed to assign a semantic relatedness score between 0 and 9, where 0 meant no similarity and 9 the highest similarity.

The input configuration included four different numbers of input terms, i.e., terms-to-annotate (TTAs), among which the algorithms needed to extract the most representative terms that can form a separate semantic concept, i.e., a category. To vary the size of TTAs, we selected the 1/Z · 100% most frequent heads of phrases: Z ∈ [2, 3, 4, 5] for the datasets with a number of unique terms < 1000, else Z ∈ [3, 4, 5, 7]. By selecting only the terms that share the most frequent heads, we ensure that these terms are the most representative of each domain-specific text.
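The TTA selection can be sketched as follows (the dictionary inputs are illustrative assumptions):

```python
def select_terms_to_annotate(term_heads, head_freq, z):
    """Pick the terms sharing the 1/Z most frequent heads (Section 4.1.2).

    term_heads: dict term -> head; head_freq: dict head -> frequency.
    z: e.g., Z in [2, 3, 4, 5] for datasets with < 1000 unique terms,
    else Z in [3, 4, 5, 7].
    """
    heads = sorted(head_freq, key=head_freq.get, reverse=True)
    top_heads = set(heads[: len(heads) // z])   # the 1/Z most frequent heads
    return [t for t, h in term_heads.items() if h in top_heads]
```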
Table 1: Dataset statistics: databases (D), software development (S), traveling (T), and processing industry (P).

Dataset     D      S      T      P
All words   8161   8581   6293   7984
Terms       1209   1041   1040   552
Heads       713    673    801    328
Assessors   3      3      2      4

[The table in the original paper also shows, per dataset, histograms of the user-assigned relatedness scores (1. cross-term relatedness, 2. label-terms relatedness) for the categories extracted with the various input configurations; bold scores indicate the thresholds used to construct the silver datasets.]

Table 1 reports the results of the user studies and shows that the cross-term relatedness of the groups of terms is higher than the relatedness of the assigned labels. That is, the cross-term relatedness scores were biased towards higher values, while the label relatedness scores had a more uniform distribution. Additionally, the mean and maximum of the relatedness scores vary across the datasets. We noticed that the relatedness scores were biased toward the size of the identified categories, i.e., categories with a smaller number of terms tend to have higher scores since it is easier for a human to assess a smaller number of items. However, we did not find any correlation between individual datasets and any of the outlined numeric characteristics.
[Figure 5: Construction of a silver dataset by collecting information from the user studies.]

To estimate which input configuration and approach yielded the most coherent categories, we require a silver dataset, which averages the assigned scores, extracts the highly rated combinations of terms into categories, and assigns the highly rated labels to them.

4.1.3 "Silver" datasets. The goal of a silver dataset is to ensure a fair and unified evaluation strategy of the approaches for all topics. We constructed a silver dataset for each topic by aggregating information from the human assessment sheets following an identical procedure.

First, for each dataset, we constructed term-to-term and label-to-terms score matrices between the vocabulary of each topic and the extracted and assigned labels (Figure 5). The matrices were initialized with zeros. We iterated over the relatedness scores across the two approaches, four input configurations, and two to four human assessors. For every two terms in a term group, we added the assigned cross-term relatedness score to a value in the term-to-term matrix. This score demonstrates how two terms are evaluated in various combinations with other terms across different setups. After the summation was completed, we normalized each value in the matrix by the number of times the two terms occurred together. We performed a similar procedure with the label-to-terms relatedness scores: for each term in a category and the label assigned to the category, we added the label-to-terms relatedness score assigned to the category and then normalized by the number of times the label was applied to the term.
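The aggregation of the term-to-term scores can be sketched as follows (a sparse dictionary stands in for the matrix; the sheet format is an assumption):

```python
from collections import defaultdict

def term_to_term_scores(rated_groups):
    """Build the normalized term-to-term score matrix (Section 4.1.3).

    rated_groups: iterable of (terms, score) pairs collected over the two
    approaches, four input configurations, and 2-4 assessors per dataset.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for terms, score in rated_groups:
        for i, t1 in enumerate(terms):
            for t2 in terms[i + 1:]:             # every term pair in a rated group
                key = tuple(sorted((t1, t2)))
                sums[key] += score               # sum the assigned scores ...
                counts[key] += 1                 # ... and count the co-occurrences
    return {key: sums[key] / counts[key] for key in sums}  # normalize per pair
```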
To identify a threshold for the relatedness of two terms belonging to a category, we built a histogram of all scores used to evaluate each dataset (see Table 1). In the score range of 0-9, we decided that a threshold of sufficient relatedness of terms needs to lie higher than the mean score and not be equal to the maximum value, i.e., between scores 6-8. Thus, for each dataset, we chose the most frequent score as a threshold, and if a preceding score was less frequent by 1, then we calculated the mean of these scores.

We collected silver groups of terms by choosing a term and merging it with the other terms whose normalized relatedness score to it is higher than or equal to the threshold of this dataset. We expanded this list of terms with the terms that are related, above the threshold, to any of the already merged terms. Note that not all terms in a group needed to exceed the relatedness threshold with each other, but the relatedness of at least two terms needed to exceed the threshold. If a group contained at least five terms, we formed a silver category. We assigned a label to a silver category by (1) calculating the mean label-to-terms scores of all labels applied to at least two identified terms of a group and (2) selecting the label with the maximum mean score.

4.2 Evaluation

To evaluate the coherence and semantic quality of the categories produced at various input configurations, we introduce an evaluation methodology that evaluates ANEA and a representative baseline against the silver datasets. By identifying the input configurations that yielded the best results, we sought to propose an optimal default input configuration of ANEA for any dataset.

4.2.1 Baseline: hierarchical clustering. We selected hierarchical clustering (HC) [15] as a baseline to ANEA, since it successfully identifies semantically related terms that refer to identical entities [3]. Although HC does not have the functionality of automated extraction and assignment of a label to a cluster of terms, we observed that HC's clusters could form meaningful categories. Therefore, we selected HC as a baseline to compare the quality of the produced groups of terms.

To ensure the best performance of HC for each dataset and each input configuration, we created an optimization of HC that selects the best clustering results and outputs clusters that contain the maximum number of terms with the maximum cross-term similarity. For each group of terms, we ran HC four times with fixed parameters of cosine similarity and the average linkage criterion. We chose the linkage method, the distance metric, and the optimization of the hyperparameters for HC to be the most similar to those of ANEA.

We built clustering configurations by varying the similarity threshold value between [0.5; 0.8] with a step of 0.1. For each clustering configuration j, we selected only the clusters CL_{j,i} with at least 5 terms each (|CL_{j,i}| ≥ 5), i.e., we impose the same minimum size requirement as for ANEA. Then, we calculated a weighted similarity score WS_j of each parameter configuration:

WS_j = (1 / Σ_{i=0}^{I} |CL_{j,i}|) · Σ_{i=0}^{I} T_{j,i} · |CL_{j,i}|

where I is the number of clusters of at least 5 terms produced at a run j, and T_{j,i} is the cross-term similarity within a cluster. We selected the best configuration as C_best = arg max_j (WS_j · Σ_{i=0}^{I} |CL_{j,i}|), i.e., the configuration that clusters the most terms in the most semantically coherent way.
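A sketch of this configuration selection (the per-run cluster summaries are assumed to be precomputed):

```python
def best_hc_configuration(runs):
    """Select the best HC similarity threshold (Section 4.2.1).

    runs: dict threshold -> list of (cluster_terms, mean_cross_term_similarity)
    pairs, with thresholds varied in [0.5, 0.8] with a step of 0.1.
    """
    best_cfg, best_value = None, float("-inf")
    for threshold, clusters in runs.items():
        big = [(terms, t) for terms, t in clusters if len(terms) >= 5]  # same minimum size as ANEA
        if not big:
            continue
        total = sum(len(terms) for terms, _ in big)
        ws = sum(t * len(terms) for terms, t in big) / total   # weighted similarity score WS_j
        if ws * total > best_value:                            # C_best = arg max_j WS_j · Σ|CL_j,i|
            best_cfg, best_value = threshold, ws * total
    return best_cfg
```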
Since HC does not have label extraction and assignment functionality, the human assessors received only one task: the assessment of the cross-term relatedness of the clusters produced by HC.

4.2.2 Metrics. To evaluate the quality of the identified categories, we use six parameters: (1) the number of categories: a larger number indicates diverse and narrowly defined categories, a smaller number generalizing categories; (2) the number of annotated terms (AT): a property of identified relations between more terms; (3) the average size of the categories: a smaller size indicates more narrowly defined categories, whereas a larger size indicates more generally related terms in the categories; (4) the average cross-term score (TS); (5) the average label-to-terms score (LS); and (6) the average score (AS) between TS and LS: high scores indicate a higher relatedness of the extracted terms and of the extracted and assigned labels. The main goal of our evaluation is to identify which input configurations lead to the highest average score between cross-term and label-to-terms relatedness while annotating more TTAs into more general categories.

4.2.3 Results. To calculate the average relatedness scores, we assigned the scores from the normalized score matrices to the terms and labels of the identified ECs and averaged these scores. Table 2 reports the evaluation results for the four datasets and four input configurations. The table shows that HC gets the highest average relatedness score (AS_avg,HC = 6.5), which almost reaches the silver datasets (AS_avg,silver = 6.7), but at the same time produces categories of a smaller size than the silver categories (Size_avg,HC = 6 and Size_avg,silver = 16). While on average ANEA annotates the largest number of terms (AT_mean,ANEA = 175), it also yields the lowest average relatedness score (AS_mean,ANEA = 5.2), compared both to the silver datasets and to HC. When creating a coding book, multiple human coders first annotate, and then majority voting decides which excerpts and labels describe a dataset best. We applied a similar strategy to improve the performance of ANEA.

Table 2: Evaluation of the entity identification methods in various input configurations. Z is the denominator to choose the terms-to-annotate from the 1/Z most frequent terms' heads; TTA is the number of terms-to-annotate; EC is the number of (entity) categories produced by each approach; AT is the number of annotated terms, i.e., those that belong to the identified categories; Size is the average category size; TS is the mean cross-term relatedness score among the categories; LS is the mean label-to-terms relatedness score; AS is the average score between TS and LS (* means that AS is equal to TS because HC does not assign labels to groups of terms).

Topic  Approach  Z  TTA  EC  AT   Size  TS   LS   AS
D      silver    3  420  5   113  23    7.2  7.0  7.2
D      HC        3  420  14  108  8     6.9  –    6.9*
D      HC        4  363  12  87   7     7.0  –    7.0*
D      HC        5  316  10  73   7     7.0  –    7.0*
D      HC        7  253  8   52   7     7.2  –    7.2*
D      ANEA      3  420  26  306  12    4.7  4.2  4.4
D      ANEA      4  363  22  255  12    5.3  4.6  5.0
D      ANEA      5  316  21  234  11    5.6  5.0  5.3
D      ANEA      7  253  18  179  10    5.7  5.0  5.4
S      silver    3  356  6   57   10    6.2  6.0  6.1
S      HC        3  356  17  190  11    5.3  –    5.3*
S      HC        4  303  15  152  10    5.5  –    5.5*
S      HC        5  255  15  137  9     5.4  –    5.4*
S      HC        7  191  12  103  9     5.4  –    5.4*
S      ANEA      3  356  21  242  12    4.1  4.4  4.2
S      ANEA      4  303  18  186  10    4.2  3.8  4.0
S      ANEA      5  255  15  164  11    4.2  4.4  4.3
S      ANEA      7  191  10  119  12    5.0  5.3  5.2
T      silver    3  363  6   115  19    7.8  6.7  7.3
T      HC        3  363  19  156  8     7.3  –    7.3*
T      HC        4  297  16  133  8     7.3  –    7.3*
T      HC        5  258  12  105  9     7.3  –    7.3*
T      HC        7  211  9   76   8     7.2  –    7.2*
T      ANEA      3  363  22  239  11    5.4  4.8  5.1
T      ANEA      4  297  17  191  11    5.4  4.4  4.9
T      ANEA      5  258  14  161  12    5.3  4.4  4.9
T      ANEA      7  211  14  137  10    5.6  4.0  4.8
P      silver    2  282  7   102  15    6.6  6.2  6.4
P      HC        2  282  9   65   7     5.2  –    5.2*
P      HC        3  227  8   61   8     6.0  –    6.0*
P      HC        4  200  8   61   8     6.0  –    6.0*
P      HC        5  183  7   56   8     6.1  –    6.1*
P      ANEA      2  282  18  213  12    4.7  4.5  4.6
P      ANEA      3  227  16  172  11    5.3  4.9  5.1
P      ANEA      4  200  16  163  10    5.3  4.8  5.1
P      ANEA      5  183  15  149  10    5.2  4.5  4.8

4.2.4 Voting strategy and default input configuration. Ensemble learning is a common approach in machine learning to improve the results of a classifier, i.e., by combining the predictions of multiple classifiers to achieve a boost in the overall accuracy through collecting "the wisdom of the crowd" [2].

We followed this principle to improve the quality of the categories extracted by ANEA and combined the results of 2-4 configurations per dataset. Similar to the construction of the silver datasets, we created a category of at least five terms if these terms co-occurred in at least two input configurations. We assigned the label that describes the majority of the terms in the identified group.

Table 3 reports that at least one combination of ANEA with multiple input configurations increases the average relatedness score compared to ANEA without a voting strategy, on average by 0.7 (AS_avg,ANEA-vote = 5.9). Although the voting approach does not exceed the relatedness scores of the silver datasets (it reaches 87.9% of the silver score), it increases the number of annotated terms per category (Size_avg,ANEA = 11 and Size_avg,ANEA-vote = 13) and also identifies more generalizing categories (EC_avg,ANEA = 15 and EC_avg,ANEA-vote = 9).

To identify the best default configuration for the voting strategy of ANEA, we selected the best performing voting strategy configurations per dataset and deduced a default input configuration by generalizing these configurations. We took the minimum and the maximum numbers of terms-to-annotate (TTAs) from the best voting configurations and plotted them against the number of unique heads in each dataset (see Table 1).
Figure 6 depicts a linear trend between the TTAs and the unique heads of the datasets. Based on this trend, for any other dataset, we recommend annotating only the first y = 158 + 0.167x TTAs that belong to the most frequent heads, where x is the number of unique heads in the dataset. For the voting strategy, we recommend using input configurations with y, y − 40, and y + 40 terms that share the most frequent heads of terms.

[Figure 6: Default input configuration: the number of terms-to-annotate depends linearly on the number of unique NP heads per dataset.]

Table 3: Evaluation of the entity identification methods when the (entity) categories (EC) are created by majority voting between input configurations. For each dataset, majority voting improves the relatedness score. The highlighted configurations show a range of terms-to-annotate (TTA) and the resulting best average score (AS).

Topic  Conf. (Z)      TTA      EC  AT   Size  TS   LS   AS
D      silver         420      5   113  23    7.2  7.0  7.2
D      prev. best: 7  253      18  179  10    5.7  5.0  5.4
D      3+4            363-420  16  160  10    5.7  5.1  5.4
D      3+4+5          316-420  16  230  14    5.5  4.9  5.2
D      3+4+5+7        253-316  16  247  15    5.4  4.8  5.1
D      4+5            316-363  16  172  11    6.0  5.5  5.7
D      4+5+7          253-363  15  202  13    5.7  5.1  5.4
D      5+7            253-316  12  122  10    6.3  5.9  6.1
S      silver         356      6   57   10    6.2  6.0  6.1
S      prev. best: 7  191      10  119  12    5.0  5.3  5.2
S      3+4            303-356  10  95   10    4.5  5.0  4.8
S      3+4+5          255-356  13  173  13    4.4  4.6  4.5
S      3+4+5+7        191-356  11  185  17    4.4  4.4  4.4
S      4+5            255-303  8   89   11    4.4  4.5  4.4
S      4+5+7          191-303  10  126  13    4.2  4.8  4.5
S      5+7            191-255  4   44   11    5.6  6.5  6.0
T      silver         363      6   115  19    7.8  6.7  7.3
T      prev. best: 3  363      22  239  11    5.4  4.8  5.1
T      3+4            297-363  14  119  9     6.1  5.3  5.7
T      3+4+5          258-363  12  146  12    6.2  5.6  5.9
T      3+4+5+7        211-363  9   171  19    6.1  5.6  5.8
T      4+5            258-297  10  101  10    6.1  5.0  5.6
T      4+5+7          211-297  12  137  11    5.8  5.1  5.4
T      5+7            211-258  9   86   10    5.9  5.2  5.6
P      silver         282      7   102  15    6.6  6.2  6.4
P      prev. best: 3  227      16  172  11    5.3  4.9  5.1
P      2+3            227-282  12  133  11    5.6  5.6  5.6
P      2+3+4          200-282  14  160  11    5.5  5.2  5.4
P      2+3+4+5        181-282  9   157  17    5.7  5.6  5.6
P      3+4            200-227  12  128  11    5.5  4.9  5.2
P      3+4+5          183-227  14  146  10    5.4  4.7  5.0
P      4+5            200-227  13  121  9     5.4  5.0  5.2
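The default-configuration recommendation above reduces to a one-line formula; as a small helper (the linear fit and the ±40 offsets are taken directly from the recommendation):

```python
def default_input_configurations(n_unique_heads):
    """Default TTA sizes for the voting strategy (Section 4.2.4).

    Returns the three recommended terms-to-annotate sizes y - 40, y, y + 40,
    with y = 158 + 0.167 * x, where x is the number of unique heads.
    """
    y = round(158 + 0.167 * n_unique_heads)
    return [y - 40, y, y + 40]
```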
ANEA on other knowledge graphs, e.g., Wikidata or BabelNet. To
5 DISCUSSION AND FUTURE WORK test the applicability of ANEA, we also plan to evaluate the approach
Our evaluation shows that ANEA facilitates a faster annotation in other languages, e.g., English, with an additional module for the
process. Specifically, ANEA automatically performs the most time- identification of multi-word expressions similar to compound-based
consuming tasks of deriving a coding book for the annotation of a German words [20]. Further, to improve the semantic quality of
13
both the categories' terms and labels in a specific domain, we plan to use a language model, e.g., BERT, and use the quality score as a learning objective. Lastly, we seek to build a semi-supervised NER model to complete the automated annotation of NER datasets, i.e., automatically annotate large datasets suitable for training neural network models. We will use the terms and labels from the derived categories as seed-terms and seed-labels and perform named entity tagging to classify more domain terms [4, 23].

6 CONCLUSION

In this paper, we propose ANEA, an automatic approach to derive domain entity categories from a subset of domain input texts, i.e., to create a small dataset to train a NER model. Specifically, ANEA identifies related domain-representative terms and automatically extracts and assigns descriptive and generalizing labels to them based on Wiktionary. In our user assessment and evaluation, ANEA could not outperform a silver dataset on the relatedness scores assigned to the groups of terms and the labels describing these groups. However, ANEA produced more generalizing domain categories than a strong baseline. We showed that our voting strategy of combining terms and labels from the categories identified at multiple input configurations significantly improved the quality of the final categories. Additionally, we suggested a default input configuration that can be applied to derive categories from German domain text datasets. Finally, we think that the best application of ANEA is to annotate a small dataset and use it in semi-supervised learning. Moreover, we plan to improve and validate the annotations with a domain expert and use this small domain dataset to train state-of-the-art NER models.

ACKNOWLEDGEMENT

The research for this paper has been conducted in collaboration with the company eschbach (https://eschbach.com), supported by the Central Innovation Programme (ZIM) of the German Federal Ministry for Economic Affairs and Energy. We thank all study participants for their significant contribution to this publication.

REFERENCES

[1] Darina Benikova, Chris Biemann, and Marc Reznicek. 2014. NoSta-D Named Entity Annotation for German: Guidelines and Dataset. In LREC. 2524–2531.
[2] Michael R. Berthold, Christian Borgelt, Frank Höppner, and Frank Klawonn. 2010. Guide to Intelligent Data Analysis: How to Intelligently Make Sense of Real Data. Springer Science & Business Media.
[3] Erik Cambria, Soujanya Poria, Rajiv Bajpai, and Björn Schuller. 2016. SenticNet 4: A semantic resource for sentiment analysis based on conceptual primitives. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. 2666–2677.
[4] Ernie Chang, Jeriah Caplinger, Alex Marin, Xiaoyu Shen, and Vera Demberg. 2020. DART: A Lightweight Quality-Suggestive Data-to-Text Annotation Tool. In Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations. ICCL, Barcelona, Spain (Online), 12–17. https://doi.org/10.18653/v1/2020.coling-demos.3
[5] Safaa Eltyeb and Naomie Salim. 2014. Chemical named entities recognition: a review on approaches and applications. Journal of Cheminformatics 6, 1 (2014), 17.
[6] Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2005. Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence 165, 1 (2005), 91–134.
[7] John Foley, Sheikh Muhammad Sarwar, and James Allan. 2018. Named Entity Recognition with Extremely Limited Data. arXiv preprint arXiv:1806.04411 (2018).
[8] Édouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomáš Mikolov. 2018. Learning Word Vectors for 157 Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
[9] Timm Heuss, Bernhard Humm, Christian Henninger, and Thomas Rippl. 2014. A comparison of NER tools w.r.t. a domain-specific vocabulary. In Proceedings of the 10th International Conference on Semantic Systems. 100–107.
[10] Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. (2017). To appear.
[11] Michael Loster, Felix Naumann, Jan Ehmueller, and Benjamin Feldmann. 2018. CurEx: A system for extracting, curating, and exploring domain-specific knowledge graphs from text. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 1883–1886.
[12] Michael Loster, Zhe Zuo, Felix Naumann, Oliver Maspfuhl, and Dirk Thomas. 2017. Improving Company Recognition from Unstructured Text by using Dictionaries. In EDBT. 610–619.
[13] Alireza Mansouri, Lilly Suriani Affendey, and Ali Mamat. 2008. Named entity recognition approaches. International Journal of Computer Science and Network Security 8, 2 (2008), 339–344.
[14] Christian M. Meyer and Iryna Gurevych. 2011. What psycholinguists know about chemistry: Aligning Wiktionary and WordNet for increased domain coverage. In Proceedings of the 5th International Joint Conference on Natural Language Processing. 883–892.
[15] Fionn Murtagh and Pedro Contreras. 2012. Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2, 1 (2012), 86–97.
[16] David Nadeau, Peter D. Turney, and Stan Matwin. 2006. Unsupervised named-entity recognition: Generating gazetteers and resolving ambiguity. In Conference of the Canadian Society for Computational Studies of Intelligence. Springer, 266–277.
[17] Alexander E. Richman and Patrick Schone. 2008. Mining wiki resources for multilingual named entity recognition. In Proceedings of ACL-08: HLT. 1–9.
[18] Tim Rocktäschel, Michael Weidlich, and Ulf Leser. 2012. ChemSpot: a hybrid system for chemical named entity recognition. Bioinformatics 28, 12 (2012), 1633–1640.
[19] Josef Ruppenhofer, Ines Rehbein, and Carolina Flinz. 2020. Fine-grained Named Entity Annotations for German Biographic Interviews. (2020).
[20] Bahar Salehi, Paul Cook, and Timothy Baldwin. 2014. Detecting non-compositional MWE components using Wiktionary. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1792–1797.
[21] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003. 142–147.
[22] Martin Schiersch, Veselina Mironova, Maximilian Schmitt, Philippe Thomas, Aleksandra Gabryszak, and Leonhard Hennig. 2018. A German Corpus for Fine-Grained Named Entity Recognition and Relation Extraction of Traffic and Industry Events. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
[23] Jingbo Shang, Liyuan Liu, Xiaotao Gu, Xiang Ren, Teng Ren, and Jiawei Han. 2018. Learning Named Entity Tagger using Domain-Specific Dictionary. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2054–2064.
[24] Eszter Simon and Dávid Márk Nemeskey. 2012. Automatically generated NE tagged corpora for English and Hungarian. Association for Computational Linguistics.
[25] Suzushi Tomori, Takashi Ninomiya, and Shinsuke Mori. 2016. Domain specific named entity recognition referring to the real world by deep neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 236–242.
[26] Don Tuggener. 2016. Incremental Coreference Resolution for German. Ph.D. Dissertation.
[27] Nicolas Weber and Paul Buitelaar. 2006. Web-based ontology learning with ISOLDE. In Proc. of the ISWC Workshop on Web Content Mining with Human Language Technologies. Citeseer.
[28] Vikas Yadav and Steven Bethard. 2018. A Survey on Recent Advances in Named Entity Recognition from Deep Learning models. In Proceedings of the 27th International Conference on Computational Linguistics. 2145–2158.
[29] Torsten Zesch, Christof Müller, and Iryna Gurevych. 2008. Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary. In LREC, Vol. 8. 1646–1652.
[30] Shaodian Zhang and Noémie Elhadad. 2013. Unsupervised biomedical named entity recognition: Experiments with clinical and biological texts. Journal of Biomedical Informatics 46, 6 (2013), 1088–1098.
[31] Jie Zhou, Bi-cheng Li, and Gang Chen. 2015. Automatically building large-scale named entity recognition corpora from Chinese Wikipedia. Frontiers of Information Technology & Electronic Engineering 16, 11 (2015), 940–956.