<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Building Lightweight Ontologies for Faceted Search with Named Entity Recognition: Case WarMemoirSampo</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mikko Koho</string-name>
          <email>mikko.koho@aalto.fi</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafael Leal</string-name>
          <email>rafael.leal@aalto.fi</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Esko Ikkala</string-name>
          <email>esko.ikkala@aalto.fi</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minna Tamper</string-name>
          <email>minna.tamper@aalto.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Heikki Rantala</string-name>
          <email>heikki.rantala@aalto.fi</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eero Hyvönen</string-name>
          <email>eero.hyvonen@aalto.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Helsinki Centre for Digital Humanities (HELDIG), University of Helsinki</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Semantic Computing Research Group (SeCo), Aalto University</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper discusses building lightweight ontologies for faceted search user interfaces with Named Entity Recognition (NER) from textual data. This is studied in the context of building a Knowledge Graph for the textual indexing of interview videos in the in-use WarMemoirSampo system, consisting of a Linked Open Data service and an open semantic web portal for contextualized video viewing. It is shown that state-of-the-art NER tools are able to find entities from textual data and categorize them with high enough recall and precision to be useful for building facet ontologies, without involving considerable manual domain ontology engineering. To enable entity disambiguation and to be able to show relevant contextual information and useful links for the users of the portal, also Named Entity Linking techniques are employed.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Named Entity Recognition</kwd>
        <kwd>Information Extraction</kwd>
        <kwd>Faceted Search</kwd>
        <kwd>Linked Data</kwd>
        <kwd>Ontologies</kwd>
        <kwd>Named Entity Linking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the 1930s, S. R. Ranganathan introduced the idea of faceted classification in Library Science.
Related to this idea, the faceted search paradigm [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], called also view-based search [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and
dynamic taxonomies [4], is based on indexing data items along orthogonal category hierarchies,
i.e., facets (e.g., subject matter, places, times, etc.). The user can select categories on the facets
in free order, and the data items included in the selected categories are considered the search
results. After a selection, a count is calculated for each category, showing the number of results
for that selection, which is useful for guiding the search. In [5], ontologies that interlink the
data in a semantic web application were used as a basis for associating data to the facets. The
paradigm has been found suitable for, e.g., Semantic Web user interfaces in cultural heritage
(CH) applications, such as [6, 7, 8], and various tools for faceted Linked Data-based search and
browsing [9] have been developed, such as /facet [10], SPARQL Faceter [11], and Sampo-UI [12].
      </p>
      <p>In the Linked Data [13] context, applications typically use pre-defined ontologies [ 14] for
facets, the same that have been used for annotating the underlying metadata. For example, in [8]
the vocabularies1 of the Getty Research Institute were employed. Ontologies can also be created
in many ways, typically involving domain experts in manually engineering the ontologies for
some specific needs [ 15]. In many cases such vocabularies or ontologies may not be readily
available, and manually creating one might not be feasible, for example, when the metadata
to be searched has not been structured (like is the case in, e.g., museum collections) but are
available only in textual form. The vocabularies then have to be constructed bottom-up from
the textual data.</p>
      <p>In this paper, we argue that it is possible to build lightweight ontologies [16, 17] by applying
state-of-the-art Named Entity Recognition (NER) [18] methods to textual data to find named
entities of diferent categories, and that these lightweight ontologies can be used to enable
the search, exploration, and analysis of data on a faceted search user interface. We show how
lightweight facet ontologies can be built from the entities found with NER and the textual
objects linked to the created ontologies to enable ontology-based information retrieval.</p>
      <p>These topics are studied in the context of WarMemoirSampo, which is a Linked Open Data
(LOD) resource of Finnish Second World War (WW2) veteran interview videos [19], as well
as an in-use semantic portal2 for easy access to them. It hosts a collection of 159 videos, with
approximately 400 hours of playtime in total, where veterans mostly reminisce about their lives
during and after the wartime. WarMemoirSampo is realized by combining NER and Named
Entity Linking (NEL) [20, 21, 22] on the video content descriptions, and enriching video metadata
with related information from the linked entities in the WarSampo knowledge graph [23] and
Wikidata [24]. The original video content descriptions consist of free-form notes in Finnish,
which are time-indexed to enable finding video segments based on the discussed topics. In
WarMemoirSampo, the videos and their segments are indexed [25, 26] with the extracted
metadata.</p>
      <p>The paper complements our earlier papers about WarMemoirSampo: [27] presents the
general concept of publishing, searching, and watching the interview videos using Linked Data
while [19] gives an overview of the data and text processing and of building the semantic portal
to make use of the data.</p>
      <p>The paper is structured as follows. Section 2 presents related work relating to the topics of
this research. Section 3 discusses the data used in WarMemoirSampo. Section 4 explains the
implementation of NER and NEL for building the facet ontologies, which are evaluated in Section 5.
In Section 6, the ontologies are shown in use in the faceted search of the WarMemoirSampo
portal. Section 7 summarizes the results of the paper.</p>
      <sec id="sec-1-1">
        <title>1https://www.getty.edu/research/tools/vocabularies/ 2The portal: https://sotamuistot.arkisto.fi/</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Digitized cultural heritage documents and artefact collections have inspired many scholars in
creating applications for searching, browsing, and recommending data. Typically the dataset
metadata is utilized in the search applications to find artefacts based on their metadata
similarly [28, 29, 30]. In addition, enriching the metadata using knowledge extraction methods is
sometimes possible depending on the collection. For instance, Europeana3 collections consist
of more than 50 million CH objects provided by partner institutions across Europe [
        <xref ref-type="bibr" rid="ref4 ref5">31, 32</xref>
        ].
The Europeana portal provides users with tools to find interesting objects from their digitized
collections. This includes also utilizing named entities to recommend content. Similarly, NER
has been used in archaeology and war history to build data search applications [
        <xref ref-type="bibr" rid="ref6 ref7">33, 34</xref>
        ]. In the
case of archeology, in addition to using named entities, Brandsen et al. [
        <xref ref-type="bibr" rid="ref6">33</xref>
        ] have evaluated
BERT-based NER for information retrieval with an archaeological text collection, including
advanced search capabilities.
      </p>
      <p>
        There are several NER tools and models for extracting named entities from Finnish texts.
Lately, the three most notable tools are StanfordNER [
        <xref ref-type="bibr" rid="ref8">35</xref>
        ], FiNER [
        <xref ref-type="bibr" rid="ref9">36</xref>
        ], and FinBERT’s NER
models [
        <xref ref-type="bibr" rid="ref10">37</xref>
        ]. Out of the three, the BERT-based models for Finnish are evaluated as the most
accurate so far [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">37, 38, 39</xref>
        ].
      </p>
      <p>
        On lemmatization, the more traditional method is based on finite state transducers (FST),
for which Omorfi (Open morphology for Finnish) [
        <xref ref-type="bibr" rid="ref13">40</xref>
        ] is a prime example for the Finnish
language; another one, Voikko4, has been in development for many years. A newer method,
which uses deep neural networks, is exemplified by the TurkuNLP Neural Parser pipeline
[
        <xref ref-type="bibr" rid="ref14">41</xref>
        ]. WarMemoirSampo employs both methods, using FST to correct errors made by neural
networks.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. WarMemoirSampo Data</title>
      <p>The WarMemoirSampo knowledge graph is built from source data that was available in
spreadsheet format (CSV). The source data consists of video and interview metadata with textual
descriptions of video contents. The core component of the metadata are spreadsheet tables
with timestamped textual descriptions of the things being discussed on the interview videos,
divided into temporal segments of various lengths. The free-form notes in Finnish and the
video segmentation are created originally at the time of the interviews, and later transformed
into structured data. The videos and their metadata were provided by the veteran organization
Tammenlehvän Perinneliitto and the National Archives of Finland.5</p>
      <p>The interviews were carried out in Finnish, which is more challenging for natural language
processing than for example English: its rich morphology6 is burdensome for lemmatization
and Named Entity Recognition (NER). Currently, Finnish is one of the most well-resourced
3https://www.europeana.eu/
4https://voikko.puimula.org/
5More information about the WarMemoirSampo project can be found at: https://seco.cs.aalto.fi/projects/
war-memoirs/en</p>
      <p>
        6Finnish contains around 15 cases and various sufixes, which result in a number of surface forms that may
reach well over 2000 for a single word.
languages for Natural Language Processing (NLP) [
        <xref ref-type="bibr" rid="ref15">42</xref>
        ]. However, since the interviews were
realized in colloquial, dialectal Finnish, not even the state-of-the-art speech-to-text algorithms
proved robust enough to cope with them: according to our test, only the standardized general
language (yleiskieli) produces adequate results presently, in contrast to colloquial and regional
variants7.
      </p>
      <p>The free-form notes made by the interviewers during the conversations were used as the
primary textual data. They do vary in scope and were not originally meant to be used as
sources of information the way WarMemoirSampo requires. Their sentences are not necessarily
complete and grammatically correct; abbreviations also abound. This aggravates the challenge
posed by Finnish, especially regarding lemmatization and entity recognition.</p>
      <p>The source data also contains interviewees’ name, date and place of birth, length and place
of the interviews, links for the respective interview videos on the YouTube platform, and other
metadata. The notes for each interview are divided into rows, roughly corresponding to a
sentence; they also feature a timestamp related to their location in the YouTube video. These
timestamps are nevertheless not unique, since several consecutive rows usually present the
same timestamp, marking the start of the first row in the group. For WarMemoirSampo these
rows were grouped together, resulting in several minutes long video segments, which became
our main data unit for this project.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Building the Facet Ontologies</title>
      <p>The facet ontologies are mostly built by applying NER to the source data, which extracts
categorized entities mentioned in the interview notes. This approach is further advanced by
applying two separate pre-existing NEL tools on the interview notes, which link to Wikidata
and to the WarSampo LOD service on Finnish Second World War history [29, 23] based on the
mentions of known entities. The idea for using the NEL approaches in addition to NER, is to
be able to show links to further information about the topics that are being discussed in the
interviews at a certain time, and also to enrich entity information with alternative labels to be
able to automatically reconcile known duplicate entities. For example, a person can be referred
to with multiple names which is impossible for NER to handle, as is the case for Mr. “Aarne
Juutilainen”, also known and mentioned as “Saharan kauhu” (“The dread of Sahara”). After the
NER and NEL steps, the created entities are reconciled into six separate lightweight ontologies.
This process is depicted in Fig. 1.</p>
      <sec id="sec-4-1">
        <title>4.1. NER Entities and Categories</title>
        <p>
          Named entity recognition was performed using the Secompling library8, which aims at providing
an integrated interface for various Natural Language Processing (NLP) tools in Finnish. The
module responsible for lemmatization and NER employs two tools developed by the TurkuNLP
research group9: Neural Parser pipeline [
          <xref ref-type="bibr" rid="ref16">43</xref>
          ] and Finnish NER [
          <xref ref-type="bibr" rid="ref10">37</xref>
          ], which are based on the BERT
        </p>
        <sec id="sec-4-1-1">
          <title>7There is an ongoing efort to collect diferent kinds of Finnish speech: www.lahjoitapuhetta.fi 8https://version.aalto.fi/gitlab/seco/secompling 9https://turkunlp.org/</title>
          <p>
            model FinBERT [
            <xref ref-type="bibr" rid="ref11">38</xref>
            ], also developed by the TurkuNLP group. The parser outputs CoNLL-U10
documents, which include lemma, part-of-speech (POS) tags, and dependency relations. The
NER tool labels each word with an IOB tag to mark the Inside, Outside or Beginning of the
entities, and a category tag.
          </p>
          <p>Secompling combines the results obtained via both tools in order to fix errors made by either
one. For example, at times their tokenization difers: a word that is split by the NER tool is not
split by the parser. This may create conflicts if the latter marks it as a proper name but the
former does not provide tags for any of its parts. On the other hand, a single entity may be
assigned diferent tags. Wrongly tokenized words may also afect the quality of the POS tags. It
is possible to fix some of these errors by aligning the results and combining them heuristically.
A total of 1220 documents were altered to produce better parsing. In almost all cases, tokens
10https://universaldependencies.org/format.html
were split so that they started with a punctuation mark and were assigned the wrong POS tag.
Separating these punctuation marks improved the results.</p>
          <p>Aligning the results also provides lemmatized forms for the recognized named entities. This
is indispensable, since the NER tool only tags words found in the text without analyzing them
any further. However, the morphological richness of Finnish means that words very often do
not appear in their basic form, so they must be lemmatized. The lemmatization module in
Secompling uses the third-party libraries Voikko and uralicNLP (whose backend is Omorfi) in
order to check and possibly fix the basic forms provided by the parser, using their POS as input.</p>
          <p>Out of the 18 categories recognized by the NER tool, only the following were used as classes:
Person, Product, Organization, Event, and a combination of Location (LOC), Facilities (FAC)
and Geopolitical entities (GPE) as Place. An attempt is made to internally link person entities
containing only one word (presumably either first name or surname) to corresponding
fullnamed entities previously detected in the same text; failing that, they are ignored. Moreover,
military units were systematically removed from the results, since the recognition provided by
the Warsa-linkers tool used in WarSampo was deemed suficient for this project.</p>
          <p>
            Specific lemmatization heuristics are applied to named entities, since they are often formed by
more than one word with diferent inflections 11. A simple ad-hoc evaluation with 100 random
entities indicates that the Neural parser has a lemmatization accuracy of 0.84 when it comes
to named entities, which is a significant depart from the scores of 0.951–0.972 reported in [
            <xref ref-type="bibr" rid="ref14">41</xref>
            ]
for datasets of texts in Finnish. This is probably due to the overall lack of multi-word entity
lemmatization and the typographic errors in the texts. Secompling raises this entity-specific
score to 0.96. However, at this point Secompling is not able to perform external linking, therefore
it cannot diferentiate accurate named entities from errors in recognition and/or lemmatization.
As a compensatory measure, manual correction of entities is used, so that entities can be
removed, or their lemma and/or category can be changed.
          </p>
          <p>During the transformation a set of correction rules were applied to the results in order to
remove and correct the entities to be shown in facets: Ambiguous single-name person entities
were discarded, which amount to over 500. A total of 336 entities were blacklisted, including 182
organizations (which overlap significantly with locations, when the entities are facilities such
as hospitals, airports, etc.; they might be recognized as either one), 63 locations (mostly related
to military units), 30 events (many overlap with locations) and 61 products (mostly deemed
irrelevant). For four entities the category was changed, in 221 cases labels were changed, and in
45 cases both the category and the label had to be changed.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Parallel NEL Linking to Wikidata and WarSampo</title>
        <p>In addition to the NER process, two separate NEL processes are used directly on the source data.
The NEL processes are based solely on the texts in the interview notes.</p>
        <p>WarSampo. The WarSampo system includes a data and ontology infrastructure for
representing Finland in World War II. The WarSampo knowledge graph (KG) [23] consists of around
100 000 persons, 51 000 places, 16 000 military units, 166 000 photographs, 26 000 war diaries,
and 3000 war veteran memoir articles, among others, with large amounts of links between
11For example, genitive followed by nominative, as in Helsingin päärautatieasema ’Helsinki Central Station’
entities. The persons contains the casualty register [44] of the National Archives of Finland,
containing detailed information about all 94 700 people killed in action in Finland during WW2,
as well as a person register of all of the 4200 Finnish prisoners of war [45] and 5600 notable
individuals from other data sources (Wikipedia, etc.) [46]. WarSampo also uses the oficial
Place Name Register of Finland by National Survey as an external domain ontology, served on a
separate SPARQL endpoint.</p>
        <p>In WarSampo, textual event and photograph descriptions were linked to persons, places,
and military units using the Warsa-linker tool [47]. In WarMemoirSampo, Warsa-linker was
re-used to link mentions of persons, places, and military units to corresponding entities in the
WarSampo KG. After the NEL phase, metadata of the linked entities was pulled with SPARQL
queries from the WarSampo KG, and new entities were created to WarMemoirSampo based on
the retrieved metadata, with links to the original entities.</p>
        <p>Wikidata. In order to link and identify entities with the Wikidata knowledge base, the
named entity linking tool Nelli [48, 49] was utilized. In the WarMemoirSampo context, Nelli
is configured to first lemmatize texts using the Turku Neural Parser pipeline and then to link
the results with ARPA [50]. ARPA is a configurable entity linking tool that can be configured
for diferent services; in the case of WarMemoirSampo, the tool was configured to link places,
units, and people to Wikidata.</p>
        <p>In the interviews, the veterans mention some of their comrades in arms and superiors. It is
easy to identify and link the most notable oficers, however, disambiguating regular soldiers
is hard. The latter are often mentioned by only first name and/or surname and there can be
several namesakes, so that more information would have been necessary to diferentiate them.
Therefore, the linking strategy is centered around notable figures present in Wikidata.</p>
        <p>In terms of place linkage, several Wikidata ARPA configurations are used to link places
such as continents, countries, cities, as well as smaller places such as towns and villages. It
can be challenging to match former Finnish place names to Wikidata due to varying practices
in classifying place entities (e.g., towns or urban settlements) or missing labels, e.g., entities
Paanajärvi or Muolaa in Wikidata. Moreover, places that were annexed to Russia often changed
their names from Finnish to Russian, and some place names were lacking their original names
mentioned in the interviews. When such clear shortcomings in Wikidata were identified by
manual inspection, we contributed to Wikidata by adding the missing Finnish names.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Reconciling Entities Between NER and NEL</title>
        <p>After applying NER and two diferent NEL tools, there are a large number of duplicate entities.
Most of these are due to the diferent processes finding entities for the same words in the text,
whereas in some cases one entity is referred to with multiple names. The numbers of entities
found by class and source are given in Table 1. The “Entity Class” column corresponds to
the entity class, “NER” corresponds to the number of entities created with the NER process,
“WarSampo” and “Wikidata” correspond to the number of entities created from the respective
NEL sources.</p>
        <p>The generated named entities from diferent steps are disambiguated between the diferent
data sources based on 1) matching the sets of URIs received for them from LOD data sources,
and 2) matching entity types and names. The entities are then reconciled by merging the found
duplicates. The numbers of disambiguated and reconciled named entities per entity class are
shown in Table 1 column “Reconciled Entities”.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluating the Facet Ontologies</title>
      <p>The six facet ontologies consist of entities of the corresponding entity class with label and
source information, but no links between the entities. The entities from the NEL processes
also contain links to the entities in external systems and metadata pulled from them. The facet
ontologies by entity class are:
1. Places contain 1840 place entities (found in the notes of all 159 interviews) from NER,
WarSampo, and Wikidata. NER recognizes a broad range of places into this category,
including countries, municipalities, cities, towns, geographical formations like islands and
rivers, and buildings such as churches, train stations, and airfields. The WarSampo linking
links to all place types in the WarSampo KG, consisting of municipalities, cities, and
various types of smaller geographical locations. The Wikidata linking links to continents,
countries, cities, and municipalities. The place entities with most links from the interviews
are important locations in the war: The Karelian Isthmus, Finland, Helsinki, Vyborg, Svir.
On the other hand, there is a long tail of places with varying relevance and some clear
errors.
2. Persons contain 310 person entities (found in the notes of 126 out of 159 interviews)
from NER, WarSampo, and Wikidata. The most referenced entities are prominent figures
during the wartime, including, e.g., the Finnish commander-in-chief C.G.E. Mannerheim,
Joseph Stalin, and Finnish commissioned oficers. Due to the rather small size, it was
possible to clean this facet ontology of erroneous entities. However, one of the most
referenced entities “Ehnrooth” is actually a group of several prominent Finnish
commissioned oficers with that surname. In some cases the first names are used as well and these
are disambiguated to a specific Ehnrooth. In other cases, such as mentions containing
only a surname, it is not possible to disambiguate person entities, except in cases like
“Mannerheim”, which can be disambiguated with high confidence.
3. Military Units contain 117 entities (found in the notes of 122/159 interviews) from
WarSampo and Wikidata. The entities are mostly diferent infantry regiments, but also
contain some often referenced general entities like reserve oficer school, headquarters,
and air force. As they come only from applying NEL to curated contents, there are no
clear errors present and the entities don’t seem to contain duplicates.
4. Organizations contain 610 entities (found in the notes of 155/159 interviews) from
NER and Wikidata. Almost all of the entities come from NER and the scope of the
organizations seems to be very broad, including a few often referenced entities like the
White Guard, Wafen-SS, and the Finnish War Veteran Association. In addition to the few
entities with a large amount of mentions, there is a long tail of organizations, such as
schools, companies, museums, travel agencies,and clubs, for many of which it is dificult
to ascertain their identity without listening to how they are referenced in the actual
interviews and trying to find information from external sources. Some of these seem like
errors in NER categorization, but could be names of, e.g., companies instead. Due to the
high number of entities and the efort it takes to study their identities, this ontology was
left rather unfinished, but with some of the clear errors removed. Also the NER assigned
many military units into this class, which were removed due to them being found best
with the WarSampo linking.
5. Events contain 66 entities (found in the notes of 151/159 interviews) from NER. The most
referenced entities are very relevant: Winter War, Continuation War, Lapland War, and
Civil War. Half of the entities are battles, usually named after the place of occurrence. As
the number of entities is low, it was easy to manually curate the list.
6. Products contain 51 entities (found in the notes of 25/159 interviews) from NER. These
have a low number of references overall, and there are no entities with high number of
references. The ones with most references (3) are military aircraft makes and models.
There are also car brands, medals, and a variety of entities, which, like in the case of
organizations, would require plenty of efort to study their identities, and some of them
are probably errors.</p>
      <p>Due to the fully separate NER and NEL processes, in some cases we ended up creating multiple
named entities from one entity mention in the text. This occurs when locations like “Iisalmi
church” are mentioned, of which we get named entities for both “Iisalmi” (a municipality) and
“Iisalmi church”.</p>
      <p>After the ontologies were initially created and the WarMemoirSampo Portal set up, it was
possible to observe the facet ontologies in use and to spot errors and irrelevant entities. Several
iterations of adjusting the entity blacklists were done at this point. In this iterative process of
improving the ontologies, it was found out that in some cases the inspection of whether an entity
is having the right class required listening to the interviews to get the context what is being
discussed. This would have been diferent had we been working with the actual transcriptions
of the interviews instead of concise and varying notes.</p>
      <p>The results of this process were evaluated by inspecting the final named entities assigned
to twenty random interview segments. They contain a total of 99 recognized entities. Out of
these, 88 were exact identifications and three entities were doubly recognized, as explained
above. Additionally, two entities were mistakenly linked12, two mistakenly recognized and four
received the wrong category; and five were not recognized. This adds up to a Precision of 0.919
12One entity was disregarded, since a typographical error led to a wrong identification: Turkin saari instead of
Turkinsaari. Entity without links (a total of 5) were considered correct.
(or 0.889 when disregarding double recognitions) and a Recall of 0.875 (0.846), and an F1 score
of 0.897 (0.867).</p>
      <p>A more robust entity linking system could avoid mistaken links, but due to disambiguation
efects it could also help avert erroneous entities and categories. Improving the lemmatization
of named entities could also raise this score in case some entities are not recognized due to their
surface form.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Faceted Search User Interface</title>
      <p>Sampo-UI [12] is a JavaScript framework for building web-based user interfaces for KGs
published in SPARQL endpoints. The ideas and design choices behind Sampo-UI have been gradually
developed in the context of the general Sampo model [51], which have guided the
implementation of a series of public semantic portal prototypes13 since 2002, mostly in the cultural heritage
domain. As faceted search is the key search paradigm in Sampo-UI, it was a natural tool of choice
for implementing a user interface that utilizes the new facet ontologies which are published
within the WarMemoirSampo KG.</p>
      <p>In the current version of Sampo-UI, a new user interface is implemented by writing a set
of JavaScript Object Notation (JSON) configuration files 14 accompanied with corresponding
SPARQL queries15, which are then automatically used as input parameters for assembling a
fully functional modern web application called WarMemoirSampo Portal16, which uses data
directly from the WarMemoirSampo SPARQL endpoint17.</p>
      <p>The simplest solution for delivering the interview videos to users would be a clickable list of
videos with thumbnails, possibly with an option to search for the titles of the videos. The aim of
the WarMemoirSampo Portal is to go far beyond this by enabling search functionalities that
can dig out specific segments inside the long interviews. This way the user can start watching
an interview directly at a point where an entity of interest (e.g. person, place, military unit)
is mentioned. In addition to enhanced search tools, the underlying facet ontologies make it
possible to provide rightly timed contextual links when the user is watching the video. Using
the links the user can learn more18 about the entities mentioned in the current segment of the
interview video.</p>
      <p>At the front page of the portal the user can choose between three distinct faceted search
perspectives:
1. Interviews perspective for searching whole interviews using a free combination of 10
facets.
2. Segments perspective for searching specific segments inside the interviews with same
facets as in the Interviews perspective.
13See the full list of semantic portals based on the Sampo model at https://seco.cs.aalto.fi/applications/sampo.
14https://github.com/SemanticComputing/veterans-web-app/tree/master/src/configs
15https://github.com/SemanticComputing/veterans-web-app/tree/master/src/server/sparql/veterans/sparql_
queries
16The portal is published at https://sotamuistot.arkisto.fi
17The SPARQL endpoint is available at https://ldf.fi/warmemoirsampo/sparql
18Links to WarSampo and Wikipedia are provided as explained in subsection 4.2.
3. Directory perspective for searching all automatically recognized named entities mentioned
on in the interview notes, and accessing the segments where the entity is mentioned.</p>
      <p>Fig. 2 illustrates how the five video segments in which the place Äänislinna (Petrozavodsk) and
the military unit JR 50 are both mentioned can be found using two facet selections. In addition
to faceted search, the underlying facet ontologies can be used for implementing exploratory
search functionalities. As an example of this, Fig. 3 portrays how the 4566 distinct geographical
references within the segments are plotted on an interactive map. Clicking a marker on the
map opens a popup which provides links to all segments where the place is mentioned.</p>
      <p>Faceted search ofers a few key benefits to the users of WarMemoirSampo that would be
missing if only traditional text search would be available. Notably, with faceted search the
user does not need to know what she is searching for. This makes it possible to quickly find
unexpected interesting things in the interviews. For example while browsing through the values
of the Mentioned Product facet the user may pay attention that the famous Finnish spreadable
cheese Koskenlaskija is mentioned in one interview segment. By clicking the search result the
user can watch the segment, which happens to include a story about Koskenlaskija cheese
saving a person’s life19.</p>
      <p>Because the faceted search shows the user the hit counts for each option in the facets, the
user can also get a general idea about the things that the interviewees are talking about with
almost a single glance. For example, in the Mentioned place facet most hits are, unsurprisingly,
19The interviewed veteran gave his comrade a Koskenlaskija cheese packet and the comrade moved away from
his normal battle station to eat the cheese, just before a bomb hit and killed the other person in that station.</p>
      <p>Results: 5 segments that mention the place Äänislinna AND military unit JR 50
10 facets for
semantic
searching</p>
      <p>Links to
segments where</p>
      <p>Venezuela is
mentioned
certain places in Karelia where most important events of the war took place, and where most
of the soldiers were fighting. Germany is also mentioned quite often, interestingly more often
than Russia or Soviet Union. On the other hand, when looking at places with only one or few
hits, the user can find some surprising and interesting entities, such as Brazil mentioned in an
interview. While this kind of comparison can be useful, it should not be taken too seriously, as
the entities are disambiguated only based on the entity name and class. For example, for some
place names, there are actually multiple places with the same name and in these cases they are
grouped together as one entity. The main goal of finding named entities in WarMemoirSampo
is to facilitate the search, which makes these issues less relevant than they would be in some
other cases.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this paper, we have presented the idea of building lightweight ontologies for faceted search
with NER, and shown how state-of-the-art NER and NEL tools ware used in the
WarMemoirSampo system to create good enough lightweight ontologies to provide facet options in a
faceted search user interface. This approach follows the trend in Semantic Web research to
shift from the heavy use of formal semantics towards leveraging the collection of distributed,
heterogeneous data using light-weight ontologies [52].</p>
      <p>In the entity recognition task, we achieved an F1 score of 0.90 for a small sample of 20
interview segments, discounting wrong links, entities and categories. Our lemmatization task
had accuracy of 0.96 within a sample of 100 entities. Improving lemmatization could have
positive knock-on efects on NER; so could enhancing the linking system.</p>
      <p>We have shown how to find entities from Finnish textual data and categorize them with
high enough recall and precision to be useful for building facet ontologies in practice, without
involving considerable manual domain ontology engineering. However, the created facet
ontologies would still benefit from some post-processing to remove erroneous and non-relevant
entities, and from organizing them into hierarchies, which would facilitate search using higher
level categories. However, in our case this was deemed not necessary as the number of entities
is not very large.</p>
      <p>The WarMemoirSampo data is relatively small, containing ca. 323 000 triples. The tools used
in this study would scale up for a few times larger datasets without any efort. However, the
NER tool, which is the key component for our approach, would scale even better. The amount
of manual work required with the facet ontologies might also increase as the source data size
increases, but the extent of this would depend on the data. In the future, we will further research
this approach of using NER to extract entities for building facet ontologies in diferent contexts.</p>
      <p>In the cultural heritage domain, transforming data into Linked Data often requires manually
curated data transformation processes that align the data with the used data models and
ontologies according to defined rules [ 6, 8, 53, 23]. The ontologies are either manually engineered
or in some cases pre-existing vocabularies and ontologies may be used that fit the needs of
the data publication. The approach of building facet ontologies with NER presented in this
paper could significantly reduce the efort required to create Linked Data from textual datasets.
As facet ontologies can be built from data bottom-up, it becomes easier to implement faceted
search for searching, browsing and visualizing the data.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>Ilpo Murtovaara, Markus Merenmies, and Kare Salonvaara contributed in creating the videos
and their metadata used in the work of this paper. Tammenlehvän Perinneliitto ry with the
National Archives of Finland funded our work. The work is partly related and funded by the EU
project InTaVia: In/Tangible European Heritage20, and the EU COST action Nexus Linguarum21
on linguistic data science. The authors acknowledge CSC – IT Center for Science, Finland, for
providing for computational resources.
20https://intavia.eu/
21https://nexuslinguarum.eu/the-action
[4] G. M. Sacco, Dynamic taxonomies for intelligent information access, in: M. Khosrow-Pour,
D.B.A. (Ed.), Encyclopedia of Information Science and Technology, Third Edition, Hershey,
PA: IGI Global, 2015, pp. 3883–3892. doi:10.4018/978-1-4666-5888-2.ch382.
[5] E. Hyvönen, S. Saarela, K. Viljanen, Application of ontology techniques to
viewbased semantic search and browsing, in: The Semantic Web: Research and
Applications. Proceedings of the First European Semantic Web Symposium (ESWS 2004), 2004.
doi:10.1007/978-3-540-25956-5\_7.
[6] E. Hyvönen, E. Mäkelä, M. Salminen, A. Valo, K. Viljanen, S. Saarela, M. Junnila, S. Kettula,
MuseumFinland—Finnish museums on the Semantic Web, Journal of Web Semantics 3
(2005) 224–241. doi:10.1016/j.websem.2005.05.008.
[7] E. Mäkelä, E. Hyvönen, T. Sidorof, View-based user interfaces for information retrieval
on the Semantic Web, in: Proceedings of the ISWC 2005 Workshop on End User Semantic
Web Interaction, CEUR Workshop Proceedings, 2005. Vol. 172.
[8] G. Schreiber, A. Amin, L. Aroyo, M. van Assem, V. de Boer, L. Hardman, M. Hildebrand,
B. Omelayenko, J. van Osenbruggen, A. Tordai, J. Wielemaker, B. Wielinga, Semantic
annotation and search of cultural-heritage collections: The MultimediaN E-Culture
demonstrator, Journal of Web Semantics 6 (2008) 243–249. doi:10.1016/j.websem.2008.08.001.
[9] Y. Tzitzikas, N. Manolis, P. Papadakos, Faceted exploration of RDF/S datasets: a
survey, Journal of Intelligent Information Systems 48 (2017) 329–364. doi:10.1007/
s10844-016-0413-8.
[10] M. Hildebrand, J. Van Ossenbruggen, L. Hardman, /facet: A browser for heterogeneous
Semantic Web repositories, in: I. Cruz, S. Decker, D. Allemang, C. Preist, D. Schwabe,
P. Mika, M. Uschold, L. M. Aroyo (Eds.), The Semantic Web – ISWC 2006: 5th International
Semantic Web Conference, volume 4273 of Lecture Notes in Computer Science, Springer,
Berlin, Heidelberg, 2006, pp. 272–285. doi:10.1007/11926078\_20.
[11] M. Koho, E. Heino, E. Hyvönen, SPARQL Faceter – client-side faceted search based on
SPARQL, in: Joint Proceedings of the 4th International Workshop on Linked Media and the
3rd Developers Hackshop, CEUR Workshop Proceedings, 2016. URL: http://www.ceur-ws.
org/Vol-1615, vol. 1615.
[12] E. Ikkala, E. Hyvönen, H. Rantala, M. Koho, Sampo-UI: A full stack JavaScript framework
for developing semantic portal user interfaces, Semantic Web – Interoperability, Usability,
Applicability 13 (2022) 69–84. doi:10.3233/SW-210428.
[13] C. Bizer, T. Heath, T. Berners-Lee, Linked Data – the story so far, International Journal
on Semantic Web and Information Systems (IJSWIS) 5 (2009) 1–22. doi:10.4018/jswis.
2009081901.
[14] T. R. Gruber, A translation approach to portable ontology specifications, Knowledge
acquisition 5 (1993) 199–220.
[15] M. C. Suárez-Figueroa, A. Gómez-Pérez, M. Fernández-López, The NeOn methodology for
ontology engineering, in: Ontology Engineering in a Networked World, Springer, Berlin,
Heidelberg, 2012, pp. 9–34. doi:10.1007/978-3-642-24794-1\_2.
[16] C. Fluit, M. Sabou, F. Van Harmelen, Supporting user tasks through visualisation of
light-weight ontologies, in: Handbook on Ontologies, Springer, Berlin, Heidelberg, 2004,
pp. 415–432.
[17] F. Giunchiglia, I. Zaihrayeu, Lightweight ontologies, in: L. Liu, M. T. Özsu (Eds.),
Encyclopedia of Database Systems, Springer, Boston, MA, 2009, pp. 1613–1619. doi:10.1007/
978-0-387-39940-9\_1314.
[18] L. Buitinck, M. Marx, Two-stage named-entity recognition using averaged perceptrons,
in: G. Bouma, A. Ittoo, E. Métais, H. Wortmann (Eds.), Natural Language Processing and
Information Systems, volume 7337 of Lecture Notes in Computer Science, Springer Berlin
Heidelberg, 2012, pp. 171–176.
[19] R. Leal, H. Rantala, M. Koho, E. Ikkala, M. Merenmies, E. Hyvönen, WarMemoirSampo: A
semantic portal for war veteran interview videos, in: 6th Digital Humanities in Nordic
and Baltic Countries Conference, short paper, 2022. Forth-coming.
[20] W. Shen, J. Wang, J. Han, Entity linking with a knowledge base: Issues, techniques, and
solutions, IEEE Transactions on Knowledge and Data Engineering 27 (2014) 443–460.
doi:10.1109/TKDE.2014.2327028.
[21] B. Hachey, W. Radford, J. Nothman, M. Honnibal, J. R. Curran, Evaluating entity linking
with Wikipedia, Artificial Intelligence 194 (2013) 130–150. doi: 10.1016/j.artint.2012.
04.005.
[22] R. C. Bunescu, M. Pasca, Using encyclopedic knowledge for Named Entity Disambiguation,
in: Proceedings of the 11th Conference of the European Chapter of the Association for
Computational Linguistics (EACL-06), volume 6, Association for Computational Linguistics,
Trento, Italy, 2006, pp. 9–16.
[23] M. Koho, E. Ikkala, P. Leskinen, M. Tamper, J. Tuominen, E. Hyvönen, WarSampo
Knowledge Graph: Finland in the Second World War as Linked Open Data, Semantic Web –
Interoperability, Usability, Applicability 12 (2021) 265–278. doi:10.3233/SW-200392.
[24] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez, D. Vrandečić, Introducing Wikidata to
the Linked Data web, in: The Semantic Web – ISWC 2014, volume 8796 of Lecture Notes in
Computer Science, Springer, Cham, 2014, pp. 50–65. doi:10.1007/978-3-319-11964-9\
_4.
[25] J. Hunter, R. Iannella, The application of metadata standards to video indexing, in:
C. Nikolaou, C. Stephanidis (Eds.), Research and Advanced Technology for Digital Libraries,
Springer, Berlin, Heidelberg, 1998, pp. 135–156.
[26] C. Ribeiro, M. L. Mucheroni, Dynamic indexation in video metadata, Procedia-Social and</p>
      <p>Behavioral Sciences 73 (2013) 551–555.
[27] E. Hyvönen, E. Ikkala, M. Koho, R. Leal, H. Rantala, M. Tamper, How to search and
contextualize scenes inside videos for enriched watching experience: Case stories of
the Second World War veterans, in: Proceedings of the 19th Extended Semantic Web
Conference (ESWC 2022), Poster and Demo papers, Springer, 2022. URL: https://seco.cs.
aalto.fi/publications/2022/hyvonen-et-al-wms-2022.pdf, forth-coming.
[28] S. Dumont, correspSearch – connecting scholarly editions of letters, Journal of the Text</p>
      <p>Encoding Initiative 10 (2016). doi:10.4000/jtei.1742.
[29] E. Hyvönen, E. Heino, P. Leskinen, E. Ikkala, M. Koho, M. Tamper, J. Tuominen, E. Mäkelä,
WarSampo data service and semantic portal for publishing Linked Open Data about the
Second World War history, in: The Semantic Web – Latest Advances and New Domains
(ESWC 2016), Springer, 2016, pp. 758–773. doi:10.1007/978-3-319-34129-3\_46.
[30] E. Hyvönen, E. Mäkelä, T. Kauppinen, O. Alm, J. Kurki, T. Ruotsalo, K. Seppälä, J. Takala,
K. Puputti, H. Kuittinen, K. Viljanen, J. Tuominen, T. Palonen, M. Frosterus, R. Sinkkilä,
Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association
for Computational Linguistics, 2018.
[44] M. Koho, E. Hyvönen, E. Heino, J. Tuominen, P. Leskinen, E. Mäkelä, Linked death —
representing, publishing, and using Second World War death records as Linked Open Data,
in: E. Blomqvist, K. Hose, H. Paulheim, A. Ławrynowicz, F. Ciravegna, O. Hartig (Eds.), The
Semantic Web: ESWC 2017 Satellite Events, volume 10577 of Lecture Notes in Computer
Science, Springer, Cham, 2017, pp. 369–383. doi:10.1007/978-3-319-70407-4\_45.
[45] M. Koho, E. Ikkala, E. Hyvönen, Reassembling the lives of Finnish prisoners of the Second
World War on the Semantic Web, in: Proceedings of the Third Conference on Biographical
Data in the Digital Age (BD 2019), CEUR Workshop Proceedings, 2020. In press.
[46] P. Leskinen, M. Koho, E. Heino, M. Tamper, E. Ikkala, J. Tuominen, E. Mäkelä, E. Hyvönen,
Modeling and using an actor ontology of Second World War military units and
personnel, in: C. d’Amato, M. Fernandez, V. Tamma, F. Lecue, P. Cudré-Mauroux, J. Sequeda,
C. Lange, J. Heflin (Eds.), The Semantic Web – ISWC 2017: 16th International Semantic
Web Conference, volume 10588 of Lecture Notes in Computer Science, Springer, Cham, 2017,
pp. 280–296. doi:10.1007/978-3-319-68204-4\_27.
[47] E. Heino, M. Tamper, E. Mäkelä, P. Leskinen, E. Ikkala, J. Tuominen, M. Koho, E. Hyvönen,
Named Entity Linking in a complex domain: Case second world war history, in: J. Gracia,
F. Bond, J. P. McCrae, P. Buitelaar, C. Chiarcos, S. Hellmann (Eds.), Language, Data, and
Knowledge: First International Conference, LDK 2017, volume 10318 of Lecture Notes in
Computer Science, Springer, Cham, 2017. doi:10.1007/978-3-319-59888-8\_10.
[48] M. Tamper, A. Oksanen, J. Tuominen, A. Hietanen, E. Hyvönen, Automatic annotation
service appi: Named entity linking in legal domain, in: The Semantic Web: ESWC 2020
Satellite Events, Springer, 2020, pp. 110–114. doi:10.1007/978-3-030-62327-2\_36.
[49] M. Tamper, E. Hyvönen, P. Leskinen, Visualizing and analyzing networks of named
entities in biographical dictionaries for digital humanities research, in: Proceedings
of the 20th International Conference on Computational Linguistics and Intelligent Text
Processing (CICLing 2019), Springer, 2019. URL: https://seco.cs.aalto.fi/publications/2021/
tamper-et-al-cicling-2021.pdf, forth-coming.
[50] E. Mäkelä, Combining a REST lexical analysis web service with SPARQL for mashup
semantic annotation from text, in: The Semantic Web: ESWC 2014 Satellite Events,
volume 8798 of Lecture Notes in Computer Science, Springer, Cham, 2014, pp. 424–428.
doi:10.1007/978-3-319-11955-7\_60.
[51] E. Hyvönen, Digital humanities on the Semantic Web: Sampo model
and portal series, Semantic Web – Interoperability, Usability,
Applicability (2022). URL: http://www.semantic-web-journal.net/content/
digital-humanities-semantic-web-sampo-model-and-portal-series-0, forth-coming.
[52] A. Bernstein, J. Hendler, N. Noy, A new look at the Semantic Web, Communications of
the ACM, New York, NY, USA 59 (2016) 35–37. doi:10.1145/2890489.
[53] M. Koho, T. Burrows, E. Hyvönen, E. Ikkala, K. Page, L. Ransom, J. Tuominen, D. Emery,
M. Fraas, B. Heller, D. Lewis, A. Morrison, G. Porte, E. Thomson, A. Velios, H. Wijsman,
Harmonizing and publishing heterogeneous pre-modern manuscript metadata as Linked
Open Data, Journal of the Association for Information Science and Technology (JASIST)
73 (2021) 240–257. doi:10.1002/asi.24499.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hearst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elliott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>English</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Swearingen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-P.</given-names>
            <surname>Yee</surname>
          </string-name>
          ,
          <article-title>Finding the flow in web site search</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>45</volume>
          (
          <year>2002</year>
          )
          <fpage>42</fpage>
          -
          <lpage>49</lpage>
          . doi:
          <volume>10</volume>
          .1145/567498.567525.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tunkelang</surname>
          </string-name>
          , Faceted Search,
          <source>Synthesis lectures on information concepts</source>
          , retrieval, and services, Morgan &amp; Claypool Publishers,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Pollitt</surname>
          </string-name>
          ,
          <article-title>The key role of classification and indexing in view-based searching</article-title>
          ,
          <source>Technical Report, Centre for Database Access Research</source>
          ,University of Huddersfield,
          <year>1998</year>
          . URL: http: //www.ifla.org/IV/ifla63/63polst.pdf. P. Paakkarinen,
          <string-name>
            <given-names>J.</given-names>
            <surname>Laitio</surname>
          </string-name>
          , K. Nyberg, CultureSampo
          <article-title>- a national publication system of cultural heritage on the Semantic Web 2.0</article-title>
          , in:
          <source>Proceedings of the 6th European Semantic Web Conference (ESWC</source>
          <year>2009</year>
          ), Springer,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gordea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Paramita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Isaac</surname>
          </string-name>
          ,
          <article-title>Named entity recommendations to enhance multilingual retrieval in Europeana.eu, in: Foundations of Intelligent Systems</article-title>
          .
          <source>ISMIS</source>
          <year>2020</year>
          ., volume
          <volume>12117</volume>
          of Lecture Notes in Computer Science, Springer, Cham,
          <year>2020</year>
          , pp.
          <fpage>102</fpage>
          -
          <lpage>112</lpage>
          . doi:
          <volume>10</volume>
          . 1007/978-3-
          <fpage>030</fpage>
          -59491-6\_
          <fpage>10</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>V.</given-names>
            <surname>Petras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gäde</surname>
          </string-name>
          ,
          <article-title>Europeana - a search engine for digitised cultural heritage material</article-title>
          ,
          <source>Datenbank-Spektrum</source>
          <volume>17</volume>
          (
          <year>2017</year>
          )
          <fpage>41</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>A.</given-names>
            <surname>Brandsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Verberne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lambers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wansleeben</surname>
          </string-name>
          , Can BERT dig it?
          <article-title>- named entity recognition for information retrieval in the archaeology domain</article-title>
          ,
          <source>Journal on Computing and Cultural Heritage</source>
          (
          <year>2021</year>
          ). doi:
          <volume>10</volume>
          .1145/3497842.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tamper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Leskinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ikkala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oksanen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mäkelä</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Heino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tuominen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koho</surname>
          </string-name>
          , E. Hyvönen, AATOS
          <article-title>- a configurable tool for automatic annotation</article-title>
          ,
          <source>in: Proceedings, Language, Data and Knowledge (LDK</source>
          <year>2017</year>
          ), volume
          <volume>10318</volume>
          of Lecture Notes in Computer Science, Springer, Cham,
          <year>2017</year>
          , pp.
          <fpage>276</fpage>
          -
          <lpage>289</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>319</fpage>
          -59888-8\_
          <fpage>24</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Finkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Grenager</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Incorporating non-local information into information extraction systems by Gibbs sampling</article-title>
          ,
          <source>in: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05)</source>
          , June,
          <fpage>25</fpage>
          -
          <lpage>30</lpage>
          ,
          <year>2005</year>
          , University of Michigan, Ann Arbor, Michigan, USA, Association for Computational Linguistics,
          <year>2005</year>
          , pp.
          <fpage>363</fpage>
          -
          <lpage>370</lpage>
          . doi:
          <volume>10</volume>
          .3115/1219840.1219885.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ruokolainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kauppinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Silfverberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lindén</surname>
          </string-name>
          ,
          <article-title>A Finnish news corpus for Named Entity Recognition</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          <volume>54</volume>
          (
          <year>2020</year>
          )
          <fpage>247</fpage>
          -
          <lpage>272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>J.</given-names>
            <surname>Luoma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Oinonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pyykönen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Laippala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pyysalo</surname>
          </string-name>
          ,
          <article-title>A broad-coverage corpus for Finnish Named Entity Recognition</article-title>
          ,
          <source>in: Proceedings of the 12th Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>4615</fpage>
          -
          <lpage>4624</lpage>
          . URL: https://aclanthology.org/
          <year>2020</year>
          .lrec-
          <volume>1</volume>
          .
          <fpage>567</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>A.</given-names>
            <surname>Virtanen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kanerva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ilo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luoma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luotolahti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Salakoski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ginter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pyysalo</surname>
          </string-name>
          , Multilingual is not enough:
          <source>BERT for Finnish</source>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/
          <year>1912</year>
          .07076. arXiv:
          <year>1912</year>
          .
          <volume>07076</volume>
          , arXiv.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ruokolainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kettunen</surname>
          </string-name>
          ,
          <article-title>À la recherche du nom perdu-searching for named entities with Stanford NER in a Finnish historical newspaper and journal collection</article-title>
          ,
          <source>in: 13th IAPR International Workshop on Document Analysis Systems</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Pirinen</surname>
          </string-name>
          ,
          <article-title>Development and use of computational morphology of Finnish in the open source and open science era: Notes on experiences with Omorfi development</article-title>
          .,
          <source>SKY Journal of Linguistics</source>
          <volume>28</volume>
          (
          <year>2015</year>
          )
          <fpage>381</fpage>
          -
          <lpage>393</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kanerva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ginter</surname>
          </string-name>
          , T. Salakoski,
          <article-title>Universal lemmatizer: A sequence-to-sequence model for lemmatizing universal dependencies treebanks</article-title>
          ,
          <source>Natural Language Engineering</source>
          <volume>27</volume>
          (
          <year>2021</year>
          )
          <fpage>545</fpage>
          -
          <lpage>574</lpage>
          . doi:
          <volume>10</volume>
          .1017/S1351324920000224.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hämäläinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Alnajjar</surname>
          </string-name>
          ,
          <article-title>The current state of Finnish NLP</article-title>
          ,
          <source>in: Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages, The Association for Computational Linguistics</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>54</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kanerva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ginter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Miekka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Leino</surname>
          </string-name>
          , T. Salakoski,
          <article-title>Turku Neural Parser pipeline: An end-to-end system for the CoNLL 2018 shared task</article-title>
          ,
          <source>in: Proceedings of the CoNLL 2018</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>