<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Modern Tools for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kimmo Kettunen</string-name>
          <email>kimmo.kettunen@helsinki.fi</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eetu Mäkelä</string-name>
          <email>eetu.makela@aalto.fi</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juha Kuokkala</string-name>
          <email>juha.kuokkala@helsinki.fi</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Teemu Ruokolainen</string-name>
          <email>teemu.ruokolainen@aalto.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jyrki Niemi</string-name>
          <email>jyrki.niemi@helsinki.fi</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aalto University, Department of Signal Processing and Acoustics</institution>
          ,
          <addr-line>Espoo</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Aalto University, Semantic Computing Research Group</institution>
          ,
          <addr-line>Espoo</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National Library of Finland, Centre for Preservation and Digitization</institution>
          ,
          <addr-line>Mikkeli</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Helsinki, Department of Modern Languages</institution>
          ,
          <addr-line>Helsinki</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Named entity recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, a NER system's performance is genre- and domain-dependent, and the entity categories used also vary [1]. The most general set of named entities is usually some version of a tripartite categorization into locations, persons and organizations. In this paper we report first trials and an evaluation of NER on data from Digi, a digitized Finnish historical newspaper collection. The Digi collection contains 1,960,921 pages of newspaper material from the years 1771–1910, in both Finnish and Swedish. We use only the Finnish documents in our evaluation. The OCRed newspaper collection contains many OCR errors; its estimated word-level correctness is about 74–75 % [2]. Our principal NER tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We also show results of limited-category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of Aalto University. FiNER achieves up to a 60.0 F-score on named entities in the evaluation data, and SeCo's tools achieve 30.0–60.0 F-scores on locations and persons. The performance of FiNER and SeCo's tools on the data shows that, at best, about half of the named entities can be recognized even in quite erroneous OCRed text.</p>
      </abstract>
      <kwd-group>
        <kwd>named entity recognition</kwd>
        <kwd>historical newspaper collections</kwd>
        <kwd>Finnish</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The National Library of Finland has digitized a large proportion of the historical
newspapers published in Finland between 1771 and 1910 [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. The collection
contains 1,960,921 pages in Finnish and Swedish. The Finnish part of the collection
consists of about 2.39 billion words. The National Library's digital collections are
offered via the digi.kansalliskirjasto.fi web service, also known as Digi. Part of the
newspaper material (years 1771–1874) is freely downloadable from the Language Bank
of Finland1, provided by the FIN-CLARIN consortium. The collection can also be
accessed through the Korp2 environment, which has been developed by Språkbanken at
the University of Gothenburg and extended by the FIN-CLARIN team at the University
of Helsinki to provide concordances of text resources. A Cranfield-style information
retrieval test collection has been produced from a small part of the Digi newspaper
material at the University of Tampere [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>The web service digi.kansalliskirjasto.fi is used, for example, by genealogists,
heritage societies, researchers, and lay history enthusiasts. There is also an increasing
desire to offer the material more widely for educational use. In 2015 the service had
about 14 million page loads. User statistics from 2014 showed that about 88.5 % of the
usage of Digi came from Finland, while 11.5 % came from outside Finland.</p>
      <p>
        Named entity recognition has become one of the basic techniques for information
extraction from texts. In its initial form, NER was used to find and mark semantic
entities such as person, location and organization in texts, to enable information extraction
related to these kinds of entities. Later on, other types of extractable entities, such as time,
artefact, event and measure/numerical, have been added to the repertoires of NER
software [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Our aim in using NER is to provide users of Digi with better means for searching
and browsing the historical newspapers. Different types of names, especially
person names and names of locations, are frequently used as search terms in
newspaper collections [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. If the names are recognized, tagged in the newspaper data, and put into the index,
they can also serve as browsing aids for collections [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
A fine example of the use of name recognition with historical newspapers is La
Stampa's historical newspaper collection3. After a basic keyword search, users can browse or
filter the search results using three basic NER categories: person (authors of
articles or persons mentioned in the articles), location (countries and cities mentioned
in the articles) and organization. Entity annotations of newspaper text thus allow a
more semantically oriented exploration of the content of the large archive. A large-scale
(152 M articles) NER analysis of the Australian historical
newspaper collection Trove, with usage examples, is described in Mac Kim and Cassidy [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
1 https://kitwiki.csc.fi/twiki/bin/view/FinCLARIN/KielipankkiAineistotDigilibPub
2 https://korp.csc.fi/
3 http://www.archiviolastampa.it/
      </p>
    </sec>
    <sec id="sec-2">
      <title>NER Software and Evaluation</title>
      <p>
        For recognition and labelling of named entities we principally use the FiNER software.
SeCo's ARPA is of a different type: it is mainly used for Semantic Web tagging and
linking of entities [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]4, but it could be adapted to basic NER, too. Before choosing
FiNER we also tried a commonly used trainable free tagger, Stanford NER5, but were
not able to obtain reasonable performance from it for our purposes.
      </p>
      <p>FiNER is a rule-based named-entity tagger, which in addition to surface text forms
utilizes grammatical and lexical information from a morphological analyzer
(Omorfi6). FiNER pre-processes the input text with a morphological tagger derived
from Omorfi. The tagger disambiguates Omorfi’s output by selecting the statistically
most probable morphological analysis for each word token, and for tokens not
recognized by the analyzer, guesses an analysis by analogy of word-forms with similar
ending in the morphological dictionary. The use of morphological pre-processing is
crucial in performing NER with a morphologically rich language such as Finnish,
where a single lexeme may theoretically have thousands of different inflectional
forms.</p>
      <p>The focus of FiNER is on recognizing different types of proper names.
Additionally, it can identify the majority of Finnish expressions of time and, e.g., sums of money.
FiNER uses multiple strategies in its recognition task:</p>
      <p>1) Pre-defined gazetteer information about known names of certain types. This
information is mainly stored in the morphological lexicon as additional data tags on the
lexemes in question. For names consisting of multiple words, FiNER rules
incorporate a list of known names not caught by the more general rules.</p>
      <p>2) Several kinds of pattern rules are used to recognize both single- and
multiple-word names based on their internal structure. This typically involves (strings of)
capitalized words ending with a characteristic suffix such as Inc, Corp, Institute etc.
Morphological information is also utilized to avoid erroneously long matches,
since in most cases only the last part of a multi-word name is inflected, while the
other words stay in the nominative (or genitive) case. Thus, preceding capitalized
words in other case forms should be left out of a multi-word name match.</p>
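      <p>As a rough illustration of such a pattern rule, a minimal Python sketch follows. It is not FiNER's actual rule syntax (FiNER rules are HFST Pmatch expressions), and the suffix list is invented for the example:</p>

```python
import re

# Simplified pattern rule: zero or more preceding capitalized words followed
# by a capitalized head word carrying a characteristic organization suffix,
# e.g. "-pankki" (bank), "-yhdistys" (association), or the abbreviation "Oy".
ORG_PATTERN = re.compile(
    r"\b(?:[A-ZÅÄÖ]\w+\s+)*"                         # preceding capitalized words
    r"(?:[A-ZÅÄÖ]\w*(?:yhdistys|pankki)\w*|Oy)\b"    # head word with suffix
)

def find_org_candidates(text):
    """Return leftmost, non-overlapping organization-name candidates."""
    return [m.group(0) for m in ORG_PATTERN.finditer(text)]
```

      <p>A real rule would additionally check the morphological tags of the preceding words, so that capitalized words in case forms other than the nominative or genitive are excluded from the match.</p>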
      <p>3) Context rules are based on lexical collocations, i.e. certain words which
typically or exclusively appear next to certain types of names in text. For example, a string
of capitalized words can be inferred to be a corporation/organization if it is followed
by a verb such as tuottaa (‘produce’), työllistää (‘employ’) or lanseerata (‘launch’ [a
product]), or a personal name if it is followed by a comma- or parenthesis-separated
numerical age or an abbreviation for a political party member.</p>
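      <p>A context rule of this kind can be sketched as follows, again as a simplified Python illustration with a tiny invented verb lexicon rather than FiNER's actual Pmatch rules:</p>

```python
import re

# Verbs that typically follow a corporation/organization name:
# tuottaa 'produce', työllistää 'employ', lanseerata 'launch'.
ORG_VERBS = {"tuottaa", "työllistää", "lanseerata"}

CAP_SPAN = re.compile(r"((?:[A-ZÅÄÖ]\w+\s+)*[A-ZÅÄÖ]\w+)\s+(\w+)")
AGE_CONTEXT = re.compile(r"((?:[A-ZÅÄÖ]\w+\s+)+[A-ZÅÄÖ]\w+),\s*\d+")

def classify_by_context(sentence):
    """Label capitalized spans by their right-hand context."""
    labels = []
    for m in CAP_SPAN.finditer(sentence):
        if m.group(2) in ORG_VERBS:
            labels.append((m.group(1), "EnamexOrgCrp"))
    for m in AGE_CONTEXT.finditer(sentence):
        # a comma-separated numerical age suggests a personal name
        labels.append((m.group(1), "EnamexPrsHum"))
    return labels
```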
      <p>
        The pattern-matching engine that FiNER uses, HFST Pmatch, marks leftmost
longest non-overlapping matches satisfying the rule set (basically a large set of
disjuncted patterns) [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ]. In the case of two or more rules matching the exact same
passage in the text, the choice of the matching rule is undefined. Therefore, more
control is needed in some cases. Since HFST Pmatch did not contain a rule-weighting
mechanism at the time the first release of FiNER was designed, the problem was
solved by applying two runs of distinct Pmatch rulesets in succession. This solves, for
instance, the frequent case of Finnish place names used as family names: in the first
phase, words tagged lexically as place names but matching a personal name context
pattern are tagged as personal names, and the remaining place name candidates are
tagged as places in the second phase. FiNER annotates 15 different entities that
belong to five categories: location, person, organization, measure and time [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
4 An older demo version of the tool is available at http://demo.seco.tkk.fi/sarpa/#/
5 http://nlp.stanford.edu/software/CRF-NER.shtml
6 https://github.com/flammie/omorfi
      </p>
      <p>
        SeCo’s ARPA [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is not actually a NER tool but a dynamic, configurable
entity linker. ARPA does not aim to locate all entities of a particular
type in a text, but rather all entities that can be linked to strong identifiers
elsewhere. Through these, it is then possible, for example, to source coordinates for
identified places, or to associate different name variants and spellings with a single
individual. For the pure entity recognition task presented in this paper, ARPA is thus at a
disadvantage. However, we wanted to see how it would fare in comparison to FiNER.
      </p>
      <p>The core benefits of the ARPA system lie in its dynamic, configurable nature. In
processing, ARPA combines a separate lexical processing step with a configurable
SPARQL-query-based lookup against an entity lexicon stored at a Linked Data
endpoint. Lexical processing for Finnish is done with a modified version of Omorfi7,
which supports historical morphological variants as well as lemma guessing for
out-of-vocabulary words. This separation of concerns allows the system to be speedily
configured both for new reference vocabularies and for the particular dataset to be
processed.</p>
      <sec id="sec-2-1">
        <title>Evaluation Data</title>
        <p>
          As evaluation data for FiNER we used samples from the Digi collection. Kettunen
and Pääkkönen [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] calculated, among other things, the number of words in the data for
different decades. It turned out that most of the newspaper data was published in
1870–1910; the early and mid-19th century had much less published
material. About 95 % of the material was printed in 1870–1910, and most of it, 82.7 %, in
the two decades 1890–1910.
        </p>
        <p>We aimed at an evaluation collection of 150,000 words. To emphasize the
importance of the 1870–1910 material, we took 50 K words from the period 1900–1910,
10 K from 1890–1899, 10 K from 1880–1889, and 10 K from 1870–1879. The remaining
70 K words were picked from the period 1820–1869. Thus the collection
reflects most of the data from the century but is also weighted towards the end of the 19th
century and the beginning of the 20th century.</p>
        <p>
          The final manually tagged evaluation data consists of 75,931 lines, each line
having one word or other character data. The word accuracy of the evaluation sample is
on the same level as the whole newspaper collection's word-level quality: about 73 %
of the words can be recognized by a modern Finnish morphological analyzer [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. 71 % of the tagger's input snippets have five or more words; the rest have
fewer than five words in the text snippet.
7 https://github.com/jiemakel/omorfi
        </p>
        <sec id="sec-2-1-1">
          <p>
            FiNER's 15 tags for different types of entities are too fine-grained for our
purposes. Our first aim was to concentrate only on locations and person names, because
they are the categories most used in searches of the Digi collection, as was detected
in an earlier log analysis [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. After reviewing some of the FiNER-tagged material, we also included
three other tags, as they seemed important and occurred frequently enough
in the material. The final eight chosen tags are shown and explained below.
          </p>
          <p>Entity/tag and meaning:
1. &lt;EnamexPrsHum&gt;: person
2. &lt;EnamexLocXxx&gt;: general location
3. &lt;EnamexLocGpl&gt;: geographical location
4. &lt;EnamexLocPpl&gt;: political location (state, city etc.)
5. &lt;EnamexLocStr&gt;: street, road, street address
6. &lt;EnamexOrgEdu&gt;: educational organization
7. &lt;EnamexOrgCrp&gt;: company, society, union etc.
8. &lt;TimexTmeDat&gt;: expression of time</p>
          <p>The final entities show that our interest is mainly in the three most used semantic
NER categories: persons, locations and organizations. With locations we use two
subcategories and with organizations one. Temporal expressions were included in the tag
set due to their general interest in the newspaper material.</p>
          <p>
            Manual tagging of the evaluation material was done by the fourth author, who had
previous experience of tagging modern Finnish with the FiNER tag set.
Tagging took one month; the quality and principles of the tagging were discussed
beforehand on the basis of a sample of 2,000 lines of evaluation data. It was agreed, for
example, that words that are misspelled but recognizable to the human tagger as
named entities would be tagged (cf. the 50 % character correctness rule in Packer et al.
[
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]). If the orthography of a word followed 19th-century spelling rules but the
word was identifiable as a named entity, it would be tagged, too.
          </p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Results of the Evaluation</title>
        <p>
          We evaluated the performance of FiNER and SeCo's ARPA using the conlleval8 script
used in the Conference on Computational Natural Language Learning (CoNLL).
Evaluation is based on "exact-match evaluation" [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. In this type of evaluation the NER
system is evaluated using the micro-averaged F-measure (MAF), where precision
is the percentage of correct named entities found by the NER software, and recall is the
percentage of correct named entities present in the tagged evaluation corpus that are
found by the NER system. A named entity is considered correct only if it is an exact
match of the corresponding entity in the tagged evaluation corpus: "a result is
considered correct only if the boundaries and classification are exactly as annotated" [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
Thus the evaluation criteria are strict, especially for multipart entities.
8 http://www.cnts.ua.ac.be/conll2002/ner/bin/conlleval.txt, author Erik Tjong Kim Sang,
version 2004-01-26
        </p>
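        <p>The strict exact-match metric can be sketched as follows: a minimal Python illustration of what the conlleval-style scoring computes, with entities represented here as (start, end, type) triples:</p>

```python
def exact_match_prf(gold, predicted):
    """Micro-averaged precision, recall and F-score over exact matches.

    A predicted entity is correct only if both its boundaries and its
    type match an annotated entity exactly.
    """
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

# One exact match, one boundary mismatch, one type mismatch:
gold = [(0, 2, "EnamexPrsHum"), (5, 6, "EnamexLocPpl"), (8, 9, "EnamexOrgCrp")]
pred = [(0, 2, "EnamexPrsHum"), (5, 7, "EnamexLocPpl"), (8, 9, "EnamexLocPpl")]
p, r, f = exact_match_prf(gold, pred)
```

        <p>Under this criterion a prediction with the right type but wrong boundary, or the right boundary but wrong type, counts as both a false positive and a false negative, which is why multipart entities are penalized so heavily.</p>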
        <p>Detailed results of the evaluation of FiNER are shown in Table 1. Entities &lt;ent/&gt;
consist of one word token, &lt;ent&gt; are part of a multiword entity and &lt;/ent&gt; are last
parts of multiword entities.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Label P R F-score</title>
        <p>&lt;EnamexLocGpl/&gt; 6.96 9.41 8.00
&lt;EnamexLocPpl/&gt; 89.50 8.46 15.46
&lt;EnamexLocStr/&gt; 23.33 50.00 31.82
&lt;EnamexLocStr&gt; 100.00 13.83 24.30
&lt;/EnamexLocStr&gt; 100.00 18.31 30.95
&lt;EnamexOrgCrp/&gt; 2.39 6.62 3.52
&lt;EnamexOrgCrp&gt; 44.74 25.99 32.88
&lt;/EnamexOrgCrp&gt; 40.74 31.95 35.81
&lt;EnamexOrgEdu&gt; 48.28 40.00 43.75
&lt;/EnamexOrgEdu&gt; 55.17 64.00 59.26
&lt;EnamexPrsHum/&gt; 16.38 52.93 25.02
&lt;EnamexPrsHum&gt; 87.44 26.67 40.88
&lt;/EnamexPrsHum&gt; 82.88 31.62 45.78
&lt;TimexTmeDat/&gt; 5.45 14.75 7.96
&lt;TimexTmeDat&gt; 68.54 2.14 4.14
&lt;/TimexTmeDat&gt; 20.22 2.00 3.65</p>
        <p>Results of the evaluation show that named entities are not recognized very well,
which is not surprising, as the quality of the text data is quite low. Recognition of
multipart entities in particular is very low: part of an entity may be recognized while
the rest is not. Among multiword entities, corporations and educational organizations are
recognized best. Names of persons are the most frequent category. Recall of one-part
person names is the best, but their precision is low. Multipart person names have a more
balanced recall and precision, even if their overall recognition is not high.</p>
        <p>In a looser evaluation, any correct marking of an entity, regardless of its
boundaries, was considered a hit. The four location categories were joined into two:
general location &lt;EnamexLocXxx&gt; and street names. The end result was six categories
instead of eight. Table 2 shows the results of the loose evaluation. Recall and precision
of the most frequent categories, person and location, were now clearly higher, but still
not very good.</p>
        <p>Table 2 labels: &lt;EnamexPrsHum&gt;, &lt;EnamexLocXxx&gt;, &lt;EnamexLocStr&gt;,
&lt;EnamexOrgEdu&gt;, &lt;EnamexOrgCrp&gt;, &lt;TimexTmeDat&gt; (the P and R columns
are not recoverable from the extracted text).</p>
        <p>
          Our third evaluation was performed on a limited tag set with SeCo's
ARPA. First, only places were identified, so that a single location type, EnamexLocPpl, was
recognized. ARPA was configured for identifying place
names in the data. As a first iteration, only the Finnish Place Name Registry9 (PNR) was
used. After examining raw results from the test run, three issues were identified for
further improvement. First, PNR contains only modern Finnish place names. To
improve recall, three registries containing historical place names were added: 1) the
Finnish spatiotemporal ontology SAPO [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] containing names of historic
municipalities, 2) a repository of old Finnish maps and associated places from the 19th and early
20th century, and 3) a name registry of places inside historic Karelia, which do not
appear in PNR because the area was ceded by Finland to the Soviet Union at the end of the
Second World War [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. To account for international place names, the names were
also queried against the Geonames database10 as well as Wikidata11. The contributions
of each of these resources to the number of places identified in the final runs are
shown in Table 3. Note that a single place name could be, and often was, found in
several of these sources.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>Source</title>
        <p>Source and matches:
Karelian places: 461
Old maps: 685
Geonames: 1036
SAPO: 1467
Wikidata: 1877
PNR: 2232
9 http://www.ldf.fi/dataset/pnr/ 10 http://geonames.org/ 11 http://wikidata.org/</p>
        <sec id="sec-2-4-1">
          <p>Table 4 describes the results of location recognition with ARPA. With one
exception (New York), only one-word entities were discovered by the software.</p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>Label</title>
        <p>Table 4 labels: &lt;EnamexLocPpl/&gt;, &lt;/EnamexLocPpl&gt;, &lt;EnamexLocPpl&gt;
(the P and R columns are not recoverable from the extracted text).</p>
        <p>A second improvement to the ARPA process arose from the observation that while
recall in the first test run was high, precision was low. Analysis revealed this to be
due to many names being both person names and place names. Thus, a filtering step
was added, which removed 1) hits identified as person names by the morphological
analyzer and 2) hits that matched regular expressions catching common person-name
patterns found in the data (I. Lastname and FirstName LastName). However,
this was sometimes too aggressive, for example filtering out even big cities
such as Tampere and Helsinki. Thus, in the final configuration, this filtering was made
conditional on the size of the identified place, as stated in the structured data sources
matched against.</p>
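        <p>The filtering logic can be sketched as follows. This is a hedged Python illustration in which the regular expressions and the big-place whitelist are invented stand-ins; the real pipeline consults the morphological analyzer and place sizes from the structured data sources:</p>

```python
import re

# Invented stand-ins for the person-name patterns described above:
# "I. Lastname" and "FirstName LastName".
INITIAL_SURNAME = re.compile(r"^[A-ZÅÄÖ]\.\s+[A-ZÅÄÖ]\w+$")
FIRST_LAST = re.compile(r"^[A-ZÅÄÖ]\w+\s+[A-ZÅÄÖ]\w+$")

# Stand-in for the size condition: big places are exempt from filtering.
BIG_PLACES = {"Tampere", "Helsinki"}

def keep_place_hit(hit):
    """Return True if a candidate place hit survives person-name filtering."""
    if hit in BIG_PLACES:
        return True
    if INITIAL_SURNAME.match(hit) or FIRST_LAST.match(hit):
        return False
    return True
```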
        <p>Finally, as the amount of OCR errors in the target dataset was identified as a
major hurdle for accurate recognition, experiments were made with sacrificing
precision in favor of recall by enabling various levels of Levenshtein distance
matching against the place name registries. In this test, the fuzzy matching was done in the
query phase after lexical processing. This was easy to do, but doing the fuzzy
matching during lexical processing would probably be better, as currently lemma
guessing (which is needed because OCR errors are outside the lemmatizer's
vocabulary) is extremely sensitive to OCR errors, particularly in the suffix parts of words.</p>
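        <p>Levenshtein matching against a name registry can be sketched as follows; an illustrative Python example in which the registry entries and the distance threshold are invented, not taken from the actual ARPA configuration:</p>

```python
def levenshtein(a, b):
    """Edit distance by the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def fuzzy_lookup(token, registry, max_dist=2):
    """Return registry names within max_dist edits of an OCRed token."""
    return [name for name in registry if levenshtein(token, name) <= max_dist]
```

        <p>For example, with a registry of ["Helsinki", "Mikkeli", "Tampere"], the OCR-garbled token "Helsmki" still finds "Helsinki" at distance 2; raising max_dist trades precision for recall exactly as described above.</p>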
        <p>After the place recognition pipeline was finalized, a further test was done to see
whether the ARPA pipeline could also be used for person name recognition. Here, the
Virtual International Authority File was used as a lexicon of names, as it contains 33 million
names for 20 million people. In the first run, the query simply matched all uppercase
words against both first and last names in this database, while also allowing any
number of initials to precede the matched names. This way, the found names can
no longer always be linked to strong identifiers, but for a pure NER task,
recall is improved.</p>
        <p>Table 5 shows the results of this evaluation without fuzzy matching of names and
Table 6 with fuzzy matching.</p>
        <p>We have shown in this paper the first evaluation results of NER for historical Finnish
newspaper material from the 19th and early 20th century with two different tools,
FiNER and SeCo's ARPA. The word-level correctness of the digitized newspaper archive is
approximately 70–75 %; the evaluation corpus had a word-level correctness of about
73 %. Given this, and the fact that FiNER and ARPA were developed for modern
Finnish, the newspaper material makes a very difficult test for named entity
recognition. It is obvious that the main obstacle to high-quality NER in this material is the bad
quality of the text. Historical spelling variation also has some effect, but it should not
be as large.</p>
        <p>Evaluation results in this phase were not very good: the best F-scores
ranged from 30 to 60 in the basic evaluation, and were slightly better in the looser evaluation.
We have ongoing trials for improving the word quality of our material, which may
also yield better NER results. We made some unofficial tests with three versions of a
500,000-word text that is different from our NER evaluation material but
derives from the 19th-century newspapers as well. One version was manually
corrected OCR, another the old OCRed version, and the third a new OCRed version. Besides
character-level errors, word order errors have also been corrected in the two new versions.
For these texts we did not have a ground-truth tagged version, so we could only count
the markings of NER tags. With FiNER, the total number of tags increased from 23,918 to
26,674 (+11.5 % units) in the manually corrected text version. The number of tags
increased to 26,424 (+10.5 % units) in the new OCRed text version. The most notable
increase in the number of tags was in the categories EnamexLocStr and EnamexOrgEdu.
With ARPA the results were even slightly better. ARPA recognized 10,853 places in the
old OCR, 11,847 in the new OCR (+9.2 % units) and 13,080 (+20.5 % units) in the
ground-truth version of the text. There is an overall increase of about 10–20 % units in the
number of NER tags in both of the new, better-quality text versions in comparison to
the old OCRed text with both taggers.</p>
        <p>
          NER experiments with OCRed data in other languages usually show some
improvement in NER when the quality of the OCRed data has been improved from very
poor to somewhat better [
          <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
          ]. Results of Alex and Burns [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] imply that with
lower OCR quality (below 70 % correctness) name recognition is clearly harmed.
Packer et al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] report a partial correlation between the Word Error Rate of the text and the
achieved NER result; their experiments imply that word order errors are more
significant than character errors. On the other hand, results of Rodriquez et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] show
that manual correction of OCRed material that has 88–92 % word accuracy does not
significantly increase the performance of four different NER tools. As the word accuracy
of our material is low, it would be expected that somewhat better recognition results
would be achieved if the word accuracy were around 80–90 % instead of 70–75 %. Our
informal test with texts of different quality suggests this, too. Our material also has quite
a lot of word order errors, which may affect results.
        </p>
        <p>
          Another option for better recognition results is to use NER software that is more
sensitive to historical language. Such software may become available if the historically more
sensitive version of the morphological recognizer Omorfi can be merged with FiNER. A
third possibility is to train the statistical name tagger described by Silfverberg [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] with
labeled historical newspaper material.
        </p>
        <p>
          Other causes of poor performance probably lie in 19th-century Finnish
spelling variation and perhaps also in the different writing conventions of the era. It is
possible, for example, that the genre of 19th-century newspaper writing differs from
modern newspaper writing in some crucial respects. Considering that both FiNER and
ARPA are made for modern Finnish, our evaluation data is heavily outside their main
scope [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], even if ARPA uses historical Finnish aware Omorfi.
        </p>
        <p>
          In our case, extraction of names is primarily a tool for improving access to the Digi
collection. After getting the recognition rate of the NER tools to an acceptable level, we
need to decide how we are going to use the extracted names in Digi. Some exemplary
suggestions are provided by the archive of La Stampa and by Trove Names [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. La Stampa-style
usage of names provides informational filters after a basic search has been
conducted: users can further look for persons, locations and organizations mentioned in
the article results. This kind of approach enables browsing access to the collection and
possibly also entity linking [
          <xref ref-type="bibr" rid="ref20 ref21 ref22">20, 21, 22</xref>
          ]. Trove Names' name search takes the
opposite approach: you first search for names and then get the articles where the names
occur. We believe that the La Stampa style of using names in the GUI of the
newspaper collection is more informative and useful for users, as the Trove style can
already be obtained with the normal search function in the GUI of the newspaper
collection. Considering possible uses of NER in Digi, FiNER so far does only basic
identification and classification of names. ARPA is basically not NER software but a
semantic entity linking system, and thus of broader use. Our main emphasis with NER
will be on using the names in the newspaper collection as a means to improve
browsing and the general informational usability of the collection. Of course, good enough
coverage of the names with NER also needs to be achieved for this use. A good
balance of P/R should be found for this purpose [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], but other capabilities of the
software also need to be considered. These remain to be seen later, if we are able to
connect some type of functional NER to our historical newspaper collection.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgements</title>
      <p>The first author is funded by the EU Commission through its European Regional
Development Fund and the programme Leverage from the EU 2014–2020.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Nadeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sekine</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>A Survey of Named Entity Recognition and Classification</article-title>
          .
          <source>Linguisticae Investigationes</source>
          <volume>30</volume>
          (
          <issue>1</issue>
          ):
          <fpage>3</fpage>
          -
          <lpage>26</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kettunen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pääkkönen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Measuring Lexical Quality of a Historical Finnish Newspaper Collection - Analysis of Garbled OCR Data with Basic Language Technology Tools and Means</article-title>
          . Accepted for LREC 2016. http://lrec2016.lrec-conf.org/en/conferenceprogramme/accepted-papers/ (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bremer-Laamanen</surname>
            ,
            <given-names>M-L.</given-names>
          </string-name>
          :
          <article-title>In the Spotlight for Crowdsourcing</article-title>
          .
          <source>Scandinavian Library Quarterly</source>
          ,
          <volume>1</volume>
          ,
          <fpage>18</fpage>
          -
          <lpage>21</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kettunen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Honkela</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lindén</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kauppinen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pääkkönen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kervinen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods</article-title>
          .
          <source>In: Proceedings of IFLA 2014</source>
          , Lyon (
          <year>2014</year>
          ) http://www.ifla.org/files/assets/newspapers/Geneva_2014/s6-honkela-en.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Järvelin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keskustalo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sormunen</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saastamoinen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kettunen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Information Retrieval from Historical Newspaper Collections in Highly Inflectional Languages: A Query Expansion Approach</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          . doi: http://onlinelibrary.wiley.com/doi/10.1002/asi.23379/epdf (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kokkinakis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niemi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hardwick</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lindén</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>HFST-SweNER - a New NER Resource for Swedish</article-title>
          .
          <source>In: Proceedings of LREC</source>
          <year>2014</year>
          , http://www.lrec-conf.org/proceedings/lrec2014/pdf/391_Paper.pdf (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Crane</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The Challenge of Virginia Banks: An Evaluation of Named Entity Analysis in a 19th-Century Newspaper Collection</article-title>
          .
          <source>In: Proceedings of JCDL'06</source>
          , June 11-15,
          <year>2006</year>
          , Chapel Hill, North Carolina, USA. http://repository01.lib.tufts.edu:8080/fedora/get/tufts:PB.001.001.00007/Archival.pdf (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Neudecker</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilms</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faber</surname>
            ,
            <given-names>W. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Veen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Large-scale Refinement of Digital Historic Newspapers with Named Entity Recognition</article-title>
          . http://www.ifla.org/files/assets/newspapers/Geneva_2014/s6-neudecker_faber_wilmsen.pdf (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Mac Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cassidy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Finding Names in Trove: Named Entity Recognition for Australian Historical Newspapers</article-title>
          .
          <source>In: Proceedings of Australasian Language Technology Association Workshop</source>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>65</lpage>
          , https://aclweb.org/anthology/U/U15/U15-1007.pdf (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mäkelä</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Combining a REST Lexical Analysis Web Service with SPARQL for Mashup Semantic Annotation from Text</article-title>
          . In:
          <string-name>
            <surname>Presutti</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          et al. (eds.),
          <source>The Semantic Web: ESWC 2014 Satellite Events. Lecture Notes in Computer Science</source>
          , vol.
          <volume>8798</volume>
          , pp.
          <fpage>424</fpage>
          -
          <lpage>428</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lindén</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Axelson</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Drobac</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hardwick</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuokkala</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niemi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pirinen</surname>
            ,
            <given-names>T.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silfverberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>HFST - a System for Creating NLP Tools</article-title>
          . In:
          <string-name>
            <surname>Mahlow</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piotrowski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (eds.)
          <source>Systems and Frameworks for Computational Morphology. Third International Workshop, SFCM 2013, Berlin, Germany, September 6, 2013, Proceedings</source>
          , pp.
          <fpage>53</fpage>
          -
          <lpage>71</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Silfverberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Reverse Engineering a Rule-Based Finnish Named Entity Recognizer</article-title>
          . https://kitwiki.csc.fi/twiki/pub/FinCLARIN/KielipankkiEventNERWorkshop2015/Silfverberg_presentation.pdf (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Packer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lutes</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stewart</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Embley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ringger</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seppi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jensen</surname>
            ,
            <given-names>L. S.</given-names>
          </string-name>
          :
          <article-title>Extracting Person Names from Diverse and Noisy OCR Text</article-title>
          .
          <source>In: Proceedings of the fourth workshop on Analytics for noisy unstructured text data</source>
          . Toronto, ON, Canada: ACM. (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Marrero</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Urbano</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sánchez-Cuadrado</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morato</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Berbís</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          :
          <article-title>Named Entity Recognition: Fallacies, challenges and opportunities</article-title>
          .
          <source>Computer Standards &amp; Interfaces</source>
          <volume>35</volume>
          ,
          <fpage>482</fpage>
          -
          <lpage>489</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Rodrigues</surname>
            ,
            <given-names>K.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bryant</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blanke</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luszczynska</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Comparison of Named Entity Recognition Tools for raw OCR text</article-title>
          .
          <source>In: Proceedings of KONVENS</source>
          <year>2012</year>
          (
          <article-title>LThist 2012 workshop</article-title>
          ),
          <source>Vienna September 21</source>
          , pp.
          <fpage>410</fpage>
          -
          <lpage>414</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Alex</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burns</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Estimating and Rating the Quality of Optically Character Recognised Text</article-title>
          .
          <source>In: DATeCH '14 Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage</source>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>102</lpage>
          . http://dl.acm.org/citation.cfm?id=2595214 (
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Poibeau</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kosseim</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Proper Name Extraction from Non-Journalistic Texts</article-title>
          .
          <source>Language and Computers</source>
          ,
          <volume>37</volume>
          , pp.
          <fpage>144</fpage>
          -
          <lpage>157</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Hyvönen</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tuominen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kauppinen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Väätäinen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Representing and Utilizing Changing Historical Places as an Ontology Time Series</article-title>
          . In:
          <string-name>
            <surname>Ashish</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sheth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (eds.)
          <source>Geospatial Semantics and the Semantic Web: Foundations, Algorithms, and Applications</source>
          , Springer-Verlag (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Ikkala</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tuominen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hyvönen</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Contextualizing Historical Places in a Gazetteer by Using Historical Maps and Linked Data</article-title>
          .
          <source>In: Proceedings of Digital Humanities</source>
          <year>2016</year>
          , short papers, Kraków, Poland (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Bates</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>What is Browsing - really? A Model Drawing from Behavioural Science Research</article-title>
          . <source>Information Research</source>, <volume>12</volume>. http://www.informationr.net/ir/12-4/paper330.html (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Toms</surname>
            ,
            <given-names>E.G.</given-names>
          </string-name>
          :
          <article-title>Understanding and Facilitating the Browsing of Electronic Text</article-title>
          .
          <source>International Journal of Human-Computer Studies</source>
          ,
          <volume>52</volume>
          (
          <issue>3</issue>
          ),
          <fpage>423</fpage>
          -
          <lpage>452</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>McNamee</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mayfield</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piatko</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>Processing Named Entities in Text</article-title>
          .
          <source>Johns Hopkins APL Technical Digest</source>
          ,
          <volume>30</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>31</fpage>
          -
          <lpage>40</lpage>
          . (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>