<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Research and development efforts on the digitized his- torical newspaper and journal collection of The National Library of Finland</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>The National Library of Finland</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>DH projects Saimaankatu</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mikkeli</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Finland firstname.lastname@helsinki.fi</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12.8 million pages mainly in Finnish and Swedish. Out of these about 7.36 million pages are freely available on the web site digi.kansalliskirjasto.fi (Digi). The copyright restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929. This paper presents work that has been carried out in the NLF related to the historical newspaper and journal collection. We offer an overall account of research and development related to the data.</p>
      </abstract>
      <kwd-group>
        <kwd>historical newspaper collections</kwd>
        <kwd>research and development</kwd>
        <kwd>Finnish</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Digitization of historical newspapers and journals printed in Finland began at the end
of the 1990s in The National Library of Finland. Besides producing the digitized
publication data all the time NLF has been involved in research and improvement of
the digitized material during the last years. Last summer we ended a two year
European Regional Development Fund (ERDF) project and started another two year ERDF
project in August 2017. NLF is also involved in research consortium COMHIS,
Computational History and the Transformation of Public Discourse in Finland,
16401910, that is funded by the Academy of Finland (2016–2019). COMHIS utilizes
NLF’s historical newspaper and journal data as one of its main sources in its research
of changes of publicity in Finland.</p>
      <p>
        So far we have focused on text material, and our main achievements have been the
following: a thorough quality analysis of the Finnish part of the textual data on word
level [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], evaluation of tools for named entity recognition and setting up of a NER
evaluation collection [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and an improved Optical Character Recognition framework
for the data using Tesseract open source OCR engine [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In March 2017 we
published the newspaper and journal text data as an open data collection on our web
pages [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We have also started work on article extraction from the pages of one
newspaper – Uusi Suometar: 1869 –1918, 86 068 pages.
      </p>
      <p>Newspapers and journals contain also lots of pictures: drawings, maps,
photographs and different graphs which are of interest to both researchers and lay persons.
So far we can detect pictures on the pages of digi.kansalliskirjasto.fi, but a systematic
way of both detecting the pictures and classifying their content needs to be developed.
This is one of our aims during the two year ERDF funding period.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Research and development</title>
      <p>We shall describe our efforts with the newspaper and journal texts in general for the
rest of the paper. Those who are interested in more details can read the published
research papers for each topic.
2.1</p>
      <sec id="sec-2-1">
        <title>Quality analysis and a new OCR process</title>
        <p>
          When originally non-digital materials, such as old newspapers and books, are
digitized, the process starts with scanning of the documents which results in image files.
Out of the image files one needs to sort out texts and possible non-textual data, such
as photographs and other pictorial representations. Texts are recognized from the
scanned pages with Optical Character Recognition (OCR) software. OCRing for
modern text types and fonts is considered as a resolved problem, that yields high
quality results, but results of historical document OCRing are still far from that. Scanned
and OCRed document collections have usually a varying amount of errors in their
content. Tanner et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], for example, report that 78% of the words in the collection
of The 19th Century Newspaper Project of the British Library are correct. This quality
is not good, but quite common to many comparable collections [6].
        </p>
        <p>
          To be able to improve the data of our collections we needed first to get an overall
impression of its quality. We have performed a comprehensive general analysis for
the Finnish data of 1771–1910 [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Out of the assessment we have learned that about
70–75% of the words in the Finnish data are probably right and recognizable. In a
collection of about 2.4 billion words this means that about 600–800 million word
tokens and about 200 million word types are wrong.
        </p>
        <p>
          In order to improve the quality of the collection, we started to consider re-OCRing
of the data in 2015. The main reason for this was that the collection had been OCRed
with a proprietary OCR engine, ABBYY FineReader (v.7 and v.8). Newer versions of
the software exist, the latest being 14.0, but the cost of the Fraktur font for OCR is too
high a burden for re-OCRing the collection with ABBYY FineReader. We ended up
using open source OCR engine Tesseract v. 3.04.011 and started to train Fraktur font
for it. This process and its results are described in detail in Koistinen et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. We
have an evaluation collection for the re-OCR, and evaluation results show that
Tes1 https://github.com/tesseract-ocr
seract OCR words are recognized by morphological analyzer 9% units better than
words of present OCR (90% vs. 81 % in the evaluation collection).
        </p>
        <p>We have performed a small scale re-OCR with our real data. With 7 937 pages of
Uusi Suometar 1869–1879 morphological analyzer recognized 73% of ca. 13.45 M
words of old OCR. After re-OCR 86.5% of the words were recognized, which is a
13.5% unit improvement in recognition. The data used represents a realistic printed
newspaper of its time with a varying column layout.
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Named entity recognition</title>
        <p>Digital collections of the NLF are part of the growing global network of digitized
newspapers and journals, and historical newspapers are considered more and more as
an important source of historical knowledge. As the amount of digitized journalistic
information grows, also tools for harvesting the information are needed. Named
Entity Recognition (NER) has become one of the basic techniques for information
extraction of texts since the mid-1990s [7]. In its initial form NER was used to find and
mark semantic entities like person, location and organization in texts to enable
information extraction related to these kinds of entities. Later on other types of extractable
entities, like time, artefact, event and measure/numerical, have been added to the
repertoires of NER software [7].</p>
        <p>
          We started by performing preliminary NER analysis of our data with a variety of
tools in an evaluation collection [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Results of the evaluation showed that bad OCR
quality harms NER clearly, and also the tools that we had at use were not the best
possible, as they were analyzers of modern Finnish. Anyhow, we were able to achieve
F-score of about 0.5–0.6 at best with persons and locations. Organizations were not
recognized as well.
        </p>
        <p>After the initial evaluation we have set up a larger evaluation and training
collection. We have been able to train a standard statistical NER package, Stanford NER,
with our data. Our initial results with the Stanford NER were mainly promising:
persons achieved F-score of 0.69, locations F-score of 0.74, and organizations F-score of
0.36 in the re-OCRed evaluation collection. With more training data – final word
account is 381 356 words – the results have improved: locations achieve now F-score
of 0.79, persons 0.72 and organizations 0.42. Results for locations and persons can be
considered quite good and the resulting NER model will be used in our web collection
to improve access to the content.
2.3</p>
      </sec>
      <sec id="sec-2-3">
        <title>Article extraction</title>
        <p>It is common that historical newspaper collections are digitized on page level: pages
of the physical newspapers are scanned and the page images serve also as the basic
browsing and searching unit of the collection. Searches to the collection are
performed as well as results are shown to the user on page level. Page, however, is not
the basic informational unit of a newspaper, but pages consist of articles, although
length of articles can be quite variable. Thus separation of the article structure of the
digitized newspaper pages is an important step to improve usability of the digital
collections. As Dengel and Shafait [8] formulate it: “availability of logical structure
facilitates navigation and advanced search inside the document as well as enables better
presentation of the document in a possibly restructured format.” Article structure will
also enhance further analysis stages of the content, such as topic modelling or any
other kind of content analysis. Information retrieval performance of the newspaper
collection should also improve, if its content were indexed on article level [9].
Several digitized historical newspaper collections have implemented article extraction on
their pages. Good examples are for example Italian La Stampa, The British
Newspaper Archive, and Australian Trove.</p>
        <p>Article extraction on digitized newspaper pages is not an easy task. Results of the
biannual ICDAR competition on historical newspaper layout analysis show that
current algorithms segment and label about 80–85% of the pages correctly at best [10].</p>
        <p>We have so far investigated different possibilities for article extraction. Available
commercial solutions are either too expensive, not good enough or need too much
post correction for the articles. We have been evaluating one research software,
PIVAJ, that has been developed at the LITIS laboratory of University of Rouen [11].
We intend to use the software for article extraction</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgements</title>
      <p>This work is funded by the European Regional Development Fund and the program
Leverage from the EU 2014–2020.
6. Jarlbrink, J., Snickars, P.: Cultural heritage as digital noise: nineteenth century newspapers
in the digital archive. Journal of Documentation, https://doi.org/10.1108/JD-09-2016-0106
(2017).
7. Nadeau, D., Sekine, S.: A Survey of Named Entity Recognition and Classification.
Linguisticae Investigationes 30, 3–26 (2007).
8. Dengel, A., Shafait, F.: Analysis of the Logical Layout of Documents. In Doerman, D.,
Tombre, K. (eds.) Handbook of Document Image Processing and Recognition, 177–222.</p>
      <p>Springer. DOI 10.1007/978-0-85729-859-1 (2014).
9. Buhr, F., Neumann, B.: Evaluation of Retrieval Performance in Historical Newspaper
Archives comparing Page-level and Article-level Granularity.
https://kogswww.informatik.uni-hamburg.de/publikationen/pub-buhr/Newspaper-Retrieval.pdf (2014).
10. Antonacopoulos, A., Clausner, C., Papadopoulos C., Pletshacher, S.: ICDAR2013
Competition on Historical Newspaper Layout Analysis – HNLA2013. DOI:
10.1109/ICDAR.2013.293 (2013).
11. Hebert, D., Palfray, T., Nicolas, T., Tranouez, P., Paquet, T.: Automatic article extraction
in old Newspapers Digitized Collections. In Proceeding DATeCH '14 Proceedings of the
First International Conference on Digital Access to Textual Cultural Heritage, 3–8.
http://dl.acm.org/citation.cfm?id=2595195 (2014).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Kettunen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pääkkönen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Measuring Lexical Quality of a Historical Finnish Newspaper Collection - Analysis of Garbled OCR Data with Basic Language Technology Tools and Means</article-title>
          . In: Calzolari,
          <string-name>
            <surname>N</surname>
          </string-name>
          et al. (eds.),
          <source>Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC</source>
          <year>2016</year>
          ) http://www.lrecconf.org/proceedings/lrec2016/pdf/17_Paper.pdf (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kettunen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mäkelä</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruokolainen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuokkala</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Löfberg</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Old Content and Modern Tools - Searching Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910</article-title>
          .
          <article-title>Digital Humanities Quarterly</article-title>
          . http://www.digitalhumanities.org/dhq/vol/11/3/000333/000333.html (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Koistinen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kettunen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pääkkönen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur &amp; Antiqua Models and Image Preprocessing</article-title>
          .
          <source>In: Proceedings of Nodalida</source>
          <year>2017</year>
          http://www.ep.liu.se/ecp/131/038/ecp17131038.pdf (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Pääkkönen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kervinen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nivala</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kettunen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mäkelä</surname>
            ,
            <given-names>E.: Exporting</given-names>
          </string-name>
          <string-name>
            <surname>Finnish Digitized Historical Newspaper Contents for Offline Use. D-Lib</surname>
            <given-names>Magazine</given-names>
          </string-name>
          , July/August. http://www.dlib.org/dlib/july16/paakkonen/07paakkonen.html (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Tanner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muñoz</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ros</surname>
            ,
            <given-names>P.H.</given-names>
          </string-name>
          :
          <article-title>Measuring Mass Text Digitization Quality and Usefulness. Lessons Learned from Assessing the OCR Accuracy of the British Library's 19th Century Online Newspaper Archive</article-title>
          .
          <string-name>
            <surname>D-Lib</surname>
            <given-names>Magazine</given-names>
          </string-name>
          , (
          <volume>15</volume>
          /8) http://www.dlib.org/dlib/july09/munoz/07munoz.html (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>