<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Traveling through Space and Time, or: Making Historical Travelogues Accessible.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>n Rör</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rainer Simon</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sven Schlarb</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AIT Austrian Institute of Technology</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Investigating perceptions of Otherness is the overarching goal of the Travelogues project. It studies a corpus comprising thousands of recently digitized travelogues dating back to the 16th century and held by the Austrian National Library. Driven by an interdisciplinary team of historians and data scientists, it aims to make knowledge that is currently hidden in a huge text corpus accessible to researchers. In the current, initial project phase, we explore how statistical methods, such as word embeddings, can be used to assess the structure and semantics of large text corpora in order to make those resources accessible. We developed an initial methodology that combines visual and statistical cues to identify possible starting points for a more fine-grained exploration of the corpus. Ultimately, this data-driven approach is expected to yield new and possibly unexpected insights from resources that were previously de facto inaccessible.</p>
      </abstract>
      <kwd-group>
        <kwd>Digital Humanities</kwd>
        <kwd>Machine learning</kwd>
        <kwd>Information extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Close reading (the careful and exhaustive interpretation of a passage of text) of
source material is a key methodology and daily routine for historians. However,
this methodology reaches its natural limits when applied to large digitized
book holdings such as Austrian Books Online<fn id="fn1"><label>1</label><p>https://www.onb.ac.at/digitale-bibliothek-kataloge/austrian-books-online-abo/</p></fn>, which comprises more than
400,000 volumes, with titles ranging from the early 16th century to the second
half of the 19th century.</p>
      <p>The Travelogues<fn id="fn2"><label>2</label><p>http://www.travelogues-project.info</p></fn> project focuses on historical travel reports published in German
between the 16th and the late 19th century, and works towards
developing a toolbox that helps to explore source materials and understand their
semantics. Aiming for qualitative insights into historical sources, we describe a
set of methods, and ways of combining them, that can help domain experts (in our case,
historians) find answers to research questions based on source material
hidden in huge text corpora.</p>
      <p>In addition to applying well-known information retrieval methods, we also
aim to present results in ways that are easily accessible to
non-experts. Uncovering previously hidden knowledge is the priority of
this project.</p>
    </sec>
    <sec id="sec-2">
      <title>The Dataset</title>
      <p>
        The corpus in question consists of more than 150,000 recently digitized works
in German (out of a total of 419,000 for the given time frame), all of which
are part of the Austrian National Library’s holdings and were published
between 1500 and 1875.<sup>3</sup> Of those, an estimated 1,000 to 2,000 books can be
classified as travel reports, either in their entirety or because they contain chapters describing
travels. Travel reports are an important source genre in historical research [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
and serial analysis has been suggested as a way to analyze them [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>The corpus has several inherent and challenging properties. One is its
linguistic variety: the works span nearly four centuries of written German,
most of it without standardized orthography. Further problems stem from
the digitization process, such as OCR errors (especially in texts printed in
Fraktur) and the loss of layout information. For example,
individual characters can appear unclean due to old
printing techniques, letters from the reverse side of a page can be visible because of thin
paper (bleed-through), and pages might be damaged by storage
conditions (e.g. humidity-induced page warping) or by catastrophic events (e.g.
fire). On top of that, correctly classifying works as travel reports is an open problem,
partly because the metadata in the existing
catalog, which was in parts created in the 19th century, is incomplete or incorrect.</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>
        As a first step, we use the word2vec algorithm [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] and visualize the resulting embeddings
with t-SNE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>To develop our methodology, we worked on a small subset of the corpus. This
subset consists of 100 books published between 1800 and 1875 and selected by
a simple criterion: the title had to contain the German word for travel, ‘Reise’, in one
of four different spellings (to cover language variations: ‘reis*’, ‘reyß*’, ‘reiß*’,
‘reys*’). The result was a mid-sized corpus of just over 10 million words.</p>
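      <p>The title-based selection described above could be sketched as follows. This is a minimal illustration rather than the project's actual selection code: the function names, the (title, year) record layout, and the reading of the trailing asterisk as "any ending, matched at the start of a word" are assumptions.</p>
      <preformat preformat-type="code">
```python
import re

# The four spelling stems quoted in the text.
TRAVEL_STEMS = ("reis", "reyß", "reiß", "reys")

# Assumption: a stem must begin a word and may be followed by any word characters.
TRAVEL_PATTERN = re.compile(
    r"\b(?:" + "|".join(TRAVEL_STEMS) + r")\w*", re.IGNORECASE
)


def is_travel_title(title):
    """Return True if the title contains one of the travel stems."""
    return TRAVEL_PATTERN.search(title) is not None


def select_subset(records, first_year=1800, last_year=1875):
    """Keep (title, year) pairs inside the period whose title matches a stem.

    The record layout is hypothetical; a real catalog export will differ.
    """
    return [
        (title, year)
        for title, year in records
        if last_year >= year >= first_year and is_travel_title(title)
    ]
```
      </preformat>
      <p>Under this reading, compounds such as "Weltreise" would not be matched, while homographs such as "Reis" (rice) would be, so a real catalog filter would need to be checked against such cases.</p>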
      <p>We kept the preprocessing of the source material to a minimum, to preserve
as much of the original structure and wording of the text as possible. Those
aspects were deemed important for further language analysis.</p>
      <p>
        We then applied word2vec<fn id="fn4"><label>4</label><p>The gensim [<xref ref-type="bibr" rid="ref6">6</xref>] implementation was used.</p></fn> to extract semantic relationships contained in
the test corpus, and plotted word clusters using t-SNE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
        <fn id="fn3"><label>3</label><p>Letterpress printing was widely available from the early 16th century, and copyright issues might arise for works published after 1875, hence this time frame.</p></fn>
      </p>
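      <p>The embedding and plotting step could look roughly like the sketch below. The text states only that the gensim implementation of word2vec was used; the tokenizer, all parameter values, the gensim 4 argument names, and the use of scikit-learn for t-SNE are assumptions made for illustration.</p>
      <preformat preformat-type="code">
```python
import re


def tokenize(text):
    """Split raw text into lowercase word tokens without normalizing spelling."""
    return re.findall(r"\w+", text.lower())


def train_embeddings(sentences, dim=100):
    """Train word2vec on a list of token lists (parameter values are assumptions)."""
    # Imported lazily so that tokenize() works without the optional dependency.
    from gensim.models import Word2Vec
    return Word2Vec(sentences=sentences, vector_size=dim, window=5,
                    min_count=5, workers=1, seed=0)


def project_2d(model, top_n=500):
    """Map the top_n most frequent words to 2-D t-SNE coordinates."""
    import numpy as np
    from sklearn.manifold import TSNE
    words = model.wv.index_to_key[:top_n]  # gensim sorts by descending frequency
    vectors = np.array([model.wv[w] for w in words])
    # scikit-learn requires the perplexity to be smaller than the sample count.
    perplexity = min(30.0, len(words) - 1.0)
    coords = TSNE(n_components=2, perplexity=perplexity,
                  random_state=0).fit_transform(vectors)
    return dict(zip(words, coords))
```
      </preformat>
      <p>Given a list of sentence strings, a call such as project_2d(train_embeddings([tokenize(s) for s in sentences])) would return one coordinate pair per word, which can then be passed to any plotting library. Sentence splitting is left out of the sketch.</p>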
    </sec>
    <sec id="sec-4">
      <title>Preliminary results</title>
      <p>The first results and observations are promising, as we were quickly able to get
an intuition of the main topics of the source material. Plotting the 500 most
frequent words immediately drew attention to thematic clusters, such as
settlements (different types of buildings, from huts to fortifications to cathedrals),
nobility (rulers, tribes etc.), food (types of meat, methods of preparation),
seasons, and also abstract topics like emotions. By plotting the words semantically closest
to ‘danger’ (see figure 1), we can draw a picture of the dangers
associated with traveling, as far as our subset is concerned. These include seafaring,
bandits and bad roads; illness; but also feelings like fear, anger and sadness.
Interestingly, many adjectives were also included, hinting that this topic is
especially connected to vivid descriptions and emotions. Other clusters, centered on
food or cities, offer insights into contemporary descriptions.</p>
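      <p>"Semantically closest" is commonly operationalized as cosine similarity between embedding vectors. The sketch below ranks neighbors of a query word in this way. The three-dimensional vectors are toy values invented for illustration and are not taken from the trained model; the function names are likewise hypothetical.</p>
      <preformat preformat-type="code">
```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_words(vectors, query, top_n=3):
    """Return the top_n (word, similarity) pairs most similar to the query word."""
    target = vectors[query]
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy vectors, invented for illustration only.
TOY = {
    "gefahr": [0.9, 0.1, 0.0],
    "räuber": [0.8, 0.2, 0.1],
    "sturm": [0.7, 0.0, 0.3],
    "brot": [0.0, 0.9, 0.1],
}
```
      </preformat>
      <p>On a trained gensim model, model.wv.most_similar(word, topn=n) returns a comparable ranking without any hand-written code.</p>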
    </sec>
    <sec id="sec-5">
      <title>Conclusion and further work</title>
      <p>We have learned that text-exploration techniques like word embeddings can be
applied to a largely unexplored text corpus in a domain made challenging
by linguistic variety and OCR errors. In the long run, we plan to update
the library catalog with additional and corrected metadata. As a next
step, three historians picked a variety of representative travelogues published
between 1700 and 1875. They consist of just over 2.5 million words, and we will use
them as the starting point for creating a classifier to identify additional,
non-obvious travelogues in the corpus.</p>
      <p>We also plan to introduce additional approaches that can be used for
information extraction, such as topic models and tf-idf. Furthermore, the ground
truth will be manually annotated using the software platform Recogito<sup>5</sup> to mark
certain semantic structures. The goal of this approach is to identify the
semantic concepts that describe notions of Otherness and their evolution through
time.</p>
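      <p>As an illustration of the tf-idf weighting mentioned above, the following sketch computes one common variant (relative term frequency multiplied by the logarithm of the inverse document frequency). Other weighting variants exist, and the function name and input layout are assumptions rather than part of the planned pipeline.</p>
      <preformat preformat-type="code">
```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute tf-idf weights for a list of token lists.

    Returns one {term: weight} dict per document, where
    weight = (count / document length) * log(number of documents / document frequency).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```
      </preformat>
      <p>Terms that occur in every document receive a weight of zero under this variant, so words shared by all travelogues drop out and terms specific to individual works stand out.</p>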
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work is supported by the Austrian FWF as project I 3795 and the German
DFG as project 398697847.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Harbsmeier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Reisebeschreibungen als mentalitätsgeschichtliche Quellen: Überlegungen zu einer historisch-anthropologischen Untersuchung frühneuzeitlicher deutscher Reisebeschreibungen</article-title>
          . In: Mączak, A., Teuteberg, H.J. (eds.)
          <source>Reiseberichte als Quellen europäischer Kulturgeschichte</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>31</lpage>
          (
          <year>1982</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>van der Maaten</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Visualizing data using t-SNE</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          (Nov),
          <fpage>2579</fpage>
          -
          <lpage>2605</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Mączak</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teuteberg</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          :
          <source>Reiseberichte als Quellen europäischer Kulturgeschichte: Aufgaben und Möglichkeiten der historischen Reiseforschung. Vorträge gehalten anlässlich des 9. Wolfenbütteler Symposions vom 22. bis 25. Juni 1981 in der Herzog August Bibliothek</source>
          .
          <publisher-name>Herzog August Bibliothek</publisher-name>
          (
          <year>1982</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          In:
          <source>Advances in Neural Information Processing Systems</source>
          . pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Řehůřek</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Software Framework for Topic Modelling with Large Corpora</article-title>
          .
          In:
          <source>Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</source>
          . pp.
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          . ELRA, Valletta, Malta (May
          <year>2010</year>
          ), http://is.muni.cz/publication/884893/en
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>