<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Challenges of German Archival Document Categorization on Insufficient Labeled Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabian Hoppe</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tabea Tietz</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danilo Dessì</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nils Meyer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mirjam Sprau</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mehwish Alam</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Sack</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Baden-Wurttemberg State Archives</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Deutsche Digitale Bibliothek</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>FIZ Karlsruhe – Leibniz Institute for Information Infrastructure</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>German Federal Archives</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Karlsruhe Institute of Technology, Institute AIFB</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>15</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>Document exploration in archives is often challenging because the documents are not organized in topic-based categories. Moreover, archival records provide only short texts, which are often insufficient for capturing their semantics. This paper proposes and explores a dataless categorization approach that utilizes word embeddings and TF-IDF to categorize archival documents. Additionally, it introduces a visual approach built on top of the word embeddings to enhance the exploration of data. Preliminary results suggest that current vector representations alone do not provide enough external knowledge to solve this task.</p>
      </abstract>
      <kwd-group>
        <kwd>Dataless Categorization</kwd>
        <kwd>Text Categorization</kwd>
        <kwd>Document Exploration</kwd>
        <kwd>Cultural Heritage</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Exploring cultural heritage data is inevitably connected to searching through
historical records in archives. This task is complicated by the huge number
of documents stored in hierarchical file systems. In fact, the retrieval of
relevant information usually requires significant human effort. Hence, the demand
for topic-based categorization and more diverse exploration methods, such as
visual exploration, increases with the growing number of electronically available
archival records. Despite the ongoing trend of prominent digitization campaigns,
the majority of archival objects are so far neither available in digital form nor
transcribed. As a result, their content is often inaccessible, and only descriptive
metadata can be used to categorize and organize them. For most of the
electronically available archival documents, merely a title has been registered, possibly
together with an identifier and archive file system information. As a consequence,
numerical vector representations required by modern categorization systems face
several challenges when applied to archival data:
        <list list-type="bullet">
          <list-item><p>Document titles provide only short texts, which are insufficient to capture
semantics because of the limited amount of data and natural language ambiguity.</p></list-item>
          <list-item><p>Archival objects are organized in a hierarchical file system, which is ignored
by current representations.</p></list-item>
          <list-item><p>Document understanding requires extensive world knowledge, such as historical
context information.</p></list-item>
          <list-item><p>The annotation of data is hindered by disagreements among experts about the
detailed interpretation of data and by the fine-grained, domain-specific
information need. Consequently, only insufficient training data is available.</p></list-item>
        </list>
      </p>
      <p>This paper addresses these shortcomings by exploring the use of word
embeddings and TF-IDF as a way to introduce external knowledge and to interpret
the semantics of document titles and category labels within a dataless
categorization approach on German archive data. Furthermore, the paper introduces a
visualization that provides graphical exploration of the archive when the
categorization approach is infeasible. The contribution of this paper is threefold:
        <list list-type="bullet">
          <list-item><p>We propose a dataless categorization approach for German archive data
based on vector representations.</p></list-item>
          <list-item><p>We include the archive structure in our vector representations to improve
categorization results.</p></list-item>
          <list-item><p>We provide a visual exploratory tool that can be used to retrieve documents
and support their annotation.</p></list-item>
        </list>
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        In the Digital Humanities community, the use of semantic technologies has
attracted a fair amount of interest as a way to make the retrieval and exploration
of digitized archives easier. Nevertheless, the use of semantic representations in
this field has not been fully investigated yet. In fact, previous works were often
based on supervised classification, e.g., [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] used various classifiers to
support historians in their work, or on topic modeling methods such as [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
where the authors categorized a collection of 24,787 archive documents with
100 topics. However, these methods usually rely on completely digitized
documents in which each document provides a large amount of text. For short
texts these approaches fail, which makes them infeasible for the metadata
of archival resources. The sparsity problem of short text categorization is
addressed by current deep learning approaches. For example, to learn the context
within short texts, neural network-based systems such as Convolutional Neural
Networks (CNN) were proposed [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. However, such approaches usually
need a lot of labeled data to train models and often deal with a small number of
fixed classes. Recent methods deal with insufficient training data by considering
external knowledge about the categories, e.g., category labels are used to
infer their semantics. In particular, these approaches exploit a vector space model
in which both texts and categories are represented and compared by similarity
measures [
        <xref ref-type="bibr" rid="ref1 ref7">1,7</xref>
        ]. One model, KBSTC [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], showed that this dataless categorization
approach provides reasonably good results for short texts by using entity
embeddings. However, these methods have been applied to English datasets; in contrast,
the archive documents considered in this study are in German. Inspired by these
methodologies, we investigate general-purpose embeddings for archival
objects. This analysis considers the unique structure of German archival
documents and a higher number of classes than other datasets used for
evaluating dataless approaches.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Dataset of Archival Holdings</title>
      <p>
        Our dataset was collected by two digitization projects of the German Federal
Archives and the Baden-Württemberg State Archives on the so-called Weimar
Republic, the first German democracy [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The ongoing decade of anniversaries
related to the Weimar Republic increases the current demand from historical
researchers and the general public to find historical documents of the time. Over
the last few years, the archives have selected a large number of relevant archival
holdings from ministries, public institutions, corporate bodies and particular
individuals from this period to be digitized and described. They cover aspects
of politics, economy, society and everyday life in Germany from 1918 to 1933.
The collection is composed of 21,042 documents and 799 categories defined by
domain experts. Only 9% (2,011 documents) are manually annotated, with 59%
of all categories occurring at least once. The titles are on average 7 words long,
which does not provide enough contextual information for capturing the
semantics of a document. The documents are organized in a file system, e.g., the
document "public welfare for the poor" is part of the document "welfare for war
victims and survivors", which is in turn part of the document "supply affairs",
etc. On average, one document is part of 4 higher-level documents.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Methodology</title>
      <p>This section details our approach to support the categorization of archival data.</p>
      <sec id="sec-4-1">
        <title>Word Embeddings Generation</title>
        <p>
          Because the data consists of German text, new word embedding models were
trained using Skip-gram Word2Vec [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and FastText [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] on a dump of all German
Wikipedia articles<sup>7</sup>. The reference and link sections of the Wikipedia articles were
removed. The word embeddings were trained with 300 dimensions, a window of
5 words and 10 negative samples.
        </p>
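        <p>As a rough illustration, the stated hyperparameters could be passed to an off-the-shelf trainer as sketched below. The paper does not name the implementation that was used, so the gensim library (4.x keyword names), the helper <monospace>train_embeddings</monospace>, and all settings not mentioned in the text (which stay at library defaults here) are assumptions.</p>
        <preformat><![CDATA[
```python
# Hyperparameters stated in the text: 300 dimensions, a window of 5 words,
# 10 negative samples, Skip-gram. Keyword names follow gensim 4.x, which is
# an assumed library; the paper does not name its implementation.
EMBEDDING_PARAMS = {
    "vector_size": 300,  # embedding dimensionality
    "window": 5,         # context window in words
    "negative": 10,      # negative samples per positive example
    "sg": 1,             # 1 selects Skip-gram rather than CBOW
}


def train_embeddings(sentences, use_fasttext=False):
    """Train on an iterable of token lists, e.g. tokenized Wikipedia sentences."""
    # Imported lazily so the parameter dictionary can be inspected without gensim.
    from gensim.models import FastText, Word2Vec

    model_class = FastText if use_fasttext else Word2Vec
    return model_class(sentences=sentences, **EMBEDDING_PARAMS)
```
]]></preformat>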
      </sec>
      <sec id="sec-4-2">
        <title>Dataless Categorization Approach</title>
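        <p>Our approach is illustrated in Fig. 1. The input is a document title <inline-formula><tex-math>d_j</tex-math></inline-formula> and a set
of categories <inline-formula><tex-math>C</tex-math></inline-formula>. The approach is subdivided into the following four modules.</p>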
        <sec id="sec-4-2-1">
          <p><fn id="fn7"><label>7</label><p>https://dumps.wikimedia.org/dewiki/, retrieved on 20.11.2019</p></fn></p>
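          <p>Context Selector. This module extracts the context of <inline-formula><tex-math>d_j</tex-math></inline-formula> as the sequence of its
ancestors, starting with the parent document <inline-formula><tex-math>d_p</tex-math></inline-formula> and traversing up the archive
file system by recursion, i.e., <inline-formula><tex-math>k_{d_j} = (d_p, k_{d_p}) = (d_{p_1}, \ldots, d_{p_m})</tex-math></inline-formula>.</p>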
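          <p>The recursion can be sketched in a few lines. This is a minimal illustration only: the dictionary <monospace>parent_of</monospace>, which maps a document to its parent, and the function name are hypothetical, and the example titles are the ones used in the dataset description.</p>
          <preformat><![CDATA[
```python
def context_sequence(doc_id, parent_of):
    """Return the ancestors of doc_id, nearest parent first: (d_p1, ..., d_pm)."""
    parent = parent_of.get(doc_id)
    if parent is None:  # reached the root of the archive tree
        return ()
    # k_dj = (d_p, k_dp): the parent followed by the parent's own context
    return (parent,) + context_sequence(parent, parent_of)


# Hypothetical excerpt of the hierarchy (assumed data structure).
parent_of = {
    "public welfare for the poor": "welfare for war victims and survivors",
    "welfare for war victims and survivors": "supply affairs",
}

context = context_sequence("public welfare for the poor", parent_of)
```
]]></preformat>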
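          <p>Vectorization. This module transforms the input textual representations into
vectors through the use of word embeddings or TF-IDF. For example, consider
a document title <inline-formula><tex-math>d_j</tex-math></inline-formula>. First, the module removes all stop words, yielding <inline-formula><tex-math>d'_j</tex-math></inline-formula>. Then, when
it is set up to use word embeddings, it creates the vector representation <inline-formula><tex-math>\vec{d}_j</tex-math></inline-formula> by
applying equation (1), where <inline-formula><tex-math>n</tex-math></inline-formula> denotes the number of words of <inline-formula><tex-math>d'_j</tex-math></inline-formula> and <inline-formula><tex-math>\vec{w}_{i,j}</tex-math></inline-formula> denotes the
word embedding of the <inline-formula><tex-math>i</tex-math></inline-formula>-th word. In TF-IDF mode, the module
assigns a numerical value to each word in <inline-formula><tex-math>d'_j</tex-math></inline-formula>, building the TF-IDF vector <inline-formula><tex-math>\vec{d}_j</tex-math></inline-formula>.</p>
          <p>Context integration. This module incorporates the hierarchical information of the
document context into the vector representation. More precisely, based on the
context sequence <inline-formula><tex-math>k_{d_j}</tex-math></inline-formula>, an exponentially weighted
document vector representation <inline-formula><tex-math>\vec{k}_{d_j}</tex-math></inline-formula> is calculated according to equation (2), where
<inline-formula><tex-math>m</tex-math></inline-formula> is the length of the context sequence, <inline-formula><tex-math>\vec{d}_{p_i}</tex-math></inline-formula> denotes the vector representation
of the <inline-formula><tex-math>i</tex-math></inline-formula>-th context object, and <inline-formula><tex-math>w</tex-math></inline-formula> denotes a hyperparameter that determines the
importance of the context. The weights ensure that document vectors are scaled
based on their relative position in the hierarchy.</p>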
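          <p>The two steps can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the weight in equation (2) is read as 1/w<sup>i</sup>, words without an embedding are skipped, a title without any known word maps to the zero vector, and the function names and toy embeddings are invented.</p>
          <preformat><![CDATA[
```python
import numpy as np


def title_vector(title, embeddings, stop_words):
    """Equation (1): mean of the word embeddings of the non-stop words."""
    words = [w for w in title.split() if w not in stop_words]
    # assumption: words without an embedding are skipped
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:  # assumption: no known word yields the zero vector
        dim = len(next(iter(embeddings.values())))
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


def with_context(doc_vector, context_vectors, w=1.1):
    """Equation (2), read as: add the i-th ancestor vector scaled by 1 / w**i."""
    result = np.array(doc_vector, dtype=float)
    for i, ancestor in enumerate(context_vectors, start=1):
        result += np.asarray(ancestor, dtype=float) / w ** i
    return result


# Toy 2-dimensional embeddings, invented for illustration only.
embeddings = {
    "welfare": np.array([1.0, 0.0]),
    "poor": np.array([0.0, 1.0]),
    "supply": np.array([1.0, 1.0]),
}
stop_words = {"for", "the"}

d_j = title_vector("welfare for the poor", embeddings, stop_words)
k_dj = with_context(d_j, [title_vector("supply", embeddings, stop_words)])
```
]]></preformat>
          <p>With w slightly above 1, as in the experimental setup, close ancestors contribute almost as much as the title itself, and the contribution shrinks geometrically for more distant ancestors.</p>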
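          <p>Top k selector. After the featurization process, both the document <inline-formula><tex-math>\vec{k}_{d_j}</tex-math></inline-formula> and the
categories <inline-formula><tex-math>\vec{C}</tex-math></inline-formula> are represented by vectors, which can be used to measure their
semantic similarity. To do so, this module employs cosine similarity and
yields the top <inline-formula><tex-math>k</tex-math></inline-formula> categories <inline-formula><tex-math>C' \subseteq C</tex-math></inline-formula> as the prediction.</p>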
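          <p>A minimal sketch of this selection step is given below. The function name, the toy category vectors, and the treatment of zero vectors are assumptions made for illustration; only the use of cosine similarity and the cut-off at k come from the text.</p>
          <preformat><![CDATA[
```python
import numpy as np


def top_k_categories(doc_vector, category_vectors, k=5):
    """Rank category labels by cosine similarity to doc_vector; return the best k."""
    doc = np.asarray(doc_vector, dtype=float)
    scored = []
    for label, vec in category_vectors.items():
        vec = np.asarray(vec, dtype=float)
        norm = np.linalg.norm(doc) * np.linalg.norm(vec)
        # assumption: a zero vector is treated as having similarity 0
        similarity = float(doc @ vec / norm) if norm else 0.0
        scored.append((similarity, label))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored[:k]]


# Toy category vectors, invented for illustration only.
categories = {
    "orphan welfare": [0.9, 0.1],
    "unemployment benefits": [0.7, 0.3],
    "navy officer": [0.0, 1.0],
}
predicted = top_k_categories([1.0, 0.0], categories, k=2)
```
]]></preformat>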
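          <p>
            <disp-formula id="eq-1"><label>(1)</label><tex-math>\vec{d}_j = \frac{1}{n} \sum_{i=1}^{n} \vec{w}_{i,j}</tex-math></disp-formula>
            <disp-formula id="eq-2"><label>(2)</label><tex-math>\vec{k}_{d_j} = \vec{d}_j + \sum_{i=1}^{m} \frac{1}{w^{i}} \vec{d}_{p_i}</tex-math></disp-formula>
          </p>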
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results and Discussion</title>
      <p>This section reports the preliminary results of the proposed approach and
discusses the open challenges related to archive document categorization.</p>
      <p>Experimental setup. The weighting factor for the context embeddings is set
to w = 1.1. Based on the average of 3.16 assigned categories per document
in the gold standard, the classification parameter k is set to 5.</p>
      <p>Dataless results. The results of our dataless approach are reported in Table 1.
The approach does not achieve satisfactory results in terms of precision, recall, and
F-measure. The TF-IDF representation obtains the best overall
performance and outperforms the semantic representations when the title and context
are considered. The fine-grained classification task hinges on small differences
between specific words, which are neglected within a semantic space. For
example, "officer" and "navy officer" are different categories with a high cosine similarity
for semantic embeddings, but a low cosine similarity for TF-IDF. Consequently,
it is easier to differentiate between the two categories within the TF-IDF space.
In addition, the high-dimensional space of TF-IDF is better suited to store the
different aspects gathered by combining the title and context. Overall, Table 1
shows that 1) the context of documents plays an important role in achieving
better performance, and 2) basic semantic representations are not sufficient to solve a
fine-grained dataless classification task. However, considering the high number of
possible categories (799), the results are encouraging, because a purely random
assignment of the average number of categories would achieve an F-measure
baseline of 0.004. Also, a manual review of the results shows that many suggested categories
are relevant to the input document. For example, the input title "public welfare for
the poor" is assigned to the categories "unemployment benefits" and "orphan welfare".</p>
      <p>A considerable issue of this task is that archivists rely on their experience and
expertise when classifying documents. It is a challenge to include this background
knowledge in any automated process, and most likely this categorization task will
always need humans in the loop to deliver satisfactory results.</p>
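      <p>The random baseline of 0.004 quoted in the results can be reproduced with a short calculation, under the assumption (not spelled out in the text) that the baseline draws r = 3.16 categories uniformly at random from the N = 799 categories and that a document has r gold categories on average. The expected number of correct predictions is then r · r/N, so precision and recall coincide:</p>
      <p>
        <disp-formula><tex-math>P = R = \frac{r}{N} = \frac{3.16}{799} \approx 0.00395, \qquad F = \frac{2PR}{P + R} = P \approx 0.004</tex-math></disp-formula>
      </p>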
    </sec>
    <sec id="sec-6">
      <title>Visual Exploration</title>
      <p>
        In this section, we briefly introduce how the vector representations can be used
to visualize the search space through the Embedding Projector tool [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This tool
enables the interactive visualization of embeddings, uses PCA and t-SNE
to perform dimensionality reduction, and plots the resulting representations as a point
cloud. Additionally, it depicts the nearest neighbours of a selected embedding
and provides functionality to search based on metadata and to restrict the plotted
points to a specific subset. The visualization is available online<sup>8</sup>.
      </p>
      <sec id="sec-6-1">
        <p><fn id="fn8"><label>8</label><p>http://vocol-ise.fiz-karlsruhe.de/</p></fn></p>
        <p>The visualization supports further research on text categorization by presenting the document arrangement
within a graphical space to domain experts. It can be used to evaluate the vector
representations of documents and categories as well as to support the manual
labeling process by enabling the search for similar categories, e.g., a domain expert
finds the more specific category "flu" by looking for neighbours of the generic
category "disease". This makes it possible to improve the accuracy of the gold
standard by pointing to semantically similar but less frequently used categories.</p>
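        <p>For readers who want to load their own vectors into the Embedding Projector, the sketch below writes the two tab-separated inputs that the standalone tool accepts: one row of numbers per embedding, and one metadata row per embedding. How the authors prepared their data is not described, so the function name and the example values are assumptions.</p>
        <preformat><![CDATA[
```python
import csv
import io


def to_projector_tsv(labels, vectors):
    """Return (vectors_tsv, metadata_tsv) strings in the tab-separated layout
    that the standalone Embedding Projector accepts for upload."""
    vec_buf, meta_buf = io.StringIO(), io.StringIO()
    vec_writer = csv.writer(vec_buf, delimiter="\t", lineterminator="\n")
    meta_writer = csv.writer(meta_buf, delimiter="\t", lineterminator="\n")
    for label, vector in zip(labels, vectors):
        vec_writer.writerow([f"{value:.6f}" for value in vector])
        meta_writer.writerow([label])  # single column, so no header row
    return vec_buf.getvalue(), meta_buf.getvalue()


# Invented example values; real vectors would have 300 dimensions.
vectors_tsv, metadata_tsv = to_projector_tsv(
    ["disease", "flu"], [[1.0, 0.0], [0.5, 0.25]]
)
```
]]></preformat>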
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion and Future Work</title>
      <p>This paper presents preliminary results of a semantic-based approach to dataless
categorization that supports archivists in exploring and annotating
archive documents. Moreover, the first version of a visual exploratory
tool for supporting the manual exploration and annotation tasks is introduced.
Both proposed methods are based on semantics captured by training vector space
models and can be used in many unsupervised settings. Future work will enhance
the vector representations by exploiting taxonomic relations that occur between
categories and by integrating external resources such as DBpedia and authority files.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ratinov</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srikumar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Importance of semantic representation: Dataless classification</article-title>
          .
          <source>In: AAAI</source>
          . vol.
          <volume>2</volume>
          , pp.
          <fpage>830</fpage>
          –
          <lpage>835</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Learning word vectors for 157 languages</article-title>
          .
          <source>arXiv preprint arXiv:1802.06893</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hengchen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coeckelbergs</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Hooland</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al.:
          <article-title>Exploring archives with probabilistic models: Topic modelling for the valorisation of digitised archives of the European Commission</article-title>
          .
          <source>In: IEEE Int. Conf. on Big Data</source>
          . pp.
          <fpage>3245</fpage>
          –
          <lpage>3249</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Herrmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zahnhausen</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Auf dem Weg zum Digitalen Lesesaal: Das Projekt 'Weimar – Die erste deutsche Demokratie'</article-title>
          .
          <source>In: Kulturelles Kapital und ökonomisches Potential. Zukunftskonzepte für Archive. 86. Deutscher Archivtag</source>
          .
          <publisher-name>Verband deutscher Archivarinnen und Archivare e.V.</publisher-name>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>3111</fpage>
          –
          <lpage>3119</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Smilkov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thorat</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nicholson</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reif</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Viegas</surname>
            ,
            <given-names>F.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wattenberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Embedding projector: Interactive visualization and interpretation of embeddings</article-title>
          .
          <source>arXiv preprint arXiv:1611.05469</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>On dataless hierarchical text classification</article-title>
          .
          <source>In: Twenty-Eighth AAAI Conference on Artificial Intelligence</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sprugnoli</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tonelli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Novel event detection and classification for historical texts</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>45</volume>
          (
          <issue>2</issue>
          ),
          <fpage>229</fpage>
          –
          <lpage>265</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Turker, R., Zhang,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Koutraki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Sack</surname>
          </string-name>
          , H.:
          <article-title>Knowledge-based short text categorization using entity and category embedding</article-title>
          .
          <source>In: European Semantic Web Conference</source>
          . pp.
          <fpage>346</fpage>
          –
          <lpage>362</lpage>
          . Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tian</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification</article-title>
          .
          <source>Neurocomputing</source>
          <volume>174</volume>
          ,
          <fpage>806</fpage>
          –
          <lpage>814</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>