=Paper=
{{Paper
|id=Vol-3110/paper7
|storemode=property
|title=Graph Technologies for the Analysis of Historical Social Networks Using Heterogeneous
              Data Sources
|pdfUrl=https://ceur-ws.org/Vol-3110/paper7.pdf
|volume=Vol-3110
|authors=Sina Menzel,Mark-Jan Bludau,Elena Leitner,Marian Dörk,Julián Moreno-Schneider,Vivien Petras,Georg Rehm
}}
==Graph Technologies for the Analysis of Historical Social Networks Using Heterogeneous
              Data Sources==
<pdf width="1500px">https://ceur-ws.org/Vol-3110/paper7.pdf</pdf>
<pre>
 Graph Technologies for the Analysis of
   Historical Social Networks Using
     Heterogeneous Data Sources

    Sina Menzel∗1            Mark-Jan Bludau∗2              Elena Leitner∗3
   Marian Dörk2           Julián Moreno-Schneider3             Vivien Petras1
                                  Georg Rehm3
           ∗ The authors contributed equally to this work as first authors.


                         1 Humboldt-Universität zu Berlin
                  2 FH Potsdam – University of Applied Sciences
    3 DFKI – Deutsches Forschungszentrum für Künstliche Intelligenz GmbH


                                   Abstract
  Over the last decades, cultural heritage institutions have provided
  extensive machine-readable data, such as bibliographic and archival
  metadata, full-text collections, and authority records containing mul-
  titudes of implicit and explicit statements about the social relations
  between various types of entities. In this paper, we discuss how ap-
  proaches to the creation and operation of advanced research infrastruc-
  ture for historical network analysis (HNA) based on heterogeneous
  data sources from cultural heritage institutions can be examined and
  evaluated. Based on our interdisciplinary research, we describe chal-
  lenges and strategies with a special focus on the issue of data processing,
  sketch out the advantages of human-centered project design in the
  form of a preliminary co-design workshop, and present an iterative ap-
  proach to data visualization.


        Creative Commons License Attribution 4.0 International (CC BY 4.0).
In: Tara Andrews, Franziska Diehr, Thomas Efer, Andreas Kuczera and Joris van Zun-
dert (eds.): Graph Technologies in the Humanities - Proceedings 2020, published at
http://ceur-ws.org.


                                        124
1        Introduction
The study of historical events is relevant to many disciplines in the digital hu-
manities, with the analysis of relationships between agents often being cru-
cial for the understanding and explanation of social, political, and cultural
phenomena. Given that historical research is heavily dependent on informa-
tion from the respective time period, the combination of as many historical
sources as possible is essential for the reconstruction of historical networks
– and this is where the method of historical network analysis (HNA) comes
into play. Derived from social network analysis, HNA is characterized by the
same dependency on numerous historical sources that ideally support each
other (Jansen and Wald, 2007).
   One limiting factor in HNA can be a lack of awareness with regard to
the availability of suitable research data. At the same time, over the past
decades, cultural heritage institutions have produced very large amounts of
machine-readable and, in many cases, standardized and well-organized data
in the form of bibliographic and archival metadata, full-text collections, and
sets of authority or reference records. These datasets contain a plethora of
implicit and explicit statements about social relations, which can in turn be
exploited for HNA research. However, systematically combining multiple
data sources (not to mention extracting and visualizing the complex resulting
networks) currently requires extensive knowledge in graph theory as well as
time-consuming manual work carried out by the individual researcher. One
reason for this is the heterogeneity of the data sources made available by cul-
tural heritage institutions, for example, in terms of data formats.
   The research project SoNAR (IDH): Interfaces to Data for Historical So-
cial Network Analysis and Research1 addresses this issue. We examine and
evaluate approaches to the development and operation of HNA-supporting
research infrastructure based on heterogeneous cultural heritage data. In
this paper, we present a number of preliminary insights related to the pro-
cess of modeling and transforming heterogeneous data sources, and to the
design of user-centered visualization for historical social networks. By shar-
ing our approach and its accompanying challenges, we aim to contribute to
the ongoing discussion on the suitability of bibliographic big data for HNA
and the development of corresponding research technologies.

2        Related Work
The following section gives an overview of previous research, and discusses
projects related to graph modeling and visualization approaches within the
    1
        https://sonar.fh-potsdam.de


                                      125
digital humanities from the perspective of historical network analysis.

2.1    Related Projects
In recent years, open knowledge graphs have frequently been used as an al-
ternative to a document-based approach (Auer and Mann, 2019). Several
large-scale initiatives such as EOS,2 Europeana,3 and CLARIN4 provide re-
searchers in the digital humanities with access to cultural data. Meanwhile,
the issue of decentralized and heterogeneous bibliographic data sources is
being addressed by projects such as Culturegraph (Vorndran, 2018) and
DARIAH-DE5 in the digital humanities, Lynx6 in the legal domain, and, to
a certain extent, ELG7 in language technology (Rehm et al., 2020). Most of
these initiatives connect to infrastructures of cultural heritage institutions,
often hosted by libraries or archives.
   Even though these initiatives provide, among other things, access to new,
previously unidentifiable or implicit information, they do not primarily fo-
cus on the extraction of network data. Therefore, HNA researchers are often
left to create their own individual graphs after gathering data that is suitable
to address their research question(s), in many cases using open source soft-
ware tools such as Gephi,8 Palladio,9 or VennMaker,10 .
   Along with the establishment of network analysis as a method in historical
research, there has been an increase in joint research projects that are focused
on the extraction of historical networks within the social sciences and the
humanities.
   For example, the project Six Degrees of Francis Bacon11 applies statistical
methods to the base data with the goal of inferring relations that permit the
reconstruction and visualization of historical social networks in Early Mod-
ern Britain. The project allows for the expansion and curation of the data
through collaborative annotation by the users (Warren et al., 2016). The
histoGraph12 project follows a similar approach by offering users an oppor-
tunity to collaboratively explore and research historical social networks by
   2
     European Open Science Cloud, https://www.eosc-portal.eu
   3
     https://www.europeana.eu
   4
     Common Language Resources and Technology Infrastructure, https://www.clarin.eu
   5
     https://de.dariah.eu
   6
     http://www.lynx-project.eu
   7
     https://www.european-language-grid.eu
   8
     https://gephi.org
   9
     https://hdlab.stanford.edu/palladio/
  10
     https://www.vennmaker.com
  11
     http://www.sixdegreesoffrancisbacon.com
  12
     http://histograph.eu


                                        126
means of extensive multimedia collections, with a special focus on crowd-
sourced indexation (Novak et al., 2014). In a joint project involving several
European research institutions, Issues with Europe – A Network Analysis
of the German-Speaking Alpine Conservation Movement (1975-2005)13 is
currently examining the disputes over European alpine transit policy, while
the Austrian project APIS – Mapping historical networks has been working
on the extraction and visualization of networks from more than 18,000 re-
cords in the Austrian Biographical Encyclopedia.14 Finally, the German pro-
ject Gesellschaftliche Wissensproduktion in der Aufklärung – Text- und net-
zwerkanalytische Diskursrekonstruktion considers full texts of more than
300 periodicals published in Halle, Germany between 1688 and 1815, and
combines the methods of topic modeling with historical network analysis in
order to systematically analyze public discourse during the Age of Enlight-
enment (Purschwitz, 2018).
   These are only a few examples of the ongoing efforts to provide users with
direct access to networks in existing data collections. In our project, we are
working with data sources that have not been modeled for HNA before. Our
generic data approach is closely connected to similar projects, like the North
American cooperative SNAC – Social Networks in Archival Context15 and
the French project PIAAF,16 , which both have a strong focus on archival
metadata and full texts.

2.2     Network Visualization
As far as the visualization of data for HNA is concerned, many interfaces
have been developed over the years that offer explorative, web-based net-
work visualization tools for historical network analysis. Examples include
the above-mentioned Six Degrees of Francis Bacon (Warren et al., 2016) and
histoGraph (Novak et al., 2014), as well as Visualizing the Republic of Let-
ters (Chang et al., 2009), Kindred Britain,17 and Deutsche Biographie.18
    Graph visualization is an extensive field in itself, which is accompanied
by a substantial body of literature on issues such as graph-related algorithms
(e. g. Gibson et al., 2012; Jacomy et al., 2014; Behrisch et al., 2016), task tax-
onomies for graph visualization (e. g. Lee et al., 2006; Ahn et al., 2013; Ker-
racher et al., 2015), state-of-the-art visualization interaction techniques and
   13
      https://www.uibk.ac.at/projects/issues-with-europe/index.html.en
   14
      Österreichisches Biographisches Lexikon, https://apis.acdh.oeaw.ac.at
   15
      https://snaccooperative.org
   16
      Pilote d’interopérabilité pour les autorités archivistiques françaises https://piaaf.demo.
logilab.fr
   17
      http://kindred.stanford.edu
   18
      https://www.deutsche-biographie.de


                                             127
developments (e. g. van Ham and Perer, 2009; von Landesberger et al., 2011;
Pienta et al., 2015), as well as the use of visual facilitators for the construction
of graph queries (e. g. Pienta et al., 2017). Nevertheless, existing research and
taxonomies mostly address the wider field of graph visualization. More of-
ten than not, visualizations and digital practices are not specifically adapted
to the requirements of HNA research or established data practices in the hu-
manities, and are ill-suited to address issues such as uncertainty, subjectivity,
or observer-dependence (Drucker, 2011).

2.3   Human-Centered Design
A key element in the examination and development of a new research infra-
structure designed for human-computer interaction is how well it meets the
needs of the people it is intended to assist. This human-centered approach
is closely related to Grounded Theory, which generates inductive results by
means of sociological methods (Glaser and Strauss, 1967).
    Isenberg et al. (2008) adapted Grounded Theory for the evaluation of in-
formation visualizations. They suggest iterative evaluation throughout the
process of system development using several points of qualitative inquiry to
ensure the focus of a system’s intended use, including field research to ex-
amine potential contexts of human interaction with the system. In keep-
ing with this argument for grounded evaluation, the neuralgic points for
evaluation in our project are based on Munzner’s nested model for visual-
ization design and validation (Munzner, 2009), which allows for iterative
improvement of the prototypes. The stages of evaluation include the assess-
ment of possible use cases, and the investigation of the problems and data
of a particular user domain at the top level. In order to better address such
issues, it is becoming more and more common to include domain experts
in the creation process of digital humanities-related projects. This kind of
co-creation is precisely what Chen et al. (2014) attempted to foster with a
workshop, wherein the participants were asked to create collages to make
sense of a photo archive with the aim of creating collection-sensitive inter-
faces. Henry and Fekete (2006) used a similar participatory approach in the
development of a tool for the exploration of social networks: they invited
social science researchers to create paper prototypes, which in turn led to a
list of domain requirements for their tool and resulted in a prototype with
novel features. A thorough evaluation of such co-creation methods, conduc-
ted in a co-design process with social science researchers, found that domain
experts in general appreciate their additional empowerment in the process
and the domain-customized results based on their specific needs. Neverthe-
less, regarding their personal involvement and necessary time commitment,


                                       128
some participants did not perceive their personal involvement as beneficial
for the facilitation of their own research (Molina León and Breiter, 2020).
Besides the use of co-design techniques, there is also a shift from perceiving
visualizations as mere tools for humanities-related research towards the ac-
knowledgment of visualization and visualization processes as a methodology
and facilitator of cross-disciplinary research in and of itself (Hinrichs et al.,
2019). While we have noticed increased attention to the method of HNA,
to the best of our knowledge, there has so far been little investigation of the
modeling and visualization of (bibliographical) big data for this purpose.

3        Data Sources
The interdisciplinary project SoNAR (IDH), which studies the potential of
large heterogeneous data collections for HNA, includes partners from the
fields of historiography, information visualization, and artificial intelligence,
as well as computer and information science. This variety of disciplines
opens different perspectives on the requirements and challenges connected
to the use of heterogeneous (meta)data for HNA. What distinguishes our
approach is the synchronous operation of all components of the project, so
that the design of the data technology, the development of a model research
design for HNA, and the development of innovative visualization and inter-
face approaches with the involvement of HNA experts are all intertwined
and influence one another.
   The project is based on heterogeneous source data from authority files,
bibliographic records, and full texts. The data is available in various XML-
based formats such as MARC21 (Kruk et al., 2005), EAD (Allison-Bunnell,
2016), and METS/ALTO19 (Cantara, 2005):
    • The Integrated Authority File (GND)20 represents and describes
      8,295,047 entities (people, corporations, conferences, geographical
      areas, technical terms, and works);
    • The German National Library (DNB)21 provides descriptions of bib-
      liographic resources. The dataset has 19,926,573 records of books,
      magazines, newspapers, sheet music, music recordings, audio books
      etc.;
    • The German Union Catalogue of Serials (ZDB)22 describes newspa-
      pers, magazines, serial titles, yearbooks, etc. and contains 1,908,334 re-
      cords;
    19
       http://www.loc.gov/standards/alto
    20
       https://www.dnb.de/EN/Professionell/Standardisierung/GND/gnd_node.html
    21
       https://www.dnb.de/EN/Home/home_node.html
    22
       https://zdb-katalog.de/index.xhtml


                                               129
    • The Kalliope Union Catalog (KPE)23 is a collection of personal pa-
      pers, manuscripts, and publishers’ archives, which consists of 26,752
      records;
    • The Newspaper Information System (ZeFYS)24 represents 2,596,641
      digitized pages of historical newspapers and full texts;
    • The Exile Press25 represents German-language exile journals between
      1933 and 1945 and consists of 5,336 digitized pages.
   Since the source data – describing entities (authority files) and resources
(bibliographic files) – is encoded in various formats, these formats must first
be analyzed in order to enable the design of an appropriate data model and
allow their transformation into a uniform, generic format. Full texts are pre-
pared for automatic enrichment (i. e. named entity recognition and linking)
and converted to a corresponding format.

4        Data Processing
In this section, we will give an overview of the data transformation and graph
modeling process, and outline the challenges that we have encountered along
the way.
   The technical goal of our project is the integration of the various source
datasets into a common research infrastructure. We currently use the graph
database Neo4j,26 which is well suited to the efficient storage and high-
performance analysis of large amounts of highly networked information
(Efer, 2016; Matschinegg and Nicka, 2018; Wintergrün, 2019). Entities are
modeled as nodes and relations as edges with absolute and relational features.
   There are a total of 9 entity types extracted from the source data:
    1. Person PerName;
    2. Corporate body CorpName;
    3. Place or geographic name GeoName;
    4. Conference or event MeetName;
    5. Subject heading TopicTerm;
    6. Work UniTitle;
    7. Temporal information ChronTerm;
    8. Information about ISIL27 IsilTerm;
    9. Resource Resource.
    23
       https://kalliope-verbund.info/en/index.html
    24
       http://zefys.staatsbibliothek-berlin.de/index.php?id=start&L=1
    25
       https://www.dnb.de/EN/Sammlungen/DEA/Exilpresse/exilpresse_node.html
    26
       https://neo4j.com
    27
       International Standard Identifier for Libraries and Related Organizations


                                            130
    Six entity types (i. e., person PerName, corporate body CorpName, place or geo-
graphic name GeoName, conference or event MeetName, subject heading TopicTerm,
and work UniTitle) are taken from the corresponding classes of the authority
files. Bibliographic entities are represented as Resource. We added two types to
this list: ChronTerm, which describes temporal information encoded in entity
types from authority files; and IsilTerm, which is used to identify the librar-
ies related to other entity types. Each entity has general features, such as a
unique source identifier, URI, name, link, etc., and specific features, such as
age, gender, coordinates, etc. Furthermore, there are also nine relation types
that correspond to entity types, such as RelationToPerName, RelationToCorpName,
RelationToGeoName. Relations between entities include information about the
relation source, relation source type, information about temporal validity,
and additional information (Figure 1).

4.1      Data Model
While the relations between entities are explicitly described in authority files,
relations between actors such as persons or corporate bodies that are identi-
fied or defined in the resource are only implicitly encoded in bibliographic
files. Our aim is to automatically infer these implicit relations with the assist-
ance of a set of strict guidelines (e. g. a connection between two persons can
be assumed if both are co-authors of a scientific publication), and to make
them available as explicitly encoded data. In order to derive corresponding
relation types, the role of actors regarding a specific resource (e. g. as author,
editor, or addressee) and the resource type (bibliographic files of primary
sources of the Kalliope Union Catalog and of secondary sources of the Ger-
man National Library and the German Union Catalog of Serials) are to be
taken into account. Using this approach, we were able to infer additional
relations (but marked them as computed), for instance between co-authors,
co-publishers, and authors/addressees, to further enrich the data.
    In order to prepare full texts for analysis, named entities are automatically
recognized, disambiguated, and linked to their associated authority files (e. g.
the Integrated Authority File or Wikidata28 ). Next, relations between detec-
ted entities are automatically recognized, added to the graph database, and
connected with their respective full texts, represented as nodes.

4.2      Challenges and Solutions
Overall, the authority and bibliographic files used by us contain approxim-
ately 30 million records that describe entities and resources in detail. As was
to be expected, normalization of the data revealed a number of errors and
  28
       https://www.wikidata.org


                                       131
Figure 1: Data modeled in Neo4j. Persons are shown in blue, locations in green, sub-
ject headings in light brown, works in pink, ISILs in purple, temporal expressions
in red, and resources in light blue.


                                       132
inconsistencies. In this section, we would like to describe some particularly
problematic areas in more detail and suggest possible solutions.
    We have modeled and transformed data for the graph database in such
a way that identifiers are used as coordinates for relations between entities.
In the Integrated Authority File, entities with old identifiers were found, so
that an appropriate connection of two entities was not possible. The first
challenge was to detect old identifiers and replace them with valid ones in
order to enable error-free representation. All replacements were written in a
log file. However, during a consistency check we also found relations to en-
tities within the source data that were without identifiers. Since such entities
could not be clearly assigned to existing entities with identifiers, ambiguous
relations of this kind had to be ignored.
    Information that was encrypted in internal codes in the Integrated Au-
thority File, the German National Library, and the German Union Cata-
logue of Serials (in format MARC21) was also checked for codes of general
and specific entity types, codes of relation types between an agent and a re-
source, and country codes. Further examinations were performed on the
consistency of entity names, resource titles, and identifiers. Again, all errors
or inconsistencies were written in a log file.
    Building on the conclusions that we were able to draw from testing Neo4j,
we decided to adapt the data model to our needs. In order to simplify search-
ing and filtering according to temporal dimension, time information from
the source data was adjusted. First, while retaining the source data, we addi-
tionally separated time intervals, noted as “begin” and “end.” Second, in or-
der to facilitate more performant visualization and querying of the data, we
added a feature to resource descriptions that reflects the year of publication
(in addition to the publication date). Thirdly, differing time expressions in
MARC21 and EAD were normalized.
    We also decided to change gender-specific names of professions. These
are represented in the Integrated Authority File as two different entities with
their own identifiers, male and female. Conceptually, however, what we are
dealing with is a single entity with two versions, so these versions must be
merged in the graph database and represented as a node. One challenge is
to adequately display all information from the two versions without making
the search more difficult. In this case, we are currently still looking for a
suitable solution.

5    Co-Design Workshop
In accordance with the principles of grounded evaluation (Isenberg et al.,
2008), we aim to closely integrate domain experts into the data modeling


                                     133
and visualization process. The presence of HNA experts in our project team
means that all internal decisions that are made take the domain perspective
into account. Additionally, the inclusion of external domain experts is an-
other integral part of our research design. Conducting studies with research-
ers from various fields allows us to iteratively improve the project’s outcome.
   At the beginning of the project, it was important to us to stimulate dis-
cussions on the potential of bibliographic (meta)data for HNA, and on the
requirements for the visualization of historical networks. Following the ap-
proach proposed by Chen et al. (2014) and Henry and Fekete (2006), we or-
ganized a co-design workshop that included domain experts in order to help
identify key aspects and gain new insights into historical network research
and visualization.

5.1   Procedure
The workshop consisted of ten participants, including four historical/social
network practitioners as domain experts, two project-internal information
visualization designers/engineers, two members of our project-internal eval-
uation team, one member of our team of data scientists (responsible for the
data transformation), and an external participant who had a background in
design and previous experience with the co-design format. The interdiscip-
linary composition of the group was intended to enrich the discussion by
offering a multitude of perspectives on the topic of HNA through the lens
of HNA experts, with fresh insights being provided by participants from
other (project-relevant) fields. Since we aim to develop an infrastructure for
HNA that can be used by researchers from all disciplines working with this
method, the participation of experts from fields other than history was espe-
cially welcome.
   The workshop was scheduled for three hours in total. As suggested by
Fekete and Plaisant (2002), we started off with a brief presentation of various
recent developments in the field of network visualization, including some of
the more novel and experimental approaches.
   We started the process of conceptualizing network visualizations with a
short, hands-on visualization exercise, during which the participants were
asked to visualize a very small social network (ten nodes) based on a data
matrix we provided. After this warm-up, we gave a short introduction to the
goals of our project and the data we are using. The participants were then
asked to create a collage depicting possible approaches to HNA research with
our specific data and project in mind (see Figure 2). For the collages, we
supplied a variety of materials (e. g. construction paper, pencils and mark-
ers, sticky notes). While Chen et al. (2014) provided visual material from


                                     134
  Figure 2: Selected collages created in the interdisciplinary co-design workshop

their photographic collection, our data is more abstract and less visual. To
compensate for this, we printed out and distributed further visual material
including an empty map, various icons (e. g. as representations of network
nodes), and a small number of scans from our full-text data sources. We then
introduced several questions to help initiate the creative process (e. g. “How
would you like to move through the data?” and “What role do data dimen-
sions such as time, space, or semantic relationships play?”), but encouraged
the participants to feel free to disregard them. The task we had in mind was
not to create wireframe sketches for a concrete user interface, but to envision
desired functionalities as well as general approaches and entrance points to
HNA research and our data.
   After about 30 minutes, each of the collages was discussed. First, the parti-
cipants not involved in making a collage were asked to interpret and speculate
about what they were seeing. Afterwards, the creators of the collages were
asked to give explanations and discuss their approach with the group. In
this step, the almost inevitable misinterpretations were meant to foster fur-


                                       135
ther discussion and novel ideas. In the final step, each participant was asked
to give a closing statement recapitulating the most important insights from
the process and the most prominent topics or themes in the discussion.
    For further analysis and documentation, the entire workshop was audio-
recorded and photographed. The audio recordings were transcribed and en-
coded in a tool for qualitative data analysis. This allowed us to assess various
qualitative aspects of the workshop discussions at a later date. As pointed
out above, the goal of our workshop was not to create functional wireframes
or concrete interaction principles, but to stimulate discussion, foster sensib-
ility towards the domain and data, and highlight important domain-specific
research aspects and challenges. The following section will discuss some of
the most relevant insights concerning our visualization process.

5.2   Results
We noticed two different types of statement. On a more abstract level, the
participants expressed various information needs that commonly arise in the
process of their research. In some cases, however, the conversation and the
collages yielded very concrete ideas regarding possible features of an HNA
infrastructure that would address these needs. As mentioned before, the lat-
ter were not regarded as direct assignments to be fulfilled in the visualization
process, but rather as indicators for the participant’s general receptiveness
towards various properties of the user interface of an HNA infrastructure.
Table 1 and 2 summarize the main aspects of the workshop discussions in
the form of needs and features.
           Need                  Number of Mentions Persons
           New perspectives                 30                  7
           Uncertainty                      25                  7
           Data potential                   27                  4
           Graph density                    26                  4
           Entry points                     17                  4
           Data explanations                16                  4

Table 1: List of the most frequently expressed needs with the count of their men-
tions during the workshop and the number of persons (n=10) referring to them

   The most pressing topic in the discussions was the envisioned user ap-
proaches and use cases the infrastructure is expected to support. Seven of
the ten participants expressed the hope that the visualizations could gener-
ate new perspectives, thereby creating forms of access to the data that would
hardly be available based on non-machine-supported cognitive work. In this


                                      136
               Feature                   Number of Mentions Pers.
               Timeline                             22               4
               Tie metrics                          18               4
               Other filters                        16               3
               Export and citation                  13               4
               Location filter                      8                3
               Source linking                       6                4

Table 2: Most frequently desired features with the count of their mentions during
the workshop and the number of persons (n=10) referring to them

context, one participant explicitly emphasized the potential of visualizations
to raise new questions:

        What kind of relationships you are looking for in the data is something
        you often notice in the very moment that you look at the pile for the
        first time.29

   Since the participants were aware of the fact that we are confronted with
a very large amount of data, which can hardly be presented in its entirety
(see Section 3), a discussion of possible entry points emerged. There was
consensus on the importance of filter options, most importantly time filters:

        Without timelines, the visualizations are of no use to me – neither for
        analysis, nor for the presentation of results.

   In addition to timelines, other filters (e. g. node type and node source)
were considered a prerequisite for data exploration. Three participants also
mentioned the importance of location filters (e. g. through a map view).
   Participants with more HNA experience explicitly stressed the essential
role of a multi-layered approach. The capacity to display the evolution of
relations (e. g. through time and location) was described as the distinctive
feature of HNA when compared to the non-historical analysis of social net-
works. The sole option of static display was considered insufficient.
   Along with possible entry points, another important topic raised in the
discussion was data complexity, with introductions and explanations regard-
ing the underlying data being identified as particularly crucial. Some parti-
cipants suggested addressing this issue with the help of concrete use cases that
could give potential users a more specific idea of the possibilities afforded by
the HNA infrastructure.
  29
       All quotes translated from German into English.


                                           137
   About half of the participants cited the ability to quantify network char-
acteristics as graph metrics during the research process as one of their main
motivations for using HNA methods. This includes indicators such as the
clustering coefficient, closeness centrality, degree distribution, degree cent-
rality, and betweenness centrality. Four participants also considered dens-
ity within a selected sample of nodes to be a relevant indicator for a given
dataset’s potential for network analysis. After the first cluster of possible ap-
proaches had been discussed, one participant highlighted the added value of
graph metrics when it comes to the identification of anomalies in the data:

     What all these things are actually about is that we are looking for pat-
     terns!


   Some participants also stressed the potential of tie metrics to accommod-
ate a variety of relation types, and expressed the desire to have the weight of
edge properties visualized:

     It is of course a big difference whether you are a family member […] or
     whether you are a correspondence partner or whether you met at a con-
     gress during a coffee break. These are all relationships, but of course
     they have different weights in their interpretation. This is, for example,
     something we would like to see in the visualization.


   This statement is representative of another central topic discussed in the
workshop, namely the visual marking of missing or uncertain information
in the data which can, for example, be the result of inconsistencies in the
metadata fields (see Section 4.2). The design expert considered this to be a
major desideratum:

     I think this is not done enough in current visualizations to show uncer-
     tainties of data.


   With regard to the scientific standards of HNA research, a final major
issue was the export and citation of the visualizations. This, of course, re-
quires unambiguous and persistent provenance links to the source of each
data point, as well as timestamps of the corresponding data import.
   Many of the results of our co-design workshop match the challenges in
information visualization discussed in the pertinent literature. In the fol-
lowing section, we will draw on these results to describe our prototyping
approach and process.


                                       138
6    Visual Prototyping
The dataset of our project is comprised of a number of elements that go well
beyond what can be perceptually or cognitively grasped at a glance. When
it comes to encoding, for example, the sheer amount of nodes and relations
poses technological as well as visual challenges (Fekete and Plaisant, 2002;
Shneiderman, 2008). While some potential users of our technology might
have a very specific research question in mind, others might be inclined to-
wards a more serendipitous approach (Thudt et al., 2012), or may wish to use
such an infrastructure in order to formulate research questions. Our aim is
to provide access points for a broad variety of motivations and research ques-
tions, including ones that we cannot as of yet anticipate. Therefore, the con-
ceptualization of a visual representation as an access point to our data in the
form of a data exploration interface can be described by a wide and diverse
range of challenges and difficulties:
    • How can tens of millions of nodes and hundreds of millions of edges
      be visualized?
    • What are possible and meaningful entrance points to the data?
    • How can we deal with uncertainty, missing data, and varying data
      sources?
    • How can we deal with multiple data dimensions?
    • How can we provide a technology that is complex and open enough for
      a broad range of undefined research questions, but simple enough for
      casual use?
    • How can we be transparent with regard to the algorithms used?
    • How can users move between overviews, detail views, and egocentric
      views?
   Even though our workshop, our conversations with domain experts, and
existing task taxonomies (e. g. Lee et al., 2006; Kerracher et al., 2015; Ahn
et al., 2013) have already yielded a multitude of potential tasks, needs, and
requirements that should be addressed in our graph technology, we see the
prototyping process as a form of research through design (Zimmerman et al.,
2007) that is not only capable of confirming these requirements, but also of
unveiling new ones. Moreover, in contrast to the above-mentioned task tax-
onomies, we are engaging with humanities-related data and research ques-
tions – a field, where traditional visualization approaches are often deemed
incompatible with the nature of the objects of inquiry Drucker (2011).
Along with the data modeling process and the co-creation approaches de-
scribed above, our visualization process can thus be described as a form of
rapid, experimental, and iterative prototyping process and data exploration.


                                     139
Figure 3: Two small design studies. Left: visualizing levels of uncertainty in edges
between nodes by using waves and varying levels of frequency. Right: concept for
handling of multiple edges between two nodes. In the initial view, multiple edges
are combined into one (marked as the red line) to reduce the overall complexity of a
graph. A click fans out the individual edges on demand, visually transitioning from
one line to multiple arcs.

Compared to the potentially shortest path to a finished ‘tool,’ our method
resembles a curiosity-driven ‘sandcastling’ (Hinrichs et al., 2019). We un-
derstand experimental approaches and detours in the visualization process
itself as a methodology of knowledge production. By following this route,
visualizations are not necessarily created with the goal of implementing them
in a final prototype or concept. Rather, they become a method for explor-
ing the data or individual facets of the data, a tool for investigating the ba-
sic challenges of data or their encoding, or a visual facilitator for encour-
aging cross-disciplinary communication and the development of novel and
thought-provoking approaches (Hinrichs et al., 2019).
    From the beginning, the entire project has been conducted in an interdis-
ciplinary and concurrent mode, without any delays between its individual
steps; data processing, case study development, visualization, and evaluation
all occur alongside one another. Initially, the data was neither processed for
visualization, nor was it accessible via some form of API, which only allowed
us to work with small subsets of selected data. While this made it difficult
to anticipate all of the facets and challenges associated with handling the full
extent of the data, working with data subsets early on gave us the ability to
exert iterative influence on the data processing and the data model.
    Instead of trying to combine all potential features and ideas into a single
prototype, our approach focuses on small, separate problems and ideas
through a multitude of rough prototypes. Many of our design studies or
prototypes have been developed in close collaboration with our own HNA
specialists, and/or draw extensively on input from our workshop or other ex-
ternal sources of expertise, whereas others are more experimental in nature,


                                       140
Figure 4: Prototype overviews of a specific data facet (in this case, topic terms re-
lated to persons), based on a selected year. A Voronoi map displays distributions of
topic terms connected to persons alive in a selected year. Orange represents female-
gendered terms.

and are often the product of spontaneous impulses. For the most part, the
following examples were designed with the data visualization library D3.js
(Bostock et al., 2011), which permitted the development of customized visu-
alizations.
    Figure 3, for instance, shows two small design studies from the beginning
of the project, without using real data: the one to the left is a visualization of
levels of relation uncertainty, while the one to the right represents the testing
of an interaction concept with the goal of reducing complexity by merging
multiple edges and allowing users to fan them out on demand.
    As an example of the influence exerted by visualization on the data model,
an early prototype which clusters persons in a small subset of the data based
on related topic terms – in most cases occupational titles – revealed that these
titles are frequently gendered30 in our base data (GND), which means that
men and women are often not related to the same topic term, even though
they practice the same profession. This unexpected differentiation in the
data is highly relevant when it comes to search queries and visualization,
since it is quite possible that some researchers do not differentiate by gender,
and only use the male form that was traditionally considered to be generic.
One effect of this differentiation in the data can be seen in another interact-
ive prototype (see Figure 4), where it is possible to select a specific year in the
data with a slider, visualizing top topic terms related to persons who were
alive in the selected timespan (female-gendered topic terms are colored in or-
ange). The goal of this prototype was to explore the potential of overviews
to reveal aspects of the data that might, at a later point, act as entry points

  30
     Many German occupational titles are gendered and exist in a male and a female form,
as with the English ‘actor’ and ‘actress.’


                                         141
Figure 5: Experimental prototype that enables scrolling through time by means of a
UMAP projection of a small subset of our data, which arranges persons on the basis
of similarity across topic terms. Color and the sagittal (z) axis are used to encode
temporal closeness of a node in relation to a selected year (in this example, nodes
that lie inside the selected year 1869 are colored in yellow).

for specific search interests.
   Another experimental prototype (see Figure 5) of a small subset of our
data also focuses on topic terms and the temporality of the data; an aspect
frequently highlighted as important by some of our HNA experts in the
workshop. Here, the dimensionality reduction technique UMAP (McInnes
et al., 2020) was used to map persons with similar topic term relations in
close proximity to each other, effectively forming clusters for certain occu-
pational domains (e. g. authors). A timeline on the right displays the general
distribution of all nodes, while a list next to it contains all connected topic
terms, ordered by occurrence. Scrolling enables users to move through the
temporal dimension of the network, creating the impression of a time tunnel.
Nodes belonging to a selected year are displayed in yellow. Temporally close
nodes in the past appear more distant from the viewer and are marked in red
tones, while those that lie in the future are colored in green and blue tones,
and appear to be closer. One insight gained with the help of this prototype
was that our data model and processing approach once again needed to be
adjusted to make the data more accessible for use in visualizations, especially
with regard to temporal filtering.
   In some cases, as with Figure 6, we developed prototypes out of curiosity
for very specific research questions, for example: “Are network communit-
ies in the data subset mostly composed of contemporary nodes or do com-
munities stretch over multiple generations?” Here, the prototyping process
allowed us to test specific algorithm implementations and design strategies,


                                       142
Figure 6: Prototype overviews of node relations to reveal relations and community
clusters over time. First, a community algorithm is applied to the graph data. Then,
nodes are ordered and colored based on the community algorithm results, and are
placed on a timeline based on their dates of birth and death.

while at the same time being able to obtain deeper insights into the data.
   While our research is still in progress, the experiences mentioned above il-
lustrate the benefits of staying curious and open to experimentation through-
out the analysis and visualization process. Even though many ideas and con-
cepts are inspired by existing research in the field and, of course, the expertise
of our domain specialists, we see additional value in experimenting with the
data and generating a multitude of visual representations, even if this means
knowingly taking detours. It is precisely these more experimental pathways
that can lead to new ideas for tools, or generate fresh insights into the data.
The prototypes are non-incremental steps towards a final concept, iteratively
informed by feedback from our domain experts and other potential future
operators.

7    Conclusion and Future Projects
The converging of multiple heterogeneous data sources containing millions
of nodes and edges for a graph-based research infrastructure that enables his-
torical social network analysis creates a plethora of multidisciplinary chal-
lenges:
    • Difficulties associated with the merging of heterogeneous data sources
    • Performance of a system regarding the given scope and further scaling
      of the data
    • Creation of domain-customized interfaces, which are open and flexible
      with regard to unforeseen research questions


                                       143
   • Integration of domain knowledge into the process
   • Visualization of millions of data points to provide explorable access
     points in addition to search interfaces
   We address these challenges by focusing on tight, interdisciplinary collab-
oration and constant evaluation during the whole research and development
process. Our approach brings together historical network specialists, data
visualization researchers, data scientists, and experts on the evaluation of in-
formation infrastructure, an important example being the initial co-design
workshop with additional external HNA practitioners and other domain ex-
perts. Building on the contextual data gathered during the co-design work-
shop, we will continue to follow a human-centered approach towards data
modeling and visualization design.
  In our next step, we aim to take a closer look at the individual processes
behind historical network research in one-on-one interviews with domain
experts concerning their approaches to HNA research. Our plans for the
future also include the merging of multiple visualization concepts into one
prototype, which will join global overviews of our data with local views of
specific individual networks inside it. Furthermore, we will make use of our
data and our interface to provide exemplary use cases on a variety of historical
topics in collaboration with our HNA experts.
   Finally, we are experimenting with linked data as an alternative to Neo4j.
Here, the source data would be modeled in the form of subject–predicate–
object expressions and stored in GraphDB.31 This approach would simplify
the integration of Linked Open Data datasets (Wikidata, DBpedia,32 Geo-
Names,33 etc.), and would provide more sophisticated inference possibilities.
In preliminary comparisons of the two approaches, GraphDB also shows bet-
ter performance, but employing it would mean that the source data must be
remodeled in order to display relation features such as relation type, relation
source type, and temporal validity.
   In this paper, we have described the process of examining the potential
of remodeling and merging (bibliographic) big data from cultural heritage
institutions into one single gathering point optimized for the use in histor-
ical network analysis. It is our hope that by providing insights into emerging
challenges and outlining possible solutions, we can encourage additional re-
search and scholarly exchange in and with similar HNA-related projects.


  31
     http://graphdb.ontotext.com
  32
     https://wiki.dbpedia.org
  33
     https://www.geonames.org


                                     144
Acknowledgements
We would like to thank the participants of our co-design workshop and our
project partners Heiner Fangerau, Katrin Getschmann, Thorsten Halling,
Hans-Jörg Lieder, Gerhard Müller, Clemens Neudecker, David Zellhöfer,
and Josefine Zinck. This research is part of the research project SoNAR
(IDH) and is funded by the DFG – German Research Foundation (project
no. 414792379).

References
Ahn, J.-w., Plaisant, C., and Shneiderman, B. (2013). A Task Taxonomy
  for Network Evolution Analysis. IEEE transactions on visualization and
  computer graphics, 20(3):365–376, DOI: 10.1109/TVCG.2013.238.

Allison-Bunnell, J. (2016). Review of Encoded Archival Description Tag
  Library – Version EAD3. Journal of Western Archives, 7(1):1–3, DOI:
  10.26077/af62-2a86.

Auer, S. and Mann, S. (2019). Towards an Open Research Know-
  ledge Graph.            The Serials Librarian, 76(1-4):35–41, DOI:
  10.1080/0361526X.2019.1540272.

Behrisch, M., Bach, B., Henry Riche, N., et al. (2016). Matrix Reordering
  Methods for Table and Network Visualization. In Computer Graphics
  Forum, volume 35, pages 693–716. Wiley, DOI: 10.1111/cgf.12935.

Bostock, M., Ogievetsky, V., and Heer, J. (2011). D3 Data-Driven Doc-
  uments. IEEE transactions on visualization and computer graphics,
  17(12):2301–2309, DOI: 10.1109/TVCG.2011.185.

Cantara, L. (2005). METS: The Metadata Encoding and Transmission
  Standard. Cataloging & classiﬁcation quarterly, 40(3-4):237–253.

Chang, D., Ge, Y., Song, S., Coleman, N., Christensen, J., and Heer, J.
  (2009). Visualizing the Republic of Letters. https://web.stanford.edu/group/
  toolingup/rplviz/papers/Vis_RofL_2009.

Chen, K.-I., Dörk, M., and Dade-Robertson, M. (2014). Exploring the
  Promises and Potentials of Visual Archive Interfaces. In iConference 2014
  Proceedings, pages 735 – 741. DOI: 10.9776/14348.

Drucker, J. (2011). Humanities Approaches to Graphical Display. DHQ:
  Digital Humanities Quarterly, 5(1):1–21.


                                    145
Efer, T. (2016). Graphdatenbanken für die textorientierten e-Humanities.
  PhD thesis, Universität Leipzig, https://nbn-resolving.org/urn:nbn:de:bsz:
  15-qucosa-219122.

Fekete, J. and Plaisant, C. (2002). Interactive Information Visualization of
  a Million Items. In IEEE Symposium on Information Visualization, IN-
  FOVIS 2002., pages 117–124.

Gibson, H., Faith, J., and Vickers, P. (2012). A Survey of Two-Dimensional
  Graph Layout Techniques for Information Visualisation. Information
  Visualization, 12(3-4):324–357.

Glaser, B. and Strauss, A. (1967). The Discovery of Grounded Theory.
  Weidenfield & Nicolson, London.

Henry, N. and Fekete, J.-D. (2006). Matrixexplorer: a Dual-Representation
 System to Explore Social Networks. IEEE Transactions on Visualization
 and Computer Graphics, 12(5):677–684.

Hinrichs, U., Forlini, S., and Moynihan, B. (2019). In Defense of Sand-
  castles: Research Thinking through Visualization in Digital Humanities.
  Digital Scholarship in the Humanities, 34(1):i80–i99.

Isenberg, P., Zuk, T., Collins, C., and Carpendale, S. (2008). Grounded
   Evaluation of Information Visualizations. In Proceedings of the 2008
   Workshop on BEyond Time and Errors: Novel EvaLuation Methods for
   Information Visualization, pages 1–8. DOI: 10.1145/1377966.1377974.

Jacomy, M., Venturini, T., Heymann, S., and Bastian, M. (2014). ForceAt-
   las2, a Continuous Graph Layout Algorithm for Handy Network Visu-
   alization Designed for the Gephi Software. PloS one, 9(6):e98679, DOI:
   10.1371/journal.pone.0098679.

Jansen, D. and Wald, A. (2007). Netzwerktheorien. In Benz, A.,
  Lütz, S., Schimank, U., and Simonis, G., editors, Handbuch Gov-
  ernance: Theoretische Grundlagen und empirische Anwendungsfelder,
  pages 188–199. VS Verlag für Sozialwissenschaften, Wiesbaden, DOI:
  10.1007/978-3-531-90407-8_14.

Kerracher, N., Kennedy, J., and Chalmers, K. (2015). A Task Taxonomy for
  Temporal Graph Visualisation. IEEE transactions on visualization and
  computer graphics, 21(10):1160–1172.


                                   146
Kruk, S. R., Synak, M., and Zimmermann, K. (2005). MarcOnt – Integra-
  tion Ontology for Bibliographic Description Formats. In International
  Conference on Dublin Core and Metadata Applications, pages 231–234.

Lee, B., Plaisant, C., Parr, C. S., Fekete, J.-D., et al. (2006). Task Taxonomy
  for Graph Visualization. In Proceedings of the 2006 Workshop on BEyond
  Time and Errors: Novel Evaluation Methods for Information Visualiza-
  tion, pages 1–5. DOI: 10.1145/1168149.1168168.

Matschinegg, I. and Nicka, I. (2018).    REALonline Enhanced. Die
 neuen Funktionalitäten und Features der Forschungsbilddatenbank des
 IMAREAL. MEMO, 2:10–32, DOI: 10.25536/20180202.

McInnes, L., Healy, J., and Melville, J. (2020). UMAP: Uniform Mani-
 fold Approximation and Projection for Dimension Reduction. arXiv,
 1802.03426, https://arxiv.org/abs/1802.03426.

Molina León, G. and Breiter, A. (2020). Co-creating Visualizations: A First
 Evaluation with Social Science Researchers. Computer Graphics Forum,
 39(3):291–302, DOI: 10.1111/cgf.13981.

Munzner, T. (2009). A Nested Model for Visualization Design and Val-
 idation. IEEE Transactions on Visualization and Computer Graphics,
 15(6):921–928, DOI: 10.1109/TVCG.2009.111.

Novak, J., Micheel, I., Melenhorst, M., Wieneke, L., et al. (2014). His-
 toGraph – A Visualization Tool for Collaborative Analysis of Net-
 works from Historical Social Multimedia Collections. In 18th Inter-
 national Conference on Information Visualisation, pages 241–250. DOI:
 10.1109/IV.2014.47.

Pienta, R., Abello, J., Kahng, M., and Chau, D. H. (2015). Scalable Graph
   Exploration and Visualization: Sensemaking Challenges and Opportunit-
   ies. In 2015 International Conference on Big Data and Smart Computing
  (BIGCOMP), pages 271–278. DOI: 10.1109/35021BIGCOMP.2015.7072812.

Pienta, R., Hohman, F., Tamersoy, A., Endert, A., et al. (2017). Visual
   Graph Query Construction and Refinement. In SIGMOD ’17: Proceed-
   ings of the 2017 ACM International Conference on Management of Data,
   pages 1587–1590. DOI: 10.1145/3035918.3056418.

Purschwitz, A. (2018). Netzwerke des Wissens – Thematische und person-
  elle Relationen innerhalb der halleschen Zeitungen und Zeitschriften der


                                     147
  Aufklärungsepoche (1688–1818). Journal of Historical Network Research,
  2(1):109–142, DOI: 10.25517/jhnr.v2i1.47.

Rehm, G., Berger, M., Elsholz, E., Hegele, S., et al. (2020). European Lan-
  guage Grid: An Overview. In Calzolari, N., Béchet, F., Blache, P., Cieri,
  C., et al., editors, Proceedings of the 12th Language Resources and Evalu-
  ation Conference (LREC 2020), pages 3359–3373. https://www.aclweb.org/
  anthology/2020.lrec-1.413/.

Shneiderman, B. (2008). Extreme Visualization: Squeezing a Billion Re-
  cords into a Million Pixels. In SIGMOD ’08: Proceedings of the 2008
  ACM SIGMOD international conference on Management of data, pages
  3–12. DOI: 10.1145/1376616.1376618.

Thudt, A., Hinrichs, U., and Carpendale, S. (2012). The Bohemian
  Bookshelf: Supporting Serendipitous Book Discoveries through Inform-
  ation Visualization. In CHI ’12: Proceedings of the SIGCHI Confer-
  ence on Human Factors in Computing Systems, pages 1461–1470. DOI:
  10.1145/2207676.2208607.

van Ham, F. and Perer, A. (2009). ”Search, Show Context, Expand on
  Demand”: Supporting Large Graph Exploration with Degree-of-Interest.
  IEEE Transactions on Visualization and Computer Graphics, 15(6):953–
  960, DOI: 10.1109/TVCG.2009.108.

von Landesberger, T., Kuijper, A., Schreck, T., Kohlhammer, J., et al.
  (2011). Visual Analysis of Large Graphs: State-of-the-Art and Future Re-
  search Challenges. Computer Graphics Forum, 30(6):1719–1749, DOI:
  10.1111/j.1467-8659.2011.01898.x.

Vorndran, A. (2018). Hervorholen, was in unseren Daten steckt! Mehrwerte
  durch Analysen großer Bibliotheksdatenbestände. o-bib. Das offene Bib-
  liotheksjournal, 5(4):166–180, DOI: 10.5282/o-bib/2018H4S166-180.

Warren, C. N., Shore, D., Otis, J., Wang, L., Finegold, M., and Shalizi, C.
 (2016). Six Degrees of Francis Bacon: A Statistical Method for Recon-
 structing Large Historical Social Networks. DHQ: Digital Humanities
 Quarterly, 10(3), DOI: 10.17613/M6B020.

Wintergrün, D. (2019).         Netzwerkanalysen und semantische Daten-
 modellierung als heuristische Instrumente für die historische Forschung.
 PhD thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg, https:
 //nbn-resolving.org/urn:nbn:de:bvb:29-opus4-111899.


                                   148
Zimmerman, J., Forlizzi, J., and Evenson, S. (2007). CHI ’07: Proceedings
  of the SIGCHI Conference on Human Factors in Computing Systems.
  In Proceedings of the SIGCHI conference on Human factors in computing
  systems, pages 493–502. DOI: 10.1145/1240624.1240704.


                                  149

</pre>