=Paper=
{{Paper
|id=Vol-2717/paper07
|storemode=property
|title=The Helsinki Digital Humanities Hackathon
|pdfUrl=https://ceur-ws.org/Vol-2717/paper07.pdf
|volume=Vol-2717
|authors=Ruben Ros,Sarah Oberbichler
|dblpUrl=https://dblp.org/rec/conf/dhn/RosO20
}}
==The Helsinki Digital Humanities Hackathon==
<pdf width="1500px">https://ceur-ws.org/Vol-2717/paper07.pdf</pdf>
<pre>
                    The Helsinki Digital Humanities Hackathon:
                  Two Perspectives on Multidisciplinary Historical
                   Newspapers Research in a Hackathon Context.

                       Ruben Ros1[0000−0002−5303−2861] and Sarah Oberbichler2[0000−0002−1031−2759]
                                        1
                                            Universiteit Utrecht, Utrecht, The Netherlands
                                                         ruben@rubenros.nl
                                            2
                                              Universität Innsbruck, Innsbruck, Austria
                                                 sarah.oberbichler@uibk.ac.at


                            Abstract. This paper describes the 2019 edition of the Helsinki Digital
                            Humanities Hackathon from the perspective of two of its participants. As
                            (digital) historians they were part of the group that investigated the his-
                            tory of medical advertisements in British nineteenth-century newspapers.
                            The paper describes the research process, as well as the data and meth-
                            ods used during the research. The paper also considers the Hackathon as
                            a laboratory for Digital Humanities research and reflects on the nature
                            of the collaboration as experienced during the Hackathon. As such, the
                            paper describes the challenges of multidisciplinary research and identifies
                            the factors that hinder and foster collaboration in a Digital Humanities
                            context.

                            Keywords: Hackathon · Multidiscilinary Collaboration · Historical News-
                            papers · Text Mining.


                  1        Introduction
                  In this Twin Talk we discuss concept of the Helsinki Digital Humanities Hackathon
                  (DHH) and the research done during the Hackathon from the perspective of two
                  of its participants (two historians familiar with computational methods). We aim
                  to show how the Hackathon concept brings together researchers and offers an
                  excellent opportunity for exploring the potential of multi- and interdisciplinary
                  Digital Humanities research. We report on the research process, including the
                  methods we used, the questions we answered and challenges we faced. Hereby,
                  this paper also reflects on the nature of collaboration in Digital Humanities re-
                  search. We will argue that specifically shared vocabularies, ”bridge-building”
                  capacities of individual researchers and leadership make a difference in promot-
                  ing and facilitating multidisciplinary research.3
                      The Helsinki DHH is organized yearly by the Helsinki Centre for Digital
                  Humanities (HELDIG)[1]. Five Hackathons have been organized so far. The
                   3
                       Copyright © 2020 for this paper by its authors. Use permitted under Creative
                       Commons License Attribution 4.0 International (CC BY 4.0).
                   4
                       This work has been supported by the European Union’s Horizon 2020 research and
                       innovation programme under grant 770299 (NewsEye).


Twin Talks 2 and 3, 2020            Understanding and Facilitating Collaboration in Digital Humanities    66/143
                  2          R. Ros & S. Oberbichler

                  Helsinki DHH offers a chance to experience an interdisciplinary research project
                  from start to finish within the span of 1.5 weeks. For researchers and students
                  with a computer science or data science background, the Hackathon gives the
                  opportunity to test abstract knowledge against complex historical problems. For
                  people from the humanities and social sciences, the DHH shows the potential of
                  computational methods and multidisciplinary research.
                      Every edition, around forty participants are divided into four groups. The
                  group themes are established in advance and the participants can indicate their
                  prefered group. This year, the group titles were: Newspapers & Capitalism, Genre
                  and Style in Early Modern Publications, The Many Voices of the European Par-
                  liament and Brexit in Transnational Social Media.
                      In earlier years the Hackathon was visited predominantly by Finnish and
                  European students and researchers. Recently, the Hackathon has widened its
                  scope with the help of Common Language Resources and Technology Infras-
                  tructure (CLARIN) and Digital Research Infrastructure for the Arts and Hu-
                  manities (DARIAH), and the Hackathon participants now include students and
                  researchers from all over the world.
                      The team whose work is reported in this paper focused on the history of
                  nineteenth century British newspapers and consisted of historians (7), linguists
                  (2), computer scientists (3), data scientists (1) and literary scholars (1).
                      In the remainder of this paper we describe the research in the Newspapers
                  & Capitalism group. We discuss the research topics and questions, as well as
                  the data and methods used in the research. Lastly, we reflect on the Hackathon
                  as a laboratory for DH-research and we identify factors that challenge and/or
                  promote multidisciplinary collaboration.


                  2        Research Topics and Questions

                  During the first days the group decided to develop several lines of research based
                  on the personal interests, individual expertise and academic background of the
                  group members. To develop research questions, the whole team got together
                  and collected ideas. This process was supported by literature research and close
                  reading of the newspapers. The group decided to focus on the topic of nineteenth-
                  century medical advertisements and the language of persuasion employed in
                  those advertisements.
                      Throughout the century, medical advertisements occupied an important place
                  in periodical culture [2][3]. Pills, lotions and ointments were regularly promoted,
                  not seldom by so-called “quacks”: charlatans who promised to cure every disease
                  imaginable. The link between sellers of patient medicines and the publishing in-
                  dustry was intimate. Without the steady demand for advertisement space many
                  newspapers would have gone bankrupt, and without constant advertising, the
                  patient medicine industry would not have been able to sustain itself [4].
                      In the early nineteenth century, the well-organized patent medicines indus-
                  try had replaced the small-scale quacks, who lacked the skills and resources to


Twin Talks 2 and 3, 2020           Understanding and Facilitating Collaboration in Digital Humanities   67/143
                                                        The Helsinki Digital Humanities Hackathon      3

                  take part in a market that became increasingly national in scope. The litera-
                  ture identifies four categories of nineteenth-century advertisers: market leaders,
                  tradesmen, medical practitioners, elite and locals [5]. During the Victorian pe-
                  riod there were four leading pill-makers who together represented most of the
                  newspaper advertising; James Morison, the creator of Universal Pills, a ‘vener-
                  able’ Salopian called Thomas Parr who sold “Parr’s Life Pills” to increase the
                  beauty of women, Thomas Holloway who is also identified as the first world-wide
                  advertiser, and Thomas Beecham who invented “Beecham’s Pills” and claimed
                  to cure “bilious and nervous disorders” [6]. During the nineteenth century, quack-
                  ery slowly disappeared from the newspaper pages as scientific insights reached
                  the broader public and legal action against health fraud was organized [7][8].
                  This transformation of medical advertising thus ties in with broader questions
                  about the rise of consumer society, the professionalization of medicine and the
                  newspaper industry itself.
                      For the group, the literature also provoked several questions, such as: what
                  diseases and cures were advertised, whether the advertisers distinguished be-
                  tween different (gendered) publics, and how the cures were rhetorically marketed
                  in the advertisements. These questions formed the basis for the Hackathon re-
                  search in the Newspapers & Capitalism group. They were chosen because they
                  invited for the use of computational methods, but also benefited from ‘tradi-
                  tional’ close reading.
                      Later in the research process, several other group members explored another
                  significant group of advertisements: job ads. Using the HISCO classification of
                  jobs, they were able to shed light on the long-term evolution of jobs that were
                  marketed in advertisements [9].


                  3        Data

                  The project used the British Library Nineteenth Century Newspapers collection,
                  provided by Gale Cengage. The data consist of 304 unique newspapers, 270.744
                  issues, 26.295.841 articles and 1.560.916 advertisements. The data was indexed
                  by the HELDIG team and made accessible through an API. In light of the
                  limited amount of time, the group chose to focus on a specific newspaper:“The
                  Morning Post”, a British newspaper that was successfully established in 1772 and
                  existed until 1937 [10]. The Morning Post advertisements, that ran throughout
                  the century, were used for the questions on diseases/cures, persuasion and gender.
                  The statistical analysis was applied to all newspapers in the dataset.
                      The newspapers were accessible in machine-readable form and through an
                  API developed by the HELDIG researchers. Since the quality of the photographed
                  pages varies, the quality of the so-called ”Optical Character Recognition” (OCR)
                  as provided by Gale does so as well. Especially in the first two decades of the
                  nineteenth century, the OCR confidence levels (included in the metadata) proved
                  to be relatively low (Figure 1). Another pressing problem was the article seg-
                  mentation. Because advertisements appear in all forms and sizes, it is hard to
                  draw boundaries between advertisements and articles, and between individual


Twin Talks 2 and 3, 2020          Understanding and Facilitating Collaboration in Digital Humanities       68/143
                  4         R. Ros & S. Oberbichler

                  advertisements. In the process of segmenting newspaper pages, advertisements
                  are often grouped together, resulting in low-quality segmentation. In order to
                  reduce the effect of lacking segmentation and low-quality OCR the analysis used
                  issue-based frequency, meaning that for example the frequency of a word was
                  not normalized by the number of advertisements in a newspaper but by the
                  number of newspaper issues. In response to lower OCR confidence levels in the
                  first decade of the century, keyword frequency was broadened by taking into
                  account potential “alternatives” that were automatically generated with word
                  embeddings (“fchool”, “schoof, when searching for “school”) [11].


                  Fig. 1. Mean OCR Confidence of News articles (Orange) and Advertisements (Green)
                  in British newspapers.


                  4        Methods and Research Process
                  After defining the research questions and research topics, the group divided
                  itself in subgroups that would work on specific tasks. First, two computer sci-
                  entists set out to gather metadata statistics. Three aspects were particularly
                  important: the ‘quality’ of the data as measured with OCR (Optical Charac-
                  ter Recognition) confidence levels, the changing composition of newspapers as
                  measured through the share of categories such as “Advertisements and Notices”
                  or “Business News”, and, lastly, the original locations of the newspapers. The
                  statistical insights were presented in Tableau, an online interface that allowed
                  the group to explore the results.
                      Informed by the features of the data as presented in Tableau, a group of
                  historians and data scientists set out to investigate advertisements by using fre-
                  quency analysis. First, they inquired into the relation between advertisements
                  and gender by creating subsets of male- and female-oriented ads. The histori-
                  ans manually composed a vocabulary of gender-related keywords (e.g. “male”,
                  “female”, “boy”, “girl”) and a vocabulary of cures (e.g. “pills”, “ointments”,


Twin Talks 2 and 3, 2020          Understanding and Facilitating Collaboration in Digital Humanities   69/143
                                                      The Helsinki Digital Humanities Hackathon      5

                  “balsam”). These vocabularies were used to extract a subset of medical adver-
                  tisements from the digitized Morning Post and to extract ads that mentioned
                  gender-related terms. Ads that did not mention any of the vocabulary terms were
                  excluded. The subset that emerged from this pipeline showed how the ratio of
                  male- and female-oriented ads changed significantly during the century (Figure
                  2).
                      The group then investigated the relation between gender and medicine by
                  extracting cures and diseases from the ads. To do so, the group employed word
                  embeddings, a method used to embed words in a ’vector space’ that is subse-
                  quently used to inquire into semantic relations between words [12]. Embeddings
                  were trained on samples from the Morning Post for every decade by using the
                  popular Python library word2vec [13]. This allowed the group to filter words
                  that were semantically close to words such as “disease”, “cough” and “pain”. In
                  this way, a typology of diseases could be created. A similar method was used to
                  extract the advertised cures.
                      By combining the extracted information on cures and diseases with the vo-
                  cabularies of gender-related keywords, the ads could be classified into groups
                  of male- and female-oriented ads. This allowed us to inquire into two differ-
                  ent aspects of nineteenth-century medical advertising. First, we could track the
                  changes in the target audience. Especially later in the century, female-oriented
                  ads became more present in the corpus. Also, we could investigate whether differ-
                  ent cures and diseases appeared in advertisement aimed at different publics. By
                  looking separately at male- and female-oriented ads, we were able to find differ-
                  ences between the targeted diseases. Mental illnesses, for example, were slightly
                  more associated with women. We related this trend to existing literature on the
                  history of female “hysteria” [14].
                      These data-driven insights were verified by historians in the group who sys-
                  tematically collected samples of newspapers through the Gale Search Engine
                  and its keyword search interface [15]. For each decade, twenty newspapers were
                  subjected to close reading. This allowed the group not only to complement the
                  frequency analysis, but also pointed at potential problems, such as ads that
                  were not specifically aimed at men or women, or ads that consisted primarily of
                  images.
                      A similar iterative approach was taken to the strand of research that looked
                  into the language of persuasion. The literature on the language of early-modern
                  advertising provided several insights into the linguistic features of advertising
                  in the eighteenth century [16,17,18]. Several linguists in the group investigated
                  these features in the nineteenth century, hereby focusing on modal verbs (w.g.
                  ”wishes highly to recommend”) and the use of repetition to draw attention
                  and testimony. During the research, it appeared that many of these fine-grained
                  rhetorical tropes were hard to quantify as a result of lacking article segmentation
                  and OCR-errors. For this reasons, the digital methods in this line of research
                  were restricted to the relative frequency of specific parts-of-speech such as modal
                  verbs and adjectives.


Twin Talks 2 and 3, 2020        Understanding and Facilitating Collaboration in Digital Humanities       70/143
                  6          R. Ros & S. Oberbichler

                      The last subquestion, on the changing prominence of occupations in job ad-
                  vertisements was investigated by using a list of English-language job titles and
                  comparing those to the full-text advertisements. Using the associated classifica-
                  tion codes as a way to cluster specific occupations together, we were able to shed
                  a light on the changing job market: low-skilled and production-related jobs rose
                  in prominence, a pattern that was also visible in the category of professional and
                  technical workers.


                  5        Results

                  The Hackathon research yielded surprising outcomes given the limited amount of
                  time. With regard to the medical advertisements, the changing proportion and
                  character of male- and female-oriented advertisements was a striking finding.
                  In the first decades of the nineteenth century, ads were mainly targeted at a
                  male audience. Gradually, female-oriented ads advanced. Especially in the last
                  two decades of the century, the proportion of those advertisements increased
                  significantly.


                  Fig. 2. The share of medical advertisements in the Morning Post containing male- and
                  female-related terms, 1800-1900.


                      The advertisements also differed in terms of their contents. Although dis-
                  tant reading techniques were not able to fully capture the complexity of this
                  issue, close reading revealed the gendered nature of specific illnesses and cures.
                  The investigation of persuasiveness in newspaper language similarly produced
                  surprising results. Computational analysis was not very important for the out-
                  comes, but the results invite for further analysis into for example modal verbs


Twin Talks 2 and 3, 2020           Understanding and Facilitating Collaboration in Digital Humanities    71/143
                                                       The Helsinki Digital Humanities Hackathon      7

                  and the repetition of specific product names. Lastly, the research into job ad-
                  vertisement made good use of the HISCO-classification and was able to gain an
                  insight in nineteenth-century transformations on the job market with relatively
                  simple methods.


                  6        Experiences

                  The Hackathon was first and foremost a learning experience. Humanists learnt
                  new skills and methods to do research computationally and computer/data sci-
                  entists were confronted with the complexities of messy historical data. Through-
                  out the week, the group formed a comprehensive understanding what kind of
                  multidisciplinary research could be done, and how such research should be de-
                  signed and executed. The group also learned about the challenges of collabora-
                  tion within an interdisciplinary projects; How can I communicate what I need
                  and what I want and how can I explain digital tools and methods to humanities
                  researchers?
                      If we look back at the Hackathon experience from the perspective of broader
                  questions about collaboration in Digital Humanities research one aspect stands
                  out. During the Hackathon, crossing disciplinary boundaries remained a constant
                  effort for all the researchers. Despite the favourable “geography of practice” and
                  the unique circumstances offered by the Hackathon format, practically all the
                  group members frequently ‘drifted’ back to methods and traditions that were
                  familiar to them [19]. Computer scientists and data scientists reverted to data
                  curation and visualization while for the humanists close reading the newspapers
                  and surveying the literature was something they could and would do easily. The
                  short time span pressured the group members to come up with clear results, and
                  resorting to familiar research practices was considered to be an easy way out.
                  For example, if the data harmonization or model training would take longer than
                  planned, going back to close reading was a tempting solution. The design of the
                  research questions to a large extent facilitated this reflex to return to familiar
                  territory. Because the questions posed could be answered by both computational
                  methods and close reading, splitting those tasks was tempting.
                      This dynamic of “disciplinary isolationism” surfaced frequently. Three fac-
                  tors can be identified as having a positive effect on promoting collaboration and
                  preventing this reflex of “methodological isolationism”. First, a shared vocab-
                  ulary: throughout the Hackathon, researchers visibly (and audibly) integrated
                  their methodological vocabularies. Initially, terms such as “modelling” and “data
                  cleaning” put the humanists at a distance from the computer scientists’ prac-
                  tices. Similarly, the latter group had a completely different understanding of the
                  language of hermeneutics employed by historians and linguists. Towards the end
                  of the Hackathon, however, terms such as “harmonization”, “close reading” and
                  “language models” were used by historians and computer scientists alike.
                      Second, a group of what could be called “bridge-builders”, “intermediaries”
                  or “translators” proved essential in facilitating and actually doing the research
                  [20][21][22]. A small number historians in the group who had worked in multi-


Twin Talks 2 and 3, 2020         Understanding and Facilitating Collaboration in Digital Humanities       72/143
                  8          R. Ros & S. Oberbichler

                  disciplinary research groups before or were proficient in coding and data man-
                  agement themselves formed important links between the disciplines. After two
                  to three days, a workflow emerged that evolved around these (coding) bridge-
                  builders who had access to the data and were able to translate historical ques-
                  tions into for example statistical tests. Whereas this inter-disciplinary group
                  members were important, the Hackathon also showed how vital steps in the
                  analysis could be easily outsourced to this limited number of “specialists”, ul-
                  timately preventing integration and collaboration. For example the mass-scale
                  extraction of diseases and cures from the digitized ads was an essential task, but
                  was done by programming historians.
                      Lastly, leadership was an vital factor in promoting collaboration. The group
                  leaders were especially important in the first stage of designing the research plan.
                  Because they had experience with DH-research, they were able to indicate the
                  limits of what was possible. During the research itself, the group leaders provided
                  focus. Because the Hackathon provided the participants with an full arsenal of
                  tools and method, it proved important to decide on what was not to be done.
                  Additionally, the group leaders were also the key links between the computer
                  scientists and the humanists. They differed from the earlier mentioned category
                  of ”bridge-builders” in the sense that they remained at a distance, not doing the
                  research themselves.


                  7        Conclusion
                  The Helsinki Digital Humanities Hackathon is a fascinating laboratory for Digital
                  Humanities research. Participants from a diverse range of backgrounds engage
                  in cutting-edge research. Professors and bachelor students, humanities schol-
                  ars and computer scientists worked together on the same research question and
                  learned from each other. The research done in the Newspapers & Capitalism
                  group crossed borders but the group also encountered challenges. The mixing
                  of the different disciplines was slow and it would have taken even more time
                  to establish collaboration that truly integrates existing traditions and methods.
                  Given the impressive results and the increase in mutual understanding, how-
                  ever, the Hackathon gives cause for optimism when it comes to the future of
                  multidisciplinary Digital Humanities research.


                  References
                  1. Helsinki        Centre        for      Digital         Humanities,        HELDIG,
                     https://www.helsinki.fi/en/helsinki-centre-for-digital-humanities. Last accessed 20
                     Feb. 2020.
                  2. Hindley, D., & Hindley, G., 1972. Advertising in Victorian England, 1837-1901.
                     Wayland, London.
                  3. Church, R., 2000. Advertising consumer goods in nineteenth-century Britain: rein-
                     terpretations. Economic History Review 53.4, 621-645.
                  4. Carter, K. C., 1993. The concept of quackery in early nineteenth century British
                     medical periodicals. Journal of Medical Humanities 14.2, 89-97.


Twin Talks 2 and 3, 2020           Understanding and Facilitating Collaboration in Digital Humanities      73/143
                                                       The Helsinki Digital Humanities Hackathon          9

                  5. Mackintosh, A., 2017. The Patent Medicines Industry in Late Georgian England:
                     A Respectable Alternative to Both Regular Medicine and Irregular Practice. Social
                     History of Medicine 30. 1, 22–47.
                  6. Porter, R., 1989. Health for sale: quackery in England, 1660-1850. Manchester Uni-
                     versity Press, Manchester.
                  7. Barker, H., 2009. Medical advertising and trust in late Georgian England. Urban
                     History, 36.3, 379-398.
                  8. Sackville Turner, E., 1965. The Shocking History of Advertising, Great Britain,
                     61-64. Ballantine, New York.
                  9. Van Leeuwen, M. H., Maas, I., & Miles, A., 2002. HISCO: Historical international
                     standard classification of occupations. Leuven, Leuven University Press.
                  10. Hindle, W. H., 1937. The Morning post, 1772-1937: portrait of a newspaper. Rout-
                     legde, London.
                  11. Hládek, D., Staš, J., Ondáš, S., Juhár J., & Kovács, L., 2017. Learning string
                     distance with smoothing for OCR spelling correction. Multimedia Tools Application
                     76, 24549–24567. https://doi.org/https://doi.org/10.1007/s11042-016-4185-5
                  12. Mikolov, T., Chen, K., Corrado, G. and Dean, J.,2013. Efficient estimation of word
                     representations in vector space. arXiv:1301.3781v3 [cs.CL].
                  13. Kutuzov, A., Øvrelid, L., Szymanski, T. and Velldal, E., 2018. Diachronic word
                     embeddings and semantic shifts: a survey. arXiv:1301.3781v3 [cs.CL].
                  14. Arnaud, S., 2015. On Hysteria: The Invention of a Medical Category between 1670
                     and 1820. The University of Chicago Press, Chicago.
                  15. Gale Search Engine, http://gdc.galegroup.com. Last accessed 17 Feb 2020.
                  16. Leech, G. N., 1966. English in advertising: A linguistic study of advertising in
                     Great Britain. Longmans, London.
                  17. Gotti, M., 2005. Advertising discourse in eighteenth-century English newspapers.
                     In: Skaffari, J., Peikola, M., Carroll, R., Hiltunen, R. and Wårvik, B., 2005. Opening
                     Windows on Texts and Discourses of the Past, pp. 23-38. John Benjamins, Amster-
                     dam.
                  18. Percy, C., 2012. Early Advertising and Newspapers as Sources of Sociolinguistic
                     Investigation. In: Hernández-Campoy, J.M. & Conde-Silvestre, J.C., The Handbook
                     on Historical Sociolinguistics, pp. 191-211. Blackwell, Hoboken.
                  19. Kemman, M., 2019, Boundary Practices of Digital Humanities Collaborations. Dig-
                     ital Humanities Benelux 2019, 1, 1-24.
                  20. Edmond, J., 2005. The role of the professional intermediary in expanding the hu-
                     manities computing base. Literary and Linguistic Computing 20.3, 367-380.
                  21. Siemens, L., Cunningham, R., Duff, W., and Warwick, C., 2011. A tale of two cities:
                     implications of the similarities and differences in collaborative approaches within
                     the digital libraries and digital humanities communities. Literary and Linguistic
                     Computing 26.3, 335-348.
                  22. Edmond, J., Collaboration and Infrastructure. In: Schreibman, S., Siemens, R. and
                     Unsworth, J., 2015. A New Companion to the Digital Humanities. Wiley, Hoboken.


Twin Talks 2 and 3, 2020         Understanding and Facilitating Collaboration in Digital Humanities            74/143

</pre>