Peeking Inside the DH Toolbox – Detection and Classification of Software Tools in DH Publications

Nicolas Ruth, Andreas Niekler and Manuel Burghardt
Computational Humanities Group, Institute for Computer Science, Leipzig University – Augustusplatz 10, 04109 Leipzig

Abstract
Digital tools have played an important role in Digital Humanities (DH) since its beginnings. Accordingly, a lot of research has been dedicated to the documentation of tools as well as to the analysis of their impact from an epistemological perspective. In this paper we propose a binary and a multi-class classification approach to detect and classify tools. The approach builds on state-of-the-art neural language models. We test our model on two different corpora and report the results for different parameter configurations in two consecutive experiments. In the end, we demonstrate how the models can be used for actual tool detection and tool classification tasks in a large corpus of DH journals.

Keywords
tool studies, software entity recognition, neural language models

CHR 2022: Computational Humanities Research Conference, December 12 – 14, 2022, Antwerp, Belgium
nr78riku@studserv.uni-leipzig.de (N. Ruth); andreas.niekler@uni-leipzig.de (A. Niekler); burghardt@informatik.uni-leipzig.de (M. Burghardt)
https://ch.uni-leipzig.de/ (N. Ruth, A. Niekler, M. Burghardt)
ORCID: 0000-0002-1126-6280 (N. Ruth); 0000-0003-1354-9089 (A. Niekler); 0000-0003-1354-9089 (M. Burghardt)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction: Tools in DH

In their "notes toward an epistemology of building in the digital humanities", Ramsay & Rockwell [23] stress the importance of prototyping and tool building in the Digital Humanities (DH) as a scholarly activity. Building on Vannevar Bush's [6] early notion of machines as "extensions of the human mind" and J. C. R. Licklider's [17] follow-up concept of a "man-computer symbiosis", the epistemological implications of digital tools have been discussed time and again [23, 15, 7, 5]. Besides this epistemological dimension, [13] note the importance of DH historiography. They argue that a history of the DH cannot be written without documenting its respective tools in a sustainable way. However, this has proven to be quite a difficult undertaking, as tools are oftentimes rather short-lived and at the same time highly specific to particular areas of application or even project contexts. Furthermore, tools and software packages are oftentimes not documented at all, because the humanities still have a strong focus on bibliographic scholarship and fail to capture the work of computing humanists [13]. As a result, tools are often not adequately cited in publications [14], which further complicates sustainable documentation.

To address these problems, a number of resources have been developed to document tools in DH. These include tutorials such as the renowned Programming Historian site1, but also tool directories, such as the Digital Methods Initiative2 or the SSH Open Marketplace3. The oldest and most extensive tool directory is TAPoR (Text Analysis Portal for Research).
TAPoR started out as a database for collecting primarily text analysis tools and has recently integrated another tool directory named DiRT (Digital Research Tools), which also contains tools that go beyond the text modality [13, 10]. Up to now, TAPoR 3.0 contains approx. 1,600 digital tools. Some of these tools also carry information about their type and function, which is largely based on the TaDiRAH (Taxonomy of Digital Research Activities in the Humanities) categories [2]. While TAPoR 3.0 is already quite extensive, it is still far from being complete. This is also reflected by a recent call for contributions, in which DH scholars are asked to add more tools to TAPoR 3.0.4 Although a complete list of all digital tools ever developed in the context of DH will probably never be produced, we believe that many of the current blind spots can be covered by an automated approach that will eventually allow us to paint a more coherent picture of DH tool historiography.

In this paper we propose a binary and a multi-class classification approach based on neural language models to detect and classify tools. For the detection task, the binary classifier decides whether a sequence contains a tool or not, while for the multi-class task the model tries to assign one of seven TaDiRAH categories to a sequence that contains a tool. Taking up the perspective of scientometrics, we believe that searching for tools in existing DH publications is a good approximation of the most important tools actually used in DH research. While TAPoR may be categorized as a more qualitative, crowdsourcing-like approach, where DH scholars actively submit their tool candidates, our approach takes a more empirical, corpus-based route. In the current paper we present experiments that are based on a corpus of three established DH journals. However, we plan to use the approach on further DH publications such as DH abstracts or journal articles from neighboring disciplines such as information science or computational linguistics [20, 4].

2. Related Work: Tool Detection and Software Entity Recognition

For the identification of software tools in academic publications, two main approaches can be distinguished: (a) dictionary-based approaches and (b) machine learning approaches. Most of the existing lexicon-based approaches use the TAPoR site as a basic dictionary to detect tools mentioned in DH abstracts [1, 12], DH tutorials [11] and DH journal articles [5]. Some of the key problems of dictionary-based approaches are the static list of tools and the great number of false positives that are created by highly ambiguous tool names (for instance Python or R). To overcome these limitations of tool dictionaries, machine learning approaches are a promising alternative, for which we already find some first attempts in the DH community.

1 https://programminghistorian.org/
2 https://wiki.digitalmethods.net/Dmi/ToolDatabase
3 https://marketplace.sshopencloud.eu/
4 Tweet from tapordotca, April 22nd, 2022: https://twitter.com/tapordotca/status/1517564033519345664?s=21&t=Mj6Hk76pigAxGtK8iaUuKA

[14] use a combination of named entity recognition frameworks and manual filtering to detect software citations in German DHd abstracts. [25] also train an NER approach, which gives good results for a specific area of application. Some first experiments on using Transformers for the detection of tools in DH publications have been conducted by [5], who successfully use BERT embeddings to expand a static list of tools.
Outside the DH domain, there are also many examples of pre-trained models for the software entity recognition task [19, 8], but most of these come from the natural sciences and are therefore only moderately suitable for use in Digital Humanities publications. In this paper, we present a new model based on the Transformer architecture RoBERTa [18] to reliably detect tools in DH publications and beyond (henceforth referred to as the binary model). Furthermore, we enable the classification of tools into basic usage categories (analysis, capturing, creation, etc.; henceforth referred to as the multi-class model), something that none of the aforementioned approaches has considered so far.

3. Data

In this study, we use two datasets for building and evaluating our models. The first corpus (cor-1) consists of 3,737 English-language publications from three DH journals (Computers and the Humanities, Digital Humanities Quarterly, Literary and Linguistic Computing/Digital Scholarship in the Humanities) published between 1966 and 2020. We use cor-1 as our main data input for the training and evaluation of the model. In addition, we use another existing corpus (cor-2) that has been used by [26] to train a machine-learning approach for extracting software mentions from scholarly articles. The corpus comprises approx. 1.9 million sentences from ADHO DH conference abstracts (2015 and 2020) and PLOS ONE papers from Linguistics and Sociology5 and is used as a model-external data source to test the generalization capabilities of our model.

cor-1: This dataset was compiled by the authors and contains no annotations. To generate training, testing, and validation data from cor-1 we need to identify passages with tool mentions. A manual extraction of text passages is not feasible given the amount of text documents available. To obtain high-quality training data, we first searched for particularly popular tools using a string-based search. Popular tools were identified by going through existing lists and tutorials that document DH tools (see Table 1) and selecting only those tools that were mentioned in at least two different lists. From these tools, we then removed those that exhibited a high degree of ambiguity to further improve quality. This reduced lexicon, with a total of 246 tools, served as the starting point for creating the training data. In the next step, all papers in the corpus were tokenized and searched with the tool lexicon. Whenever a tool was found, the corresponding passage was transferred to the training dataset as a text snippet with 15 tokens before and 15 tokens after the tool mention, forming a sequence-label pair.

Example: [analysis of the sample tweets . Students evaluated each of the default analysis tools in] [Voyant] [, and were given examples of how to use the visualizations , statistics , and]

5 The dataset is available from https://gitlab.gwdg.de/sshoc/data-ingestion/-/tree/master/repositories/extraction. Note that cor-2 is already split into sentences, which differs from the format we used in cor-1.
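To make the construction of these sequence-label pairs concrete, the following is a minimal sketch of the extraction step, not the authors' original code: the window size of 15 tokens and the labels are taken from the description above, while the function name, the token-level matching, and the variables tool_lexicon and documents are illustrative assumptions.

```python
# Sketch of the training-sequence extraction described in Section 3.
# `documents` is assumed to be a list of tokenized papers (lists of tokens);
# `tool_lexicon` is the reduced list of 246 tool names.

WINDOW = 15  # tokens before and after the tool mention

def extract_tool_sequences(documents, tool_lexicon):
    lexicon = set(tool_lexicon)
    sequences = []
    for tokens in documents:
        for i, token in enumerate(tokens):
            if token in lexicon:
                left = tokens[max(0, i - WINDOW):i]
                right = tokens[i + 1:i + 1 + WINDOW]
                snippet = " ".join(left + [token] + right)
                sequences.append((snippet, "tool"))
    return sequences

# Negative "no_tool" examples would then be sampled evenly from the rest of
# the corpus and added to the same list, as described below.
```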
Table 1: List of all sources which were used to compile an initial list of tools. The resulting list was used to create a dataset with training data, using the tool names as search strings.

Name                         Type        URL
Programming Historian        Tutorials   https://programminghistorian.org/
forTEXT                      Tutorials   https://fortext.net/
Teresah                      Catalogue   https://github.com/lehkost/ToolXtractor/blob/master/src/main/resources/tools_teresah.txt
TAPoR                        Catalogue   https://tapor.ca/home
Digital Methods Initiative   Catalogue   https://wiki.digitalmethods.net/Dmi/ToolDatabase
Alan Liu's DH Toychest       Catalogue   http://dhresourcesforprojectbuilding.pbworks.com
DMI                          Catalogue   https://wiki.digitalmethods.net/Dmi/ToolDatabase
DigiHum                      Catalogue   https://digihum.de/tools/

Such a sequence is annotated with the label tool. Then, the set of tool sequences was supplemented by randomly chosen no_tool examples. Those were gathered from evenly distributed samples from the entire corpus and serve as negative examples of tool occurrences.

For the multi-class training dataset, we used the available categories which TAPoR provides for the majority of the 246 tools and mapped them to their corresponding TaDiRAH [2] annotations in our training set. For those tools that were still lacking a category annotation, we provided a manual TaDiRAH annotation. Since TAPoR sometimes assigns multiple categories to single tools, we selected the – in our opinion – best-fitting category to prevent our classifier from having to deal with multi-label situations.6

cor-2 is used to test the generalization capabilities of our model and to identify external examples which can be used to enrich the training data. In a second experiment, we investigate how the model performance is influenced by this alteration. Each line in cor-2 contains a sentence and an annotation that documents whether there are tool mentions in a particular section of the sentence or not. If any tool occurrences are annotated, we take over the sentence with the label tool. If there are no tool annotations, we label the sentence as no_tool. This dataset therefore offers us the opportunity to define an external gold standard.

6 All the tools and their TaDiRAH categories can be found here: https://git.informatik.uni-leipzig.de/computational-humanities/tools-in-dh/tool-classifier/-/blob/main/resources/tool_lists/tadirah_taxonomy.json

4. Methods

Software and tool entity recognition can be seen as a text classification problem. Our approach is a supervised machine learning method that consists of a binary and a multi-class classification task. The basic idea of the approach is that the mention of a tool in a paper happens in a specific linguistic context, which can be mapped and classified in the form of a sequence embedding using a Transformer-based language model. On the one hand, we try to determine whether a tool is in a sequence (binary task), and on the other hand, we try to assign a tool usage category to a sequence with a tool named in it (multi-class task).

We used a uniform framework for all classification tasks. The superior performance of Transformer-based language models for text classification was one of the main reasons for our technical choices. We experimented with the Transformer-based model RoBERTa-Base as the main input for the classification component [18]. For efficiency reasons, distilRoBERTa-Base was used as a "case-sensitive" and "knowledge distilled" variant, as it produced results comparable to the RoBERTa-Base variant, but with considerably less computational effort.
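As a rough illustration of this setup (a sketch under stated assumptions, not the authors' exact code), the classification component could be instantiated with the Hugging Face transformers library as follows; the checkpoint name distilroberta-base and the label counts are inferred from the description above.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# distilRoBERTa with a sequence-classification head on top of the encoder;
# num_labels=2 for the binary tool/no_tool task (7 for the multi-class task).
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base", num_labels=2
)

# Encode one training snippet and obtain class logits.
inputs = tokenizer(
    "Students evaluated each of the default analysis tools in Voyant , and were given examples",
    truncation=True, return_tensors="pt"
)
logits = model(**inputs).logits
```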
In detail, the solution consists of a Transformer-based encoder which uses the classification token ([CLS]) of the model directly, a linear layer, and an output layer with Softmax activation. For each instance, the encoder converts each token into an embedding vector and adds a special classification token at the beginning of the sequence. The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks [9]. This sequence embedding is then fed into the linear layer and the output layer. The output is a probability vector with values between 0 and 1, which is used to compute a logistic loss during the training phase; the class with the highest probability is chosen during the inference phase. Language models are often trained on very general language and do not represent the language of a particular text source well. Therefore, a fine-tuning step is added and the weights of the language model are adjusted to better support the classification task at hand. In the experiments, different numbers of epochs were used in fine-tuning to test the generalization: 1.0 and 0.2 epochs were used with a batch size of 16. We also experimented with other numbers of epochs but found only one setting to be superior across all experiments. Because of that, we only used the reported epoch numbers for the experiments in this paper.

The experiment design in our study consists of several parts that build on each other. In the first step, we use only the training data from cor-1. Following this step, we assess automatically classified false positive (FP) examples from cor-2 to qualitatively investigate errors in the model. In a second experiment, we enrich our training data with the identified FP examples from cor-2 and re-evaluate the approach.

Experiment 1: In our first experiment, the Transformer model is fine-tuned on the training data. Instead of using a static train, test and validation split, the datasets for each of the two classification tasks were split into five training and test sets according to the principle of a 5-fold cross-validation. This was done using the KFold component of the scikit-learn framework7. Finally, the dataset consists of 18,898 sequence-label pairs divided into 5 folds. For each round of cross-validation, four folds were used for training and the fifth fold was used as a test dataset for validation. The experiment tests performance individually for the binary and multi-class cases on cor-1. We report the average performances for each split of cor-1 in the evaluation section. We also evaluate the binary case of experiment 1 twice: on cor-2, we apply the binary model and use the existing annotations of this dataset to evaluate whether our classifier can correctly determine the existence of a tool.

7 https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
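A minimal sketch of this cross-validation and fine-tuning setup is given below. It assumes the sequence-label pairs from Section 3 and the model and tokenizer sketched in Section 4; the batch size of 16 and the 1.0 (or 0.2) epochs are taken from the text, whereas the Trainer-based loop, the hypothetical encode() helper, and all remaining argument values are illustrative.

```python
from sklearn.model_selection import KFold
from transformers import Trainer, TrainingArguments

# `dataset` is assumed to be the list of 18,898 (text, label) pairs.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kfold.split(dataset)):
    train_split = [dataset[i] for i in train_idx]
    test_split = [dataset[i] for i in test_idx]

    args = TrainingArguments(
        output_dir=f"fold-{fold}",
        per_device_train_batch_size=16,  # batch size reported in the paper
        num_train_epochs=1.0,            # 1.0 (or 0.2) epochs of fine-tuning
    )
    # In practice the model would be re-initialized for every fold; encode()
    # is a hypothetical helper that tokenizes the snippets into a Dataset.
    trainer = Trainer(model=model, args=args,
                      train_dataset=encode(train_split),
                      eval_dataset=encode(test_split))
    trainer.train()
    print(fold, trainer.evaluate())
```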
Qualitative Model Assessment: In a second step, we use this model to identify additional tool mentions from cor-1 and cor-2 which are not yet annotated. In detail, we look at where the model makes mistakes and which text components are relevant for the classification. This allows us to better understand what the model learns and what consequences our modeling decisions have.

Experiment 2: In the second experiment, we expand the training dataset. Based on the evaluations from experiment 1 and the qualitative tests, we repeat the evaluation under different conditions. For this purpose, we include the false positive examples from cor-2 in the training data.

5. Evaluation

In this section we present the evaluation of the classification results in terms of precision, recall and F1. Furthermore, we investigate the multi-class case with micro- and macro-averaged F1 values. For the first experiment, we document the binary and the multi-class case. In the qualitative evaluation, we show some examples and explain the decisions of the classifier. Based on these observations we adapt the model and repeat the evaluation.

As mentioned before, we based the evaluation in our first experiment on a 5-fold cross-validation restricted to cor-1. Based on the averaged metrics, the result of the binary classification is very good in the corpus-internal evaluation. The detailed results are shown in Table 2.

Table 2: Evaluation results of the binary model on our corpora. The model was evaluated on cor-1 via 5-fold cross-validation; cor-2 could be evaluated via the annotations it contains. With +500 we denote the evaluation of the model that was trained with 500 additional FP examples from cor-2. Classification results are averages of the 5-fold cross-validation.

Corpus       Precision   Recall   F1
cor-1        0.97        0.99     0.98
cor-2        0.57        0.88     0.69
cor-1 +500   0.96        0.99     0.97
cor-2 +500   0.67        0.90     0.77

For the multi-class classification, Table 3 shows the results. With an accuracy of 0.974 the model shows excellent performance within cor-1. Although the recall is below 0.9 in two categories, the overall precision is very high.

Table 3: Evaluation results of the multi-class model. The results are averages produced with a 5-fold cross-validation on cor-1.

Class              Precision   Recall   F1
Analysis           0.946       0.974    0.960
Capture            0.984       0.964    0.972
Creation           0.932       0.852    0.890
Dissemination      0.950       0.972    0.960
Enrichment         0.966       0.936    0.952
Storage            0.954       0.848    0.898
no_tool            0.992       0.984    0.990
Accuracy (Micro)                        0.974
Macro              0.956       0.938    0.950

Based on these promising results, we wanted to test how the model behaves when transferred to another text type. Since the annotations in cor-2 indicate whether a sentence contains a tool, we can easily evaluate the binary approach outside the training data. The results are documented in Table 2. Although we achieve a very high recall in the classification of tools, the precision is very low here. This can only be due to false positive classifications made in cor-2. For this reason, we look at the classification decisions in some more detail in the next step and take a closer look at the false positive decisions found in cor-2. We inspected the model's output using the Captum-based Transformers Interpret library [16, 21]. This library calculates weights that indicate how individual words in a sequence contribute to a classification; positive values are associated with the target class. In Table 4 we show three examples taken from the output of Transformers Interpret.

Table 4: Illustration of three selected examples of FP classifications from cor-2. The background colors represent the weights for the positive and negative class: terms contributing to the positive class are green, and red is assigned to terms contributing to the negative class.

Example A: Diff erences in demographic and other characteristics noted there were examined using a Chi Square test , and differences were considered significant if p < 0 . 05 .
Example B: The angle display was realized simply by two bars , which could be opened or closed by left and right mouse clicks .

Example C: Four of the cafeteria diet food items were administered per day and variety of diet was maintained by alternating food items daily .

The separation of the tokens was determined directly from the Transformer, which uses subword tokens. This results in fragmented single words in some representations (see "Diff erences" in Example A). The FP predicted in cor-2 are almost all descriptions of a methodological approach. In these examples, one can recognize a certain pattern of syntactic components relatively well. In all examples, and also in many FP classifications from cor-2, there are constructions of nouns and proper nouns followed by verbs that indicate a methodological activity. In our examples, we find were examined using, were considered significant if, was realized simply by, be opened or closed by, were administered per and was maintained by. This surface form is very similar to the constructions used in the naming of tools. In fact, we could simply insert a tool at the end of many of these phrases and the sentences would make perfect syntactic sense. We believe that all of these phrases are standardized language constructs to document a procedural approach. Since the naming of tools typically belongs to the methodological part of the text, the descriptions of tools and of what was done with them strongly resemble these passages.

The lesson to be learned here is that our classifier learns more about syntactic patterns or sentence constructions than about the semantic representations of the entity references they contain. We conclude that since cor-2 covers a broader scientific domain, it contains more such constructs that would not meet the definition of a tool when annotated. While examples A-C demonstrate how syntactic constructions influence the classification model in a way that is likely to produce false positives, there are also many examples of words that have higher weights and that actually imply some realistic tool context, for instance measure, (arithmetic) mean, assess, score or analyses.

After showing that the false positive classifications stem from a somewhat more restrictive definition of what is and is not a tool, we can still evaluate whether the results can be improved by expanding the training dataset with the false positive classifications from cor-2. We re-trained the model with 500 additional data points and evaluated it again on cor-2. The results show a boost in quality, which also shows that our model can easily be adapted to other definitions of tool mentions. The experiments with extended training samples show that we can extend our training data with another dataset without losing any quality on cor-1. On the contrary, we gain almost 0.08 points in F1 on cor-2 and the precision increases significantly. Although both datasets were created and annotated with a slightly different idea of what a tool is, we were able to create a reliable classifier. We assume that the loss of quality on cor-2 is also due to the fact that we do not have sequences, but sentences, which can sometimes be very short.
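The word attributions discussed in this section can be reproduced with the Transformers Interpret library. The following is a minimal sketch, assuming the fine-tuned binary model and tokenizer from Section 4; the example sentence is taken from Table 4, everything else is illustrative.

```python
from transformers_interpret import SequenceClassificationExplainer

# The explainer wraps the fine-tuned model together with its tokenizer.
explainer = SequenceClassificationExplainer(model, tokenizer)

sentence = ("The angle display was realized simply by two bars , "
            "which could be opened or closed by left and right mouse clicks .")

# Returns (sub)word tokens with attribution weights; positive weights
# contribute to the predicted class (here a spurious "tool" prediction).
word_attributions = explainer(sentence)
for token, weight in word_attributions:
    print(f"{token}\t{weight:+.3f}")
print("predicted class:", explainer.predicted_class_name)
```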
6. Model Applications – Examples

In this section we demonstrate how our classifier can be used to detect and classify tools in cor-1 – a corpus of three popular DH journals. With the detection of tools in scientific papers we can learn more about methodological standards and software packages that are being used within a certain discipline, such as the DH. Furthermore, the categorization of tool occurrences is useful to understand typical scholarly activities of a discipline and to investigate which activities have been digitally enhanced so far.

6.1. Application of the binary model

As was argued in the beginning, the TAPoR 3.0 tool repository already includes a great number of tools. However, using our binary classifier on cor-1, we found numerous examples of tools that are currently not included in TAPoR 3.0. Our approach produces sequences of 31 tokens and labels those as tool or no_tool. To be able to extract the actual tool entities from these sequences, we used the Stanford Stanza POS tagger [22] and generated a list of all the proper nouns contained in the sequences (see the sketch at the end of this subsection). Besides tool entities, these also include place and person names, organizations and institutions, etc. We looked at every proper noun with a document frequency > 3 and removed any non-tool entities from the list. This way we were able to detect around 100 tools that are not yet reported in TAPoR 3.0.8

8 Note that we also identified around 50 tool candidates that need some more close reading to verify if they are actual tools or some other named entity.

In the following we present the main types of tools (manually assigned by us) that were identified by our classifier.9

• markup technologies (TEI, XML, SGML, HTML, XSLT, XPointer, etc.)
• metadata standards (Dublin Core, CIDOC, MARC, etc.)
• programming languages (Java, BASIC, Pascal, COBUILD, PASCAL, PHP, etc.)
• authoring tools (HyperCard, Photoshop, Storyspace, PageMaker, etc.)
• operating systems (DOS, UNIX, Linux, Windows, etc.)
• web browsers (Netscape, Lynx, Internet Explorer, etc.)
• web services (Google, Google Books, Google Scholar, Google Earth, Google Maps, Facebook, YouTube, Wikipedia, Yahoo, Flickr, Gopher, Internet Archive, etc.)
• NLP (WordNet, TreeTagger, etc.)
• crowdsourcing (Mechanical Turk, Transcribe Bentham, etc.)
• corpora and databases (Perseus, Project Gutenberg, EEBO, Europeana, HathiTrust, ProQuest, JSTOR, etc.)
• statistics (SPSS, Zeta, etc.)
• word processors (Microsoft Word, WordPerfect, WordStar, WordNet, Wordstar, MS Word, etc.)
• infrastructures (CLARIN, DARIAH, etc.)
• hardware (Kinect, iPad, etc.)

9 For a list of all tool entities see https://docs.google.com/spreadsheets/d/1ZMV3kFAF2mooclIlvW_0gJYSh68FlE_jgGhyeT-VVqc/edit?usp=sharing

One reason for these new tool discoveries could well be our very broad definition of digital tools, which might be somewhat narrower in TAPoR 3.0. For future analyses of tools in DH, and also for further fine-tuning of the classifier, we will need to discuss what should be counted as a tool and what not. While we believe markup technologies and programming languages could be classified as "tools for making" (as opposed to tools for exploring and thinking) [3], web services and corpora might be harder to justify as actual tools.
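As referenced above, the following is a minimal sketch of the proper-noun extraction over sequences labelled as tool, assuming Stanza's default English pipeline. Note that, for brevity, frequencies are counted over sequences here, whereas the paper filters by document frequency over the journal articles; the variable tool_sequences is an assumption.

```python
from collections import Counter
import stanza

# English pipeline with POS tagging; `tool_sequences` is assumed to hold the
# 31-token snippets that the binary model labelled as "tool".
nlp = stanza.Pipeline("en", processors="tokenize,pos")

freq = Counter()
for sequence in tool_sequences:
    doc = nlp(sequence)
    propns = {word.text for sent in doc.sentences
              for word in sent.words if word.upos == "PROPN"}
    freq.update(propns)  # count each proper noun once per sequence

# Keep candidates above the frequency threshold; non-tool entities such as
# persons, places, and organizations were then removed manually.
candidates = sorted(w for w, f in freq.items() if f > 3)
```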
6.2. Application of the multi-class model

Our multi-class model was able to classify the tool sequences into six of the seven main TaDiRAH categories for tool types. "Interpreting" was not included in the model, as we did not have enough examples in our training data. Interestingly, tools for "Analysis" are by far the most frequent tools, followed by tools for "Dissemination" and "Enrichment" (see Table 5).

Table 5: Overview of TaDiRAH categories and how often they were assigned to tool sequences in our corpus.

Category        Frequency
Analysis        8,121
Dissemination   1,548
Enrichment      1,289
Capture         801
Storage         636
Creation        482

To see how these categories have evolved through time, we have plotted the frequencies diachronically (see Figure 1).

Figure 1: Diachronic overview of the identified TaDiRAH categories. The Count axis represents the normalized document counts of a category's occurrence in a year. Normalization was done using the total document count within a year.

While we can observe a general diachronic upwards trend for most of the categories, it is noticeable that tools for analysis are not only the most frequent tool category in total, but that they are also more or less consistently mentioned in DH publications throughout the whole time span – with a general upwards trend, similar to the other categories. Tools for analysis include examples from many different areas, including network analysis (Gephi, Cytoscape, NodeGoat, networkX), natural language processing (NLTK, Stanford POS Tagger, Gensim), text analysis (Voyant, AntConc), geo information systems (ArcGIS, Leaflet), music analysis (music21) and many more. This observed dominance of tools for analysis is also interesting for future work, as we believe this is the category of tools that has the most potential for epistemological shaping of scholarly research processes, which means we will have a closer look at this category in follow-up studies.

7. Conclusion & Future Work

In this paper we have presented a Transformer-based classifier for the detection and classification of digital tools. The task of tool detection worked very well (F1 = 0.978) for a large corpus of DH journals, which also was the basis of the training data. Tool detection also worked well for another corpus that includes ADHO abstracts and PLOS ONE articles from Sociology and Linguistics, with an F1 score of 0.69 and an increase to 0.77 after fine-tuning the model with 500 false positive examples from the novel corpus. It can be concluded that the model already yields good results in the domain of DH journal publications, and can also be successfully adapted to other domains.

We plan to use this model for follow-up studies in which we want to investigate the actual tool results and their epistemological implications for DH scholars in a more detailed way. First, we plan to enhance our current DH corpus by adding some more recent journals such as the "Journal of Cultural Analytics"10 or the "International Journal of Digital Humanities"11. We also plan to add abstracts from past DH conferences, which are available from the "Index of Digital Humanities Conferences"12. We will evaluate the performance of our model on such an enlarged corpus and fine-tune it with further training data if necessary. As was mentioned in the previous section, we will also have to agree upon a working definition of actual tools [see 3] as opposed to digital resources, services, infrastructures, etc. We certainly plan to correlate the tools and tool categories with further metadata such as geolocation of authors, gender of authors, disciplinary background of authors, keywords and LDA topics. By means of such large-scale analyses we hope to be able to contribute to the historiography of DH through the lens of what has been called "tool science" [24].

10 https://culturalanalytics.org/
11 https://www.springer.com/journal/42803
12 https://dh-abstracts.library.virginia.edu/

References

[1] L. Barbot, F. Fischer, Y. Moranville, and I. Pozdniakov. Which DH Tools Are Actually Used in Research? 2019. url: https://weltliteratur.net/dh-tools-used-in-research/.
[2] L. Borek, Q. Dombrowski, J. Perkins, and C. Schöch. "TaDiRAH: a Case Study in Pragmatic Classification". In: Digital Humanities Quarterly 10.1 (2016).
[3] J. Bradley. "Digital tools in the humanities: Some fundamental provocations?" In: Digital Scholarship in the Humanities 34.1 (2019), pp. 13–20.
[4] M. Burghardt and J. Luhmann. "Same same, but different? On the Relation of Information Science and the Digital Humanities: A Scientometric Comparison of Academic Journals Using LDA and Hierarchical Clustering". In: (2021).
[5] M. Burghardt, J. Luhmann, and A. Niekler. "Tools as Epistemologies in DH? A Corpus-Based Exploration". In: Book of Abstracts of the ADHO Digital Humanities Conference. Tokyo, 2022.
[6] V. Bush. "As We May Think". In: The Atlantic (1945).
[7] M. Dalbello. "A genealogy of digital humanities". In: Journal of Documentation 67.3 (2011), pp. 480–506.
[8] D. Schindler, F. Bensmann, S. Dietze, and F. Krüger. "The role of software in science: a knowledge graph-based analysis of software mentions in PubMed Central". In: PeerJ Computer Science 14.8 (2022).
[9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp. 4171–4186.
[10] Q. Dombrowski. "What Ever Happened to Project Bamboo?" In: Literary and Linguistic Computing 29.3 (2014), pp. 326–339.
[11] F. Fischer and Y. Moranville. DH Tools Mentioned in "The Programming Historian". 2020. url: https://weltliteratur.net/dh-tools-programming-historian/.
[12] F. Fischer and Y. Moranville. Tools mentioned in DH2020 abstracts. 2020. url: https://weltliteratur.net/tools-mentioned-in-dh2020-abstracts/.
[13] K. Grant, Q. Dombrowski, K. Ranaweera, O. Rodriguez-Arenas, S. Sinclair, and G. Rockwell. "Absorbing DiRT: Tool Directories in the Digital Age". In: Digital Studies/le Champ Numérique 10.1 (2020).
[14] U. Henny-Krahmer and D. Jettka. "Softwarezitation als Technik der Wissenschaftskultur: Vom Umgang mit Forschungssoftware in den Digital Humanities". In: DHd2022: Kulturen des digitalen Gedächtnisses. Konferenzabstracts. Potsdam, 2022, pp. 203–206.
[15] B. Kaden. Zur Epistemologie digitaler Methoden in den Geisteswissenschaften. 2016. url: https://zenodo.org/record/50623.
[16] N. Kokhlikyan, V. Miglani, M. Martin, E. Wang, B. Alsallakh, J. Reynolds, A. Melnikov, N. Kliushkina, C. Araya, S. Yan, and O. Reblitz-Richardson. "Captum: A unified and generic model interpretability library for PyTorch". 2020. url: https://arxiv.org/abs/2009.07896.
[17] J. C. R. Licklider. "Man-Computer Symbiosis". In: IRE Transactions on Human Factors in Electronics HFE-1.1 (1960), pp. 4–11.
[18] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. 2019. arXiv: 1907.11692 [cs]. url: http://arxiv.org/abs/1907.11692.
[19] P. Lopez and L. Romary. "GROBID - Information Extraction from Scientific Publications". In: ERCIM News 100 (2015).
[20] J. Luhmann and M. Burghardt. "Digital humanities–A discipline in its own right? An analysis of the role and position of digital humanities in the academic landscape". In: Journal of the Association for Information Science and Technology 73.2 (2022), pp. 148–171.
[21] C. Pierse. Transformers Interpret. Version 0.5.2. Feb. 14, 2021. url: https://github.com/cdpierse/transformers-interpret.
[22] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, and C. D. Manning. "Stanza: A Python Natural Language Processing Toolkit for Many Human Languages". In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Online: Association for Computational Linguistics, 2020, pp. 101–108.
[23] S. Ramsay and G. Rockwell. "Developing Things: Notes toward an Epistemology of Building in the Digital Humanities". In: Debates in the Digital Humanities. Minneapolis; London: University of Minnesota Press, 2012.
[24] C. Wolff. "The case for teaching "tool science": Taking software engineering and software engineering education beyond the confinements of traditional software development contexts". In: 2015 IEEE Global Engineering Education Conference (EDUCON). Tallinn, Estonia: IEEE, 2015, pp. 932–938.
[25] A. Zarei, S.-B. Yim, F. Fischer, M. Ďurčo, and P. Wieder. "Measuring the Use of Tools and Software in the Digital Humanities: A Machine-Learning Approach for Extracting Software Mentions from Scholarly Articles". In: Book of Abstracts, ADHO DH Conference. Tokyo, 2022.
[26] A. Zarei, S.-B. Yim, M. Ďurčo, K. Illmayer, L. Barbot, F. Fischer, and E. Gray. "Der SSH Open Marketplace: Kontextualisiertes Praxiswissen für die Digital Humanities". In: DHd2022: Kulturen des digitalen Gedächtnisses. Konferenzabstracts. Potsdam, 2022.