A Workflow for Integrating Close Reading and Automated Text Annotation

Maciej Janicki¹, Eetu Mäkelä¹, Anu Koivunen², Antti Kanner¹, Auli Harju², Julius Hokkanen²,
and Olli Seuri³

¹ Department of Digital Humanities, University of Helsinki
² Faculty of Social Sciences, Tampere University
³ Faculty of Information Technology and Communication Studies, Tampere University


Abstract. We present a workflow for Digital Humanities projects that makes it possible
to combine close and distant reading, as well as automated text annotation, in an
iterative process. We rely on mature tools and technologies, such as R, WebAnno and
Prolog, combined in a highly automated pipeline. Such an architecture copes well with
underspecified and frequently changing requirements and allows a continuous exchange
of information between the computational and domain experts at all stages of the
project. The workflow is illustrated with a concrete example concerning news media
analysis.

Keywords: Workflow, Close Reading, Distant Reading, Interdisciplinary Cooperation, CSV.


1     Motivation
Digital Humanities projects often involve application of language technology or ma-
chine learning methods in order to identify phenomena of interest in large collections
of text. However, in order to maintain credibility for humanities and social sciences, the
results gained this way need to be interpretable and investigable and cannot be detached
from the more traditional methodologies which rely on close reading and real text com-
prehension by domain experts. The bridging of those two approaches with suitable tools
and data formats, in a way that allows a flow of information in both directions, often
presents a practical challenge.
     In our research, we have developed an approach to digital humanities research that
allows combining computational analysis with the knowledge of domain experts in all
steps of the process, from the development of computational indicators to final
analysis [4]. Put succinctly, our approach hinges on creating, as early as possible, an
environment where both the research data and any computational enrichments and analyses
done on it can be shown, pointed to and discussed, from the perspective of the domain
experts as well as that of the computational experts. Further, because at the start of
the project neither the computational indicators nor the axes of analysis are yet
finalized, the environment must support easy iterative updating.
     In this poster, we describe a particular implementation of this approach, as it
appears in the project Flows of Power: media as site and agent of politics. This project
is a collaboration between journalism scholars, linguists and computer scientists aimed
at the analysis of political reporting in Finnish news media over the last two decades.
We study both the linguistic means that media use to achieve certain goals (like
appearing objective and credible, or appealing to the reader’s emotions) and the
structure of the public debate reflected there (which actors get a chance to speak and
how they are presented). Here we focus in particular on the technical aspects, as they
relate to 1) enabling interaction between different elements of our development and
analysis environment and 2) enabling iterative development.

Copyright © 2021 for this paper by its authors. Use permitted under
Creative Commons License Attribution 4.0 International (CC BY 4.0).


2      Software and Data Formats
As many research questions in our project concern linguistic phenomena, a Natural
Language Processing pipeline is highly useful. We employ the Turku neural parser
pipeline [2], which provides dependency parsing, along with lower levels of annota-
tion (tokenization, sentence splitting, lemmatization and tagging). Further, we apply
the rule-based FINER tool [5] for named entity recognition.

R and Shiny. Our primary toolbox for statistical analysis is R. This motivates using the
‘tidy data’ CSV format [6] as our main data format. In order to keep the number and
order of columns constant and predictable, only the results of the dependency parsing
pipeline are stored together with the text, in a one-token-per-line format very similar
to CoNLL-U.⁴ All additional annotation layers, beginning with named entity recognition,
are relegated to separate CSV files, where tuples like (documentId, sentenceId,
spanStartId, spanEndId, value) are stored. Such tabular data are easy to manipulate
within R.
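
As a minimal sketch of this layout (written in Python rather than R purely for illustration), the snippet below joins a span annotation table against a one-token-per-line table. Only the tuple (documentId, sentenceId, spanStartId, spanEndId, value) comes from the description above; the columns tokenId and word, the sample tokens, the label ORG and the assumption that span boundaries are inclusive are all invented for this example.

```python
import csv
import io

# Hypothetical one-token-per-line corpus table (a small subset of CoNLL-U-like columns).
TOKENS_CSV = """documentId,sentenceId,tokenId,word
doc1,1,1,University
doc1,1,2,of
doc1,1,3,Helsinki
doc1,1,4,announced
"""

# Hypothetical annotation layer, stored separately from the corpus table.
NER_CSV = """documentId,sentenceId,spanStartId,spanEndId,value
doc1,1,1,3,ORG
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def span_texts(tokens, spans):
    """Return (value, covered text) for every span, assuming inclusive token ids."""
    words = {
        (t["documentId"], int(t["sentenceId"]), int(t["tokenId"])): t["word"]
        for t in tokens
    }
    result = []
    for s in spans:
        doc, sent = s["documentId"], int(s["sentenceId"])
        ids = range(int(s["spanStartId"]), int(s["spanEndId"]) + 1)
        result.append((s["value"], " ".join(words[(doc, sent, i)] for i in ids)))
    return result


annotated = span_texts(read_rows(TOKENS_CSV), read_rows(NER_CSV))
print(annotated)
```

Because the annotation layer refers to tokens only by their identifiers, further layers can be added as new files without changing the columns of the corpus table.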
    In terms of applications and interfaces, we favour web applications over locally
installed ones. First, they have a lower barrier to entry, being usable from any machine
with a web browser. Second, they allow easier sharing of views with other project
participants through copying and pasting of stable URLs. This is important because, in
our approach to the research process, the focused sharing of examples, both of in-domain
objects of interest and of the results of automated processing, plays a crucial part in
building a common understanding and in guiding development and analysis. Finally,
because our work requires iterative development, it is much easier to update both data
and functionality once, centrally, than to have everyone download newer versions of data
and programs all the time.
    Based on the above considerations, for distant reading and discovery of statistical
patterns, we rely on a Shiny⁵ Web application that we developed ourselves (Fig. 1). It
allows easy access to aggregate views of the dataset based on variables like the
proportion of quotes, affective expressions, or other automatically generated
annotations. The scatterplot view (Fig. 1, right) is useful for drilling down into
examples from various parts of the distributions, and particularly for detecting and
exploring outliers. In this view, each point represents an article, and clicking on it
shows detailed information, including the headline, source and a link to our close
reading interface (see below). This functionality is illustrated in the figure by the
red frames: the upper one shows the point that has been clicked, and the lower one the
details of that article. Through this, researchers are able both to interpret what
particular portions of the distribution mean in practice in the texts they represent,
and to examine whether outliers are caused by errors in our processing pipeline or by
real in-domain phenomena of interest.

 ⁴ https://universaldependencies.org/format.html
 ⁵ Shiny is a framework for building Web-based user interfaces in R.


             Fig. 1. The Shiny app for explorative analysis of statistical patterns.


WebAnno. For the visualization of automatic annotations, close reading and manual
annotation, we decided to employ WebAnno [1].⁶ While this tool was originally intended
for the creation of datasets for language technology tasks, its functionality is
designed to be very general, which has enabled its use in a wide variety of projects
involving text annotation.⁷ In addition to the usual linguistic layers of annotation,
like lemma or head, it allows the creation of custom layers and feature sets. WebAnno
has a simple but powerful visualization facility: annotations are shown as highlighted
text spans, feature values as colorful bubbles over the text, and the various annotation
layers can be shown or hidden on demand (Fig. 2). This kind of visualization does not
disturb close reading. It lets the reader concentrate on the features that are currently
of interest, while retaining the possibility to look into the whole range of available
annotations.
     WebAnno supports several data formats for import and export. All of them assume
one document per file. Among others, different variants of the CoNLL format are
supported. WebAnno-TSV is WebAnno’s own tab-separated text format, which, as opposed to
CoNLL, includes support for custom annotation layers. Because it is a text format and
is well documented, we were able to implement a fully automatic bidirectional conversion
between our corpus-wide, per-annotation CSV files and per-document WebAnno-TSV files.
Thus, using WebAnno as an interface to interact with the domain experts who perform
close reading and manual annotation, we are able to exchange our results quickly and
with a high degree of automation.
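
The bidirectional conversion can be pictured as a split and a merge over the documentId column. The following Python sketch shows only that grouping step under simplifying assumptions: the per-document rows stand in for what would be written to and read from WebAnno-TSV files, whose actual syntax is not reproduced here, and the function names and sample rows are hypothetical.

```python
from collections import defaultdict

FIELDS = ("documentId", "sentenceId", "spanStartId", "spanEndId", "value")


def split_by_document(rows):
    """Group corpus-wide annotation rows into one list of rows per document."""
    per_document = defaultdict(list)
    for row in rows:
        per_document[row["documentId"]].append(
            {key: row[key] for key in FIELDS if key != "documentId"}
        )
    return dict(per_document)


def merge_documents(per_document):
    """Inverse step: rebuild the corpus-wide table from per-document rows."""
    rows = []
    for doc_id in sorted(per_document):
        for row in per_document[doc_id]:
            rows.append({"documentId": doc_id, **row})
    return rows


# Invented sample of a corpus-wide annotation table, sorted by document.
corpus_rows = [
    {"documentId": "doc1", "sentenceId": 1, "spanStartId": 1, "spanEndId": 3, "value": "ORG"},
    {"documentId": "doc2", "sentenceId": 4, "spanStartId": 2, "spanEndId": 2, "value": "LOC"},
]
per_document = split_by_document(corpus_rows)
round_trip = merge_documents(per_document)
```

In such a scheme, annotations added manually to one document would show up after the merge as additional rows carrying that document's identifier.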

 ⁶ https://webanno.github.io/webanno/
 ⁷ see: https://webanno.github.io/webanno/use-case-gallery/

Fig. 2. WebAnno displaying the automatically obtained annotation layers ‘named entity’
(grayish blue), ‘hedging/threat’ (orange) and ‘indirect quote’ (purple).

     One problem we initially had with integrating WebAnno into our ecosystem was that,
while WebAnno did allow linking to each document by URL, these URLs were based on
internal IDs that changed each time we reloaded the data (which, in our iterative
development process, happened frequently). In order to increase the interoperability of
WebAnno with our other tools, we contributed patches that allow projects and documents
to be referenced by name instead of by these internal IDs. Thus, a document named doc
within a project named project can be accessed via the URL:
    http://webanno-instance-url/annotation.html?#!pn=project&n=doc
This way, the WebAnno view of an annotated document can be easily linked to from any
other tool just by knowing the document name.
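
For illustration, a tool that wants to link to the close reading view could assemble such a URL with a small helper like the one below. This is a sketch in Python: the function name is hypothetical, and percent-encoding the project and document names is an assumption of this example rather than documented WebAnno behaviour.

```python
from urllib.parse import quote


def webanno_document_url(base_url, project, document):
    """Build a name-based link to a document, following the URL pattern quoted above."""
    # Percent-encoding the names is an assumption of this sketch, added so that
    # names containing spaces or '&' do not break the link.
    return "{}/annotation.html?#!pn={}&n={}".format(
        base_url.rstrip("/"), quote(project, safe=""), quote(document, safe="")
    )


url = webanno_document_url("http://webanno-instance-url", "project", "doc")
print(url)
```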

Prolog. Finally, some automatic annotations are produced by rule-based approaches
implemented in Prolog. Thus, another document representation that we utilize is a set
of Prolog predicates encoding the sentence structure and the linguistic annotation. A
diagram of the complete back-end with all the data formats employed is shown in Fig. 3.
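
To give an idea of what such a representation might look like, the Python sketch below serializes tokens as Prolog facts. The predicate token/8, its argument order and the sample sentence are hypothetical and chosen for illustration only; the predicates actually used in the project are not specified here.

```python
def prolog_atom(text):
    """Quote a string as a Prolog atom, escaping backslashes and single quotes."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def token_fact(doc, sent, tok, word, lemma, upos, head, deprel):
    """Serialize one token as a fact of the hypothetical predicate token/8."""
    return "token({}, {}, {}, {}, {}, {}, {}, {}).".format(
        prolog_atom(doc), sent, tok, prolog_atom(word), prolog_atom(lemma),
        prolog_atom(upos), head, prolog_atom(deprel),
    )


# Invented two-token sentence with made-up dependency analysis values.
sentence = [
    ("doc1", 1, 1, "Ministers", "minister", "NOUN", 2, "nsubj"),
    ("doc1", 1, 2, "agreed", "agree", "VERB", 0, "root"),
]
facts = [token_fact(*row) for row in sentence]
print("\n".join(facts))
```

A rule-based annotator could then match patterns over such facts, for example by following the head argument from a reporting verb to its dependents.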


[Fig. 3 diagram not reproduced. Components: NLP pipeline (Turku Neural Parser); FINER
wrapper; CSV; R (statistical analysis, plotting); format conversion; WebAnno-TSV;
WebAnno (visualization, close reading); Prolog; rule-based annotation (e.g. indirect
quote detection). Edge labels: corpus, annotations, NER annotations, document.]

Fig. 3. A dataflow diagram of the back-end.


3   Case Study: Affective and Metaphorical Expressions in Political
    News
We applied the methodology outlined above in a recently conducted case study. The
subject of the study was the use of affective and metaphorical language in a media
debate about a controversial labour market policy reform, the so-called ‘competitiveness
pact’, which was debated in Finland in 2015–16.
    The linguistic phenomenon in question is complex and not readily defined. It is also
context-dependent: ‘the ball is in play’ is metaphorical when it refers to politics, but
not when it refers to sports. There is no straightforward method or tool for the
automatic recognition of such phrases. Therefore, we started the study with a close
reading phase, in which the media scholars identified and marked the phrases they
recognized as affective or metaphorical in the material covering the competitiveness
pact. The marked passages were subsequently post-processed manually to extract single
words with a ‘metaphoric’ or ‘affective’ charge. The list of words obtained this way was
further expanded with their synonyms, obtained via word embeddings. Using this list, we
were able to mark the potential metaphorical expressions in the unread text as well. The
results of this analysis were published in [3].
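
The expansion step can be illustrated with a small Python sketch. It assumes that candidate synonyms are the words whose embedding vectors have a high cosine similarity to a seed word; the exact procedure used in the study is not described here, and the toy vectors, the English example words and the threshold of 0.9 are invented for this example.

```python
import math

# Toy vectors invented for illustration; real embeddings have hundreds of dimensions.
EMBEDDINGS = {
    "storm": [0.9, 0.1, 0.0],
    "tempest": [0.8, 0.2, 0.1],
    "battle": [0.1, 0.9, 0.1],
    "fight": [0.2, 0.8, 0.0],
    "table": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand(seeds, embeddings, threshold=0.9):
    """Add every vocabulary word that is close enough to at least one seed word."""
    expanded = set(seeds)
    for seed in seeds:
        if seed not in embeddings:
            continue  # seed words without a vector are kept but cannot be expanded
        for word, vector in embeddings.items():
            if cosine(embeddings[seed], vector) >= threshold:
                expanded.add(word)
    return expanded


word_list = expand({"storm", "battle"}, EMBEDDINGS)
print(sorted(word_list))
```

Candidates proposed in this way would still need to be checked by the domain experts, since nearest neighbours in an embedding space are not guaranteed to be synonyms.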


4   Conclusion
We have presented a particular realization of our general approach for combining
qualitative and quantitative analysis in a cooperation between computational and social
sciences. The main characteristic of this approach is the use of mature existing tools,
like R, WebAnno, Prolog and the Turku neural parser, for specialized subtasks, while
focusing our own contributions on a pipeline that combines these tools into an
interlinked ecosystem with support for iterative development and focused discussion
between the computer scientists and the domain experts. We find such an architecture to
be more appropriate than custom-built monolithic environments, as the requirements on
the computational toolkit are not known in advance and may change frequently as a result
of the interaction with the data. The high degree of automation allows us to rerun the
data conversion steps frequently, so that the insights gained via close reading feed
back into automatic annotation and statistical analysis. The usability of this workflow
was confirmed in a separately published case study [3].


References
1. Richard Eckart de Castilho, Éva Mújdricza-Maydt, Seid Muhie Yimam, Silvana Hartmann,
   Iryna Gurevych, Anette Frank, and Chris Biemann. A Web-based Tool for the Integrated
   Annotation of Semantic and Syntactic Structures. In Proceedings of the Workshop on
   Language Technology Resources and Tools for Digital Humanities (LT4DH), pages 76–84,
   Osaka, Japan, 2016.
2. Jenna Kanerva, Filip Ginter, Niko Miekka, Akseli Leino, and Tapio Salakoski. Turku neural
   parser pipeline: An end-to-end system for the CoNLL 2018 shared task. In Proceedings of the
   CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies,
   pages 133–142, Brussels, Belgium, October 2018. Association for Computational Linguistics.
3. Anu Koivunen, Antti Kanner, Maciej Janicki, Auli Harju, Julius Hokkanen, and Eetu Mäkelä.
   Emotive, evaluative, epistemic: a linguistic analysis of affectivity in news journalism.
   Journalism, February 2021.
4. Eetu Mäkelä, Anu Koivunen, Antti Kanner, Maciej Janicki, Auli Harju, Julius Hokkanen,
   and Olli Seuri. An approach for agile interdisciplinary digital humanities research — a case
   study in journalism. In Twin Talks: Understanding and Facilitating Collaboration in Digital
   Humanities 2020. CEUR Workshop Proceedings, October 2020.
5. Teemu Ruokolainen, Pekka Kauppinen, Miikka Silfverberg, and Krister Lindén. A Finnish
   news corpus for named entity recognition. Language Resources and Evaluation, August 2019.
6. Hadley Wickham. Tidy data. Journal of Statistical Software, 59(10), August 2014.