Pilot Experiments of Hypothesis Validation Through Evidence Detection for Historians

Chris Stahlhut†‡, Christian Stab†, Iryna Gurevych†
†Ubiquitous Knowledge Processing Lab
‡Research Training Group KRITIS
Darmstadt, Germany
www.ukp.tu-darmstadt.de

ABSTRACT
Historians spend a large amount of time in archives reading documents to pick out the small text quotes they can use as evidence. This is a time-consuming task that automated evidence detection promises to speed up significantly. However, no evidence detection method has been tested on a dataset that contains hypotheses and evidence created by humanities researchers. Furthermore, no research has yet been conducted to understand how historians approach this task of developing hypotheses and finding evidence. In this paper, we analyse the behaviour of 16 students of the humanities in developing and validating hypotheses and show that there is no canonical user: even when given the same exercise, they develop different hypotheses and annotate different text snippets as evidence, and current state-of-the-art argument mining methods are not suitable for historical validation of hypotheses. We therefore conclude that an evidence detection method must be trained interactively to adapt to the user's needs.

KEYWORDS
argument mining, evidence detection, hypothesis validation, information retrieval

1 INTRODUCTION
Research in the humanities involves searching for relevant information in huge text collections. Say, a historian analyses the political discourse after the Chernobyl and Fukushima catastrophes because he or she is working on a project about the economic development of the energy infrastructure in the second half of the 20th century. He or she will spend countless hours carefully studying protocols of political speeches and other documents, most of which do not contain any relevant information. While reading the transcript of a particular speech, the historian formulates the hypothesis "Extending the runtime of nuclear reactors is a monetary source of income"¹. This figurative historian then goes back to the texts he or she read previously to pick out the text snippets, or evidence, that lead him or her to formulate the hypothesis, for instance the statement "A depreciated nuclear reactor that runs one day longer, brings a profit of 1 million Euros." Afterwards, the historian continues to look for more evidence supporting or attacking the hypothesis in the vast number of documents of the political discourse and, if necessary, revises the hypothesis.

¹ All examples were formulated by participants of a user study in German and translated to English by us.

In this example, the historian can benefit greatly from document retrieval, as it would dramatically reduce the number of irrelevant documents to read. However, this only helps to create the bibliography of the source material, which can still be very long. Picking out the few pieces of evidence contained even within this reduced number of documents still takes a lot of time. This task of finding textual sources, or evidence, relevant to a given hypothesis or claim is researched under the name of Evidence Detection (ED).

While ED is extensively studied in the research field of Argument Mining (AM) [6, 10], all existing methods are trained once on a fixed set of training examples, and rarely does an approach focus on researchers in the humanities as users, let alone on how they develop and validate their hypotheses. Moreover, hypotheses might change over time, providing an additional challenge for static models.

In this paper, we present for the first time (1) an analysis of how scholars in the humanities develop and validate their hypotheses, (2) an analysis of the agreement on the evidence annotated by the scholars, and (3) the results of applying a state-of-the-art argument mining model for ED in the context of humanities research.
2 RELATED WORK
Existing approaches in ED focus on finding pieces of evidence to support a claim and on classifying their type, e.g. as statistics, expert opinions, or anecdotal evidence. This can be done to find evidence that supports a claim [6, 10] or to analyse the evidence used in online debates [1].

AM can be separated into two different approaches, namely discourse-level AM and information-seeking AM. The former detects arguments inside the document structure, e.g. in persuasive essays [13]. The latter detects arguments depending on a predefined context [8], e.g. in the case of ED, the hypothesis a piece of evidence is related to.

Fact checking [15] is a related field of growing interest in research. Its goal is to find factual evidence for or against testable statements, for instance on historical events in high school student tests [7]. Neither of these approaches allows for personalisation or focusses on researchers in the humanities as users.

One area of focus in information retrieval is supporting academic work, e.g. by finding related academic literature [5], discovering new literature [12], and recommending literature [4]. While supporting academics in finding documents, none of these approaches considers the evidence contained within the relevant documents. Existing approaches that work at a sub-document resolution are limited to supporting corpus exploration, for instance by showing how the relevance of topics changes over time [11].

Figure 1: Screenshot of EDoHa. The Hypotheses/Evidences view allows a user to define hypotheses and link previously annotated evidence to them.

3 EDOHA
We developed EDoHa (Evidence Detection fOr Hypothesis vAlidation) with the goal of enabling a user to validate their hypotheses with evidence they annotated in a collection of documents.
Figure 1 shows a screenshot of EDoHa in which a user has already defined several hypotheses and created multiple links between hypotheses and evidence. We based EDoHa on the annotation tool WebAnno [2] but developed a user interface which focusses more on casual than on expert users. It consists of the following components:

(1) The Hypothesis/Evidence view allows the user to define and revise hypotheses. In the screenshot, it shows three hypotheses next to each other and the evidence annotations linked to them. The hypothesis is the header and each evidence annotation linked to it is one multi-row cell beneath. Clicking the ⊗ next to an evidence annotation deletes the link between the evidence annotation and the hypothesis; clicking the ⊗ next to the hypothesis deletes the hypothesis.

(2) A list with all evidence annotations, each of which the user can link to one or more hypotheses via Drag & Drop. To avoid showing so much evidence that the user has to search through it, the list of evidence annotations limits its elements to the ones from the currently selected document. When the user selects another document, the evidence annotations are replaced with the ones from the newly selected document.

(3) A list of available documents in which users can annotate the evidence. The currently selected document is shown in green to signify its selection. If the user wishes to see the evidence annotations from all documents, the button "Clear Selection" at the top right corner of the document list unselects the current document so that the list of evidence annotations is no longer limited to a single document. The visible hypotheses and their linked evidence are unaffected by this change.

(4) A Document view in which a user can select the evidence in the source documents (not visible in the screenshot).

Interviews with historians during development showed that they need to see from which document a particular piece of evidence originates. We therefore added a highlighting mechanism to the list of available documents and the Hypotheses/Evidence view. If a user hovers the cursor over an evidence annotation, as illustrated at the bottom of the screenshot, the source document of the evidence and all hypotheses this piece of evidence is linked to are highlighted with a dashed frame. The currently selected document is highlighted with a green frame.

4 USER STUDY
To understand how researchers in the humanities develop and validate their hypotheses and how well they agree on the evidence, we conducted a user study with students of the humanities.

4.1 Setup
We conducted the user study in the context of a historical seminar on environmental catastrophes in the second half of the 20th century. The participants of this seminar were students of history, political science, or sociology in the second or third year of their bachelor studies. The seminar covered different historic events, such as the Chernobyl meltdown, and topics of modern history, such as Waldsterben. The study took place one week after a student's presentation on the Chernobyl meltdown.

The students were asked to compare the argumentation on nuclear energy after the Chernobyl meltdown with the argumentation after the Fukushima catastrophe. We prepared 9 political speeches from the German parliament with an overall length of 479 sentences, 4 after the Chernobyl meltdown and 5 after the Fukushima catastrophe, for the students to analyse, formulate hypotheses, and validate them. The students were able to read all speeches one week beforehand to familiarise themselves with the texts. However, we did not disclose the task of the exercise to them.

Before letting the students work on the task, we gave a short introduction into the usage of EDoHa. Afterwards, we handed out the exercise and answered all questions the students had regarding it.² The students had one hour for the exercise, followed by filling out a questionnaire about their approach to evidence detection and hypothesis validation, whether or not they would like to use EDoHa in their studies, and how to improve it. The session ended with a discussion of the students' findings.

² The exercise sheet also contained the login credentials of previously created accounts (user0 – user20) and cannot be traced back to individual students. It is our understanding of the regulations at our institution that an ethics approval is only required when processing personally identifiable information. Being aware of the delicate nature of such data, we decided to not collect any personally identifiable information and designed the study to be anonymised as described above.

During the experiment, we logged multiple interactions of the users with the system to understand how they develop and validate hypotheses. These interactions are: clicking on a document in the list of available documents, creating and deleting evidence annotations in documents, creating and deleting evidence/hypothesis links, creating and updating hypotheses (reformulating or deleting hypotheses), and changing the view in the interface.
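The exact log format is not part of this paper; as a minimal sketch, the logged interactions could be captured in a record such as the following. All field names and the example values are our own illustrative assumptions, not EDoHa's actual schema.

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    # Hypothetical event record for the interactions listed above; field names
    # and types are assumptions for illustration, not EDoHa's real log schema.
    @dataclass
    class InteractionEvent:
        timestamp: datetime          # when the interaction happened
        user_id: str                 # pseudonymous account, e.g. "user7"
        action: str                  # e.g. "select_document", "create_evidence",
                                     # "delete_evidence", "create_link", "delete_link",
                                     # "create_hypothesis", "update_hypothesis", "change_view"
        document_id: Optional[str] = None    # affected document, if any
        evidence_id: Optional[str] = None    # affected evidence annotation, if any
        hypothesis_id: Optional[str] = None  # affected hypothesis, if any

    # Example: a user links an existing evidence annotation to a hypothesis.
    event = InteractionEvent(datetime.now(), "user7", "create_link",
                             document_id="speech_03",
                             evidence_id="ev_42", hypothesis_id="hyp_2")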
4.2 User behaviour
We used the previously described logs to understand the users' general approach to developing and validating hypotheses. Figure 2 shows the variability in how users annotate evidence and link it to hypotheses. The upper six plots show a strong separation into distinct phases of evidence collection and hypothesis validation, or a phased approach. The first two users never reach the hypothesis validation phase, but the following four always start by collecting multiple evidence annotations and then linking them to one or more hypotheses. Afterwards, they continue to collect more evidence. At the bottom we see that user19 and user17 showed no such distinction, i.e. these users used a phase-free approach. They create one evidence annotation and link it immediately to a hypothesis. Afterwards they create the next evidence annotation and link this one. In the middle, we see a transition from users with a phased approach towards a phase-free approach.

Figure 2: The number of evidence annotations (+), evidence/hypothesis links (x) for validation, and hypotheses (•) over time. The users are ordered by the number of times they changed between the Document and Hypotheses/Evidence view.

The users' approach towards which hypotheses they validated also fell into two categories. Figure 3 shows each hypothesis as a layer whose thickness represents how much evidence is linked to it. About half of the users validated multiple hypotheses at the same time, or concurrently (figure 3 left), whereas the other half validated the hypotheses sequentially, creating links to evidence for one hypothesis at a time and never returning to it (figure 3 right). We found no connection between validating hypotheses sequentially and using discrete phases for evidence collection and hypothesis validation; e.g. users of a phased approach also worked concurrently on multiple hypotheses.

Figure 3: About half of the users validated multiple hypotheses at the same time (left), while the others validated only one hypothesis at a time, not coming back to it afterwards (right).

Most users (11 of 16) reported that they collected evidence first and formulated their hypotheses later. However, while almost all users did start with the evidence collection task, many of them formulated hypotheses very early in the task and linked evidence to them at a later time, resembling a mixed approach. Only one user reported having used a mixed approach of collecting evidence and defining hypotheses.

The behaviour of the users shows a great variety in how they develop and validate hypotheses. We also found that our current user interface does not support the phase-free approach very well. A user following the phase-free approach has to switch from the Document view to the Hypothesis/Evidence view and back to link the just created evidence annotation to a hypothesis and collect the next piece of evidence.

4.3 Agreement of the users on evidence
We followed two approaches to understand how well the users agreed on the evidence: (1) how well do the users agree on evidence for similar hypotheses, and (2) how similar are the hypotheses whose evidence shows a substantial agreement?

We calculated the agreement of a pair of hypotheses (h1, h2) by first creating two copies of the un-annotated documents, one for each hypothesis, and second, annotating in the first copy only the sentences that were annotated as evidence and linked to h1 and in the second copy the ones that were linked to h2. We then calculated Cohen's κ on these sentential annotations of the two copies.

To understand the agreement on similar hypotheses, we asked a historian to select closely related pairs of hypotheses. The agreement is visible in table 1 at the top.

In our second approach, we calculated the agreement of all pairs of hypotheses from different users. This left us with 6050 hypothesis pairs. The bottom of table 1 shows all hypothesis pairs from different users that show a substantial agreement of κ > 0.6.

Our results show that users who formulate similar hypotheses do not agree on the evidence, and that the same evidence can be used to validate vastly different hypotheses. This means that to maximise its usefulness, e.g. by avoiding suggesting uninteresting pieces of evidence, an ED method must adapt to the user.
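As a minimal sketch of the pairwise agreement computation described above (assuming scikit-learn; the sentence identifiers and example sets are illustrative):

    from sklearn.metrics import cohen_kappa_score

    def hypothesis_pair_kappa(all_sentence_ids, evidence_of_h1, evidence_of_h2):
        """Cohen's kappa between the sentence-level annotations induced by two
        hypotheses: a sentence is labelled 1 if it is linked to the hypothesis
        as evidence, and 0 otherwise."""
        labels_h1 = [1 if s in evidence_of_h1 else 0 for s in all_sentence_ids]
        labels_h2 = [1 if s in evidence_of_h2 else 0 for s in all_sentence_ids]
        return cohen_kappa_score(labels_h1, labels_h2)

    # Illustrative call: sentences are identified by (document, index) pairs and
    # evidence_of_h* are the sets of sentences linked to each hypothesis.
    kappa = hypothesis_pair_kappa(
        all_sentence_ids=[("speech_01", i) for i in range(50)],
        evidence_of_h1={("speech_01", 3), ("speech_01", 17)},
        evidence_of_h2={("speech_01", 3), ("speech_01", 40)},
    )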
Table 1: Agreement on evidence of similar hypotheses (top) and all hypothesis pairs with a substantial agreement (bottom).

Hypothesis 1 | Hypothesis 2 | Cohen's κ

Closely related pairs selected by a historian (top):
International security arrangements in the nuclear sector are necessary | Nuclear power and security: further expansion of domestic and foreign policy | 0.116
Nuclear-phaseout is not possible due to the profit motive of corporations | Profit maximisation of the economy | 0.067
Chernobyl as a reminder for the nuclear-phaseout | Chernobyl and Fukushima repeatedly related | 0.057
Nuclear phase-out should not be slowed down by individual companies | Is money and the economy put on the safety of each one? | -0.007

Pairs from different users with substantial agreement, κ > 0.6 (bottom):
Does the nuclear industry have too much power? | Criticism of Fukushima | 1.000
Security of nuclear reactors must be guaranteed | The following security measurements | 0.748
Does the nuclear industry have too much power? | If the information policy comes from one actor, there is a high probability that not all information will reach the public | 0.666
Criticism of Fukushima | If the information policy comes from one actor, there is a high probability that not all information will reach the public | 0.666

4.4 Statistics on the evidence and hypotheses created by the users
In the user study, we collected 827 evidence annotations, 114 unique hypotheses (two users formulated two identical hypotheses), and 516 links between evidence annotations and hypotheses³. Table 2 breaks the collected data down for each user into the number of sentences in the documents the user opened and the numbers of evidence annotations, hypotheses, and links between hypotheses and evidence.

³ We plan to publish EDoHa and the data together with a more detailed evaluation of ED methods. Until then, the data is available upon request.
Table 2: The number of evidence annotations, hypotheses, and links between them varied greatly between users.

User | Sentences | Evidence | Hypotheses | Links
user0 | 364 | 205 | 13 | 259
user1 | 321 | 21 | 4 | 12
user2 | 403 | 79 | 3 | 0
user3 | 479 | 85 | 6 | 0
user4 | 479 | 27 | 6 | 27
user5 | 479 | 78 | 8 | 63
user7 | 479 | 74 | 7 | 70
user12 | 479 | 29 | 2 | 29
user13 | 403 | 38 | 6 | 30
user14 | 479 | 41 | 4 | 23
user15 | 479 | 38 | 9 | 32
user16 | 441 | 41 | 8 | 28
user17 | 321 | 77 | 16 | 61
user18 | 291 | 45 | 12 | 44
user19 | 479 | 44 | 11 | 56
user20 | 328 | 38 | 12 | 21

The variability of the collected data mirrors the differences in the users' behaviour. Some users created few annotations, whereas others created many. Equally variable is the number of hypotheses and links between hypotheses and evidence. For instance, user12 created only two hypotheses, one with 11 links and the other one with 18. User18, on the other hand, created 12 hypotheses and linked them with up to three evidence annotations. However, users who created many hypotheses did not always create fewer links between evidence and hypotheses than users who created few hypotheses, as user7 demonstrates with 7 hypotheses and an average of 10 evidence links. Users 2 and 3 did not create any links between evidence and hypotheses. Interactions with the participants during the study led us to believe that user2 did not understand the purpose of the study and treated it as a usability test in which the hypotheses and evidence could not be connected. User3 may have missed the linking part of the introduction into EDoHa and may therefore have been unaware of the Drag & Drop functionality.

5 EVIDENCE DETECTION EXPERIMENTS
We treated ED as a binary classification task, evidence vs. no evidence, on the sentence level and report the standard metrics (precision, recall, and F1-score). We are especially interested in the precision on the evidence class, because suggesting pieces of evidence that the user is not interested in means additional work for corrections, thereby reducing the acceptance of the system. When reporting the results on both classes, evidence and no evidence, we calculated the macro-averaged precision and recall and computed the macro F1-score from them.

We evaluated multiple baselines, models trained on the data of individual users, pre-trained models, and combinations of pre-trained models with filters that were derived from the user-created data.

Based on our previous finding that each user requires a unique ED model, we ran the experiments for each user separately. We conducted the experiments in a leave-one-document-out fashion, i.e. in each fold we used one document for testing and the others as training documents; we ignored documents the user did not open. When evaluating a non-deterministic model, e.g. neural networks or a random baseline, we repeated the experiment five times and averaged the results.

All hyperparameter optimisations were done on a development user. We chose user7 because this user annotated much evidence, created multiple hypotheses, and validated them well; methods that would not work for this user because they require more data would also not work for all the others.

5.1 Baselines and models trained on user-created data
As baseline methods, we chose a majority classifier and a random classifier that learns the distribution of the training labels and predicts randomly according to that distribution. Additionally, we trained a Multi-Layer Perceptron (MLP) with one hidden layer of size 10 and a Naive Bayes classifier. Both models rely on a bag of words as features. The MLP and Naive Bayes classifiers were implemented using scikit-learn⁴ and stopwords were removed based on NLTK⁵.

⁴ https://scikit-learn.org/stable/
⁵ https://www.nltk.org/

We also considered the links between evidence and hypotheses as training data. This classifier (link(h, s)) was trained to predict the link between hypotheses and evidence. The negative samples for training were random links between evidence and hypotheses, and the positive samples were the user-created links. We used an MLP with three hidden layers (100, 75, and 50 nodes) to predict the binary link between evidence and hypotheses. It used averaged German word embeddings that were trained on articles from the newspaper "Die Zeit" for a GermEval 2014 task on nested named entity recognition [9]. If this classifier detected a link between a sentence and a user-defined hypothesis, it considered the sentence a piece of evidence.
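To make the setup concrete, the following is a minimal sketch of the bag-of-words baselines and the leave-one-document-out protocol described above, assuming scikit-learn and NLTK as in footnotes 4 and 5. The exact preprocessing, feature extraction, and hyperparameters of our implementation may differ.

    from nltk.corpus import stopwords                    # requires nltk.download("stopwords")
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import precision_recall_fscore_support

    def evaluate_user(docs):
        """docs: dict mapping a document id to a list of (sentence, is_evidence)
        pairs for one user. Runs a leave-one-document-out evaluation of a
        bag-of-words MLP (swap in MultinomialNB() for the Naive Bayes baseline)."""
        scores = []
        for held_out in docs:
            train = [pair for d, pairs in docs.items() if d != held_out for pair in pairs]
            test = docs[held_out]
            model = make_pipeline(
                CountVectorizer(stop_words=stopwords.words("german")),   # bag of words
                MLPClassifier(hidden_layer_sizes=(10,), max_iter=500),   # one hidden layer of size 10
            )
            model.fit([s for s, _ in train], [y for _, y in train])
            predictions = model.predict([s for s, _ in test])
            scores.append(precision_recall_fscore_support(
                [y for _, y in test], predictions, average="macro", zero_division=0))
        return scores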
5.2 Pre-trained models for argument mining
As the AM model, we selected a bidirectional Long Short-Term Memory network that uses a candidate sentence and the cosine similarity between the candidate and the topic as input. We trained it on the sentential AM corpus created by Stab et al. [14], limiting the data to the topic of nuclear energy. To adapt the model to the German language, we translated the sentences into German using an external machine translation API⁶, similar to [3]. The model reached a macro F1-score of 0.714 in a binary in-topic classification task of argument vs. no argument. In our ED task, we treated sentences which the model classified as argumentative as evidence.

⁶ We chose the Google Translate API because of the quality of the translations.

5.3 User data augmented models
To investigate whether the user-created data can be used to augment a pre-trained model, we developed three approaches that combined the user-created data with the best performing pre-trained model. We used the following methods to reduce the number of false evidence suggestions by filtering the predictions of the pre-trained model (a sketch of how these filters combine follows the list):

+cos(h, s): Cosine similarity between hypothesis and predicted evidence < 0.7.
+ignore < 60s: A heuristic that ignores all predictions on files the user did not open for at least 60s, because a user may not spend much time reading documents that are deemed irrelevant.
+link(h, s): Prediction of a link between the evidence predicted by the pre-trained model and any hypothesis the user created.
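A minimal sketch of how such filters could be applied on top of the pre-trained model's predictions. The helper callables (reading_time, similarity over averaged embeddings, and link_predictor) stand in for the components described above and are assumptions, as is our reading of the 0.7 threshold as "keep a prediction only if it is similar enough to at least one hypothesis".

    def filter_predictions(predicted_evidence, hypotheses, reading_time,
                           similarity, link_predictor,
                           use_cos=False, use_reading_time=False, use_link=False):
        """predicted_evidence: list of (document_id, sentence) pairs the pre-trained
        AM model classified as argumentative. Returns the filtered predictions."""
        kept = []
        for doc_id, sentence in predicted_evidence:
            if use_reading_time and reading_time(doc_id) < 60:
                continue  # +ignore < 60s: skip documents the user barely opened
            if use_cos and not any(similarity(h, sentence) >= 0.7 for h in hypotheses):
                continue  # +cos(h, s): no hypothesis is similar enough to the sentence
            if use_link and not any(link_predictor(h, sentence) for h in hypotheses):
                continue  # +link(h, s): the link classifier sees no link to any hypothesis
            kept.append((doc_id, sentence))
        return kept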
Table 3: Results in the ED task were averaged across users with the standard deviation in parentheses. The bottom shows combinations of the best performing model with additional user-generated data. A † indicates a statistically significant difference to the random baseline and ‡ indicates a statistically significant difference to the AM model. Both significances are calculated across users using a Wilcoxon signed-rank test with Pratt's modification with zero rank splitting and a threshold of p < 0.05. The first three columns are macro-averaged over both classes (evidence and no evidence); the last three refer to the evidence class only.

Model | Macro F1 | Macro P | Macro R | Evidence F1 | Evidence P | Evidence R
Majority | †0.462 (0.032) | †0.433 (0.049) | †0.500 (0.000) | †0.051 (0.186) | †0.040 (0.145) | †0.071 (0.258)
Random | 0.491 (0.013) | 0.491 (0.012) | 0.491 (0.013) | 0.126 (0.132) | 0.127 (0.131) | 0.126 (0.133)
MLP | †0.526 (0.037) | †0.538 (0.060) | †0.516 (0.019) | †0.132 (0.138) | †0.213 (0.158) | †0.104 (0.129)
NaiveBayes | 0.506 (0.029) | 0.506 (0.024) | 0.507 (0.035) | 0.169 (0.123) | 0.151 (0.139) | 0.202 (0.119)
cos(h, s) | †0.506 (0.143) | †0.505 (0.144) | †0.508 (0.145) | †0.217 (0.169) | †0.152 (0.145) | †0.768 (0.328)
link(h, s) | 0.441 (0.136) | 0.445 (0.140) | 0.444 (0.141) | 0.123 (0.101) | 0.091 (0.075) | 0.300 (0.238)
AM | †0.574 (0.020) | †0.548 (0.024) | †0.604 (0.026) | †0.265 (0.101) | †0.208 (0.154) | †0.511 (0.059)
AM+cos(h, s) | ‡0.489 (0.139) | ‡0.476 (0.133) | ‡0.502 (0.146) | ‡0.206 (0.119) | ‡0.146 (0.134) | ‡0.516 (0.158)
AM+ignore < 60s | 0.585 (0.037) | 0.560 (0.041) | 0.613 (0.040) | 0.282 (0.113) | 0.230 (0.164) | 0.490 (0.077)
AM+link(h, s) | ‡0.450 (0.129) | ‡0.454 (0.129) | ‡0.446 (0.131) | ‡0.182 (0.129) | ‡0.124 (0.119) | ‡0.492 (0.154)

5.4 Results
Table 3 shows that the AM model performs best among the baselines. However, the differences between it and the Random baseline were generally not statistically significant.
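The significance statements and the † and ‡ markers in table 3 come from a paired test across users; a minimal sketch of such a comparison, assuming SciPy (the per-user scores below are illustrative, not the values behind table 3):

    from scipy.stats import wilcoxon

    # Per-user macro F1 scores of two models, one value per user (illustrative).
    am_scores     = [0.58, 0.55, 0.60, 0.57, 0.61, 0.54]
    random_scores = [0.49, 0.50, 0.48, 0.51, 0.49, 0.50]

    # Wilcoxon signed-rank test on the paired differences; zero_method="pratt"
    # keeps zero differences in the ranking (Pratt's treatment of zeros; SciPy
    # also offers "zsplit" for zero-rank splitting).
    statistic, p_value = wilcoxon(am_scores, random_scores, zero_method="pratt")
    significant = p_value < 0.05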
Among the models trained on the user data, the MLP outperformed all others with respect to precision on the evidence class, followed by the Naive Bayes and cos(h, s). The cos(h, s) model reached the overall highest recall on the evidence class, but did so at the cost of predicting many false positives.

The link(h, s) model performed unexpectedly low, which may be due to the nature of the training data. Because the training data consisted of positive links between evidence and hypotheses, the negative samples were drawn from existing evidence, just paired with a random hypothesis. The training data therefore did not contain any sentence that the user did not annotate as evidence, leading the model to predict greetings as evidence.

Contrary to our initial assumption, some users did create evidence annotations in documents that they opened for less than 60s. This resulted in the drop in recall between the AM and the AM+ignore < 60s model. Nevertheless, ignoring the files that the user did not open for more than 60s improved the performance significantly in every measure except the recall on the evidence class.

Overall, no method achieved sufficient results, meaning that their integration into an ED tool is not yet feasible. Especially the low precision on the evidence class would discourage any adoption. However, the performance of the pre-trained AM model is promising with regard to further training to adapt an ED model to individual users.

6 CONCLUSION
In this paper we have presented the first prototype of an evidence detection and hypothesis validation tool developed with humanities researchers as users in mind. We conducted a user study with bachelor students to understand how they develop and validate their hypotheses in history and found that users vary greatly when collecting evidence and validating hypotheses. We also found that even though all participants were given the same task, each of them created unique hypotheses and evidence annotations. Furthermore, similar hypotheses were not supported with the same evidence and the same evidence was used to support different hypotheses. Given that pre-training an ED model for each user is infeasible, we conclude that in ED for humanities researchers the model has to be trained interactively by the user. When applying a state-of-the-art AM model to the task of ED, we found that it performed better than the models trained on the users' data; we therefore conclude that a pre-trained AM model can serve as a starting point for adapting an ED model to individual users.

In the future, we intend to improve the ED methods, use a more realistic setup than leave-one-document-out, e.g. by predicting the annotations the user is going to make next, and collect evidence and hypotheses from users working with EDoHa for longer than one hour.

ACKNOWLEDGMENTS
This work has been supported by the German Research Foundation (DFG) as part of the Research Training Group KRITIS No. GRK 2222/1, by the German Federal Ministry of Education and Research (BMBF) under the promotional reference 03VP02540 (ArgumenText), and by the German Federal Ministry of Education and Research under the promotional reference 01UG1816B (CEDIFOR).

REFERENCES
[1] Aseel Addawood and Masooda Bashir. 2016. "What Is Your Evidence?" A Study of Controversial Topics on Social Media. In Proceedings of the Third Workshop on Argument Mining (ArgMining2016). Association for Computational Linguistics, Berlin, Germany, 1–11. https://doi.org/10.18653/v1/W16-2801
[2] Richard Eckart de Castilho, Eva Mujdricza-Maydt, Seid Muhie Yimam, Silvana Hartmann, Iryna Gurevych, Anette Frank, and Chris Biemann. 2016. A Web-Based Tool for the Integrated Annotation of Semantic and Syntactic Structures. In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH). 76–84.
[3] Steffen Eger, Johannes Daxenberger, Christian Stab, and Iryna Gurevych. 2018. Cross-Lingual Argumentation Mining: Machine Translation (and a Bit of Projection) Is All You Need!. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018). To appear.
[4] Stefan Feyer, Sophie Siebert, Bela Gipp, Akiko Aizawa, and Joeran Beel. 2017. Integration of the Scientific Recommender System Mr. DLib into the Reference Manager JabRef. In Advances in Information Retrieval (Lecture Notes in Computer Science). Springer, Cham, Aberdeen, Scotland UK, 770–774. https://doi.org/10.1007/978-3-319-56608-5_80
[5] Matthias Hagen, Anna Beyer, Tim Gollub, Kristof Komlossy, and Benno Stein. 2016. Supporting Scholarly Search with Keyqueries. In Advances in Information Retrieval (Lecture Notes in Computer Science). Springer, Cham, Padua, Italy, 507–520. https://doi.org/10.1007/978-3-319-30671-1_37
[6] Xinyu Hua and Lu Wang. 2017. Understanding and Detecting Supporting Arguments of Diverse Types. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Vancouver, Canada, 203–208.
[7] Mio Kobayashi, Ai Ishii, Chikara Hoshino, Hiroshi Miyashita, and Takuya Matsuzaki. 2017. Automated Historical Fact-Checking by Passage Retrieval, Word Statistics, and Virtual Question-Answering. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Asian Federation of Natural Language Processing, Taipei, Taiwan, 967–975.
[8] Ran Levy, Yonatan Bilu, Daniel Hershcovich, Ehud Aharoni, and Noam Slonim. 2014. Context Dependent Claim Detection. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. Dublin City University and Association for Computational Linguistics, Dublin, Ireland, 1489–1500.
[9] Nils Reimers, Judith Eckle-Kohler, Carsten Schnober, Jungi Kim, and Iryna Gurevych. 2014. GermEval-2014: Nested Named Entity Recognition with Neural Networks. In Workshop Proceedings of the 12th Edition of the KONVENS Conference, Gertrud Faaß and Josef Ruppenhofer (Eds.). Universitätsverlag Hildesheim, Hildesheim, Germany, 117–120.
[10] Ruty Rinott, Lena Dankin, Carlos Alzate Perez, Mitesh M. Khapra, Ehud Aharoni, and Noam Slonim. 2015. Show Me Your Evidence - an Automatic Method for Context Dependent Evidence Detection. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, 440–450.
[11] Carsten Schnober and Iryna Gurevych. 2015. Combining Topic Models for Corpus Exploration: Applying LDA for Complex Corpus Research Tasks in a Digital Humanities Project. In Proceedings of the 2015 Workshop on Topic Models: Post-Processing and Applications (TM '15). ACM, New York, NY, USA, 11–20. https://doi.org/10.1145/2809936.2809939
[12] Amin Sorkhei, Kalle Ilves, and Dorota Glowacka. 2017. Exploring Scientific Literature Search Through Topic Models. In Proceedings of the 2017 ACM Workshop on Exploratory Search and Interactive Data Analytics (ESIDA '17). ACM, Limassol, Cyprus, 65–68. https://doi.org/10.1145/3038462.3038464
[13] Christian Stab and Iryna Gurevych. 2017. Parsing Argumentation Structures in Persuasive Essays. Computational Linguistics 43, 3 (Sept. 2017), 619–659.
[14] Christian Stab, Tristan Miller, and Iryna Gurevych. 2018. Cross-Topic Argument Mining from Heterogeneous Sources Using Attention-Based Neural Networks. arXiv:1802.05758 [cs] (Feb. 2018).
[15] Andreas Vlachos and Sebastian Riedel. 2014. Fact Checking: Task Definition and Dataset Construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science. Association for Computational Linguistics, Baltimore, MD, USA, 18–22. https://doi.org/10.3115/v1/W14-2508