Pilot Experiments of Hypothesis Validation Through Evidence Detection for Historians

Chris Stahlhut†‡, Christian Stab†, Iryna Gurevych†
†Ubiquitous Knowledge Processing Lab
‡Research Training Group KRITIS
Darmstadt, Germany
www.ukp.tu-darmstadt.de

ABSTRACT
Historians spend a large amount of time in archives reading documents to pick out the small text quotes they can use as evidence. This is a time-consuming task that automated evidence detection promises to speed up significantly. However, no evidence detection method has been tested on a dataset that contains hypotheses and evidence created by humanities researchers. Furthermore, no research has yet been conducted to understand how historians approach this task of developing hypotheses and finding evidence. In this paper, we analyse the behaviour of 16 students of the humanities in developing and validating hypotheses and show that there is no canonical user: even when given the same exercise, they develop different hypotheses and annotate different text snippets as evidence, and current state-of-the-art argument mining methods are not suitable for historical validation of hypotheses. We therefore conclude that an evidence detection method must be trained interactively to adapt to the user's needs.

KEYWORDS
argument mining, evidence detection, hypothesis validation, information retrieval

1 INTRODUCTION
Research in the humanities involves searching for relevant information in huge text collections. Say, a historian analyses the political discourse after the Chernobyl and Fukushima catastrophes because he or she is working on a project about the economic development of the energy infrastructure in the second half of the 20th century. He or she will spend countless hours carefully studying protocols of political speeches and other documents, most of which do not contain any relevant information. While reading the transcript of a particular speech, the historian formulates the hypothesis "Extending the runtime of nuclear reactors is a monetary source of income"¹. This figurative historian then goes back to the texts he or she read previously to pick out the text snippets, or evidence, that lead him or her to formulate the hypothesis, for instance the statement "A depreciated nuclear reactor that runs one day longer, brings a profit of 1 million Euros." Afterwards, the historian continues to look for more evidence supporting or attacking the hypothesis in the vast number of documents of the political discourse and, if necessary, revises the hypothesis.

¹ All examples were formulated by participants of a user study in German and translated to English by us.

In this example, the historian can benefit greatly from document retrieval, as it would dramatically reduce the number of irrelevant documents to read. However, this only helps to create the bibliography of the source material, which can still be very long. Picking out the few pieces of evidence contained even within this reduced number of documents still takes a lot of time. This task of finding textual sources, or evidence, relevant to a given hypothesis or claim is researched under the name of Evidence Detection (ED).

While ED is extensively studied in the research field of Argument Mining (AM) [6, 10], all existing methods are trained once on a fixed set of training examples, and rarely does an approach focus on researchers in the humanities as users, let alone on how they develop and validate their hypotheses. Moreover, hypotheses might change over time, providing an additional challenge for static models.

In this paper, we present for the first time (1) an analysis of how scholars in the humanities develop and validate their hypotheses, (2) an analysis of the agreement on the evidence annotated by the scholars, and (3) the results of applying a state-of-the-art argument mining model for ED in the context of humanities research.
2 RELATED WORK
Existing approaches in ED focus on finding pieces of evidence to support a claim and on classifying their type, e.g. as statistics, expert opinions, or anecdotal evidence. This can be done to find evidence that supports a claim [6, 10] or to analyse the evidence used in online debates [1].

AM can be separated into two different approaches, namely discourse-level AM and information-seeking AM. The former detects arguments inside the document structure, e.g. in persuasive essays [13]. The latter detects arguments depending on a predefined context [8], e.g. in the case of ED, the hypothesis a piece of evidence is related to.

Fact checking [15] is a related field of growing interest in research. Its goal is to find factual evidence for or against testable statements, for instance on historical events in high school student tests [7]. Neither of these approaches allows for personalisation or focusses on researchers in the humanities as users.

One area of focus in information retrieval is supporting academic work, e.g. by finding related academic literature [5], discovering new literature [12], and recommending literature [4]. While supporting academics in finding documents, none of these approaches considers the evidence contained within the relevant documents. Existing approaches that work at a sub-document resolution are limited to supporting corpus exploration, for instance by showing how the relevance of topics changes over time [11].

Figure 1: Screenshot of EDoHa. The Hypotheses/Evidences view allows a user to define hypotheses and link previously annotated evidence to them.

3 EDOHA
We developed EDoHa (Evidence Detection fOr Hypothesis vAlidation) with the goal of enabling a user to validate their hypotheses with evidence they annotated in a collection of documents.
Figure 1 shows a screenshot of EDoHa in which a user has already defined several hypotheses and created multiple links between hypotheses and evidence. We based EDoHa on the annotation tool WebAnno [2] but developed a user interface which focusses more on casual than on expert users. It consists of the following components:

(1) The Hypothesis/Evidence view allows the user to define and revise hypotheses. In the screenshot, it shows three hypotheses next to each other and the evidence annotations linked to them. The hypothesis is the header and each evidence annotation linked to it is one multi-row cell beneath. Clicking the ⊗ next to an evidence annotation deletes the link between the evidence annotation and the hypothesis; clicking the ⊗ next to the hypothesis deletes the hypothesis.

(2) A list with all evidence annotations, each of which the user can link to one or more hypotheses via Drag & Drop. To avoid showing so much evidence that the user has to search through it, the list of evidence annotations limits its elements to the ones from the currently selected document. When the user selects another document, the evidence annotations are replaced with the ones from the newly selected document.

(3) A list of available documents in which users can annotate the evidence. The currently selected document is shown in green to signify its selection. If the user wishes to see the evidence annotations from all documents, the button "Clear Selection" at the top right corner of the document list unselects the current document so that the list of evidence annotations is no longer limited to a single document. The visible hypotheses and their linked evidence are unaffected by this change.

(4) A Document view in which a user can select the evidence in the source documents (not visible in the screenshot).

Interviews with historians during development showed that they need to see from which document a particular piece of evidence originates. We therefore added a highlighting mechanism to the list of available documents and the Hypotheses/Evidence view. If a user hovers the cursor over an evidence annotation, as illustrated at the bottom of the screenshot, the source document of the evidence and all hypotheses this piece of evidence is linked to are highlighted with a dashed frame. The currently selected document is highlighted with a green frame.

4 USER STUDY
To understand how researchers in the humanities develop and validate their hypotheses and how well they agree on the evidence, we conducted a user study with students of the humanities.

4.1 Setup
We conducted the user study in the context of a historical seminar on environmental catastrophes in the second half of the 20th century. The participants of this seminar were students of history, political science, or sociology in the second or third year of their bachelor studies. The seminar covered different historic events, such as the Chernobyl meltdown, and topics of modern history, such as Waldsterben. The study took place one week after a student's presentation on the Chernobyl meltdown.

The students were asked to compare the argumentation on nuclear energy after the Chernobyl meltdown with the argumentation after the Fukushima catastrophe. We prepared 9 political speeches from the German parliament with an overall length of 479 sentences, 4 after the Chernobyl meltdown and 5 after the Fukushima catastrophe, for the students to analyse, formulate hypotheses, and validate them. The students were able to read all speeches one week beforehand to familiarise themselves with the texts. However, we did not disclose the task of the exercise to them.

Before letting the students work on the task, we gave a short introduction into the usage of EDoHa. Afterwards, we handed out the exercise and answered all questions the students had regarding it.² The students had one hour for the exercise, followed by filling out a questionnaire about their approach to evidence detection and hypothesis validation, whether or not they would like to use EDoHa in their studies, and how to improve it. The session ended with a discussion of the students' findings.

² The exercise sheet also contained the login credentials of previously created accounts (user0 – user20) and cannot be traced back to individual students. It is our understanding of the regulations at our institution that an ethics approval is only required when processing personally identifiable information. Being aware of the delicate nature of such data, we decided to not collect any personally identifiable information and designed the study to be anonymised as described above.

During the experiment, we logged multiple interactions of the users with the system to understand how they develop and validate hypotheses. These interactions are: clicking on a document in the list of available documents, creating and deleting evidence annotations in documents, creating and deleting evidence/hypothesis links, creating and updating hypotheses (reformulating or deleting hypotheses), and changing the view in the interface.
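The exact log format is not part of this paper; as a minimal sketch, the logged interactions could be captured in a record such as the following. All field names and the example values are our own illustrative assumptions, not EDoHa's actual schema.

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    # Hypothetical event record for the interactions listed above; field names
    # and types are assumptions for illustration, not EDoHa's real log schema.
    @dataclass
    class InteractionEvent:
        timestamp: datetime          # when the interaction happened
        user_id: str                 # pseudonymous account, e.g. "user7"
        action: str                  # e.g. "select_document", "create_evidence",
                                     # "delete_evidence", "create_link", "delete_link",
                                     # "create_hypothesis", "update_hypothesis", "change_view"
        document_id: Optional[str] = None    # affected document, if any
        evidence_id: Optional[str] = None    # affected evidence annotation, if any
        hypothesis_id: Optional[str] = None  # affected hypothesis, if any

    # Example: a user links an existing evidence annotation to a hypothesis.
    event = InteractionEvent(datetime.now(), "user7", "create_link",
                             document_id="speech_03",
                             evidence_id="ev_42", hypothesis_id="hyp_2")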
4.2 User behaviour
We used the previously described logs to understand the users' general approach to developing and validating hypotheses. Figure 2 shows the variability in how users annotate evidence and link it to hypotheses. The upper six plots show a strong separation into distinct phases of evidence collection and hypothesis validation, or a phased approach. The first two users never reach the hypothesis validation phase, but the following four always start by collecting multiple evidence annotations and then linking them to one or more hypotheses. Afterwards, they continue to collect more evidence. At the bottom we see that user19 and user17 showed no such distinction, i.e. these users used a phase-free approach. They create one evidence annotation and link it immediately to a hypothesis. Afterwards they create the next evidence annotation and link this one. In the middle, we see a transition from users with a phased approach towards a phase-free approach.

Figure 2: The number of evidence annotations (+), evidence/hypothesis links (x) for validation, and hypotheses (•) over time. The users are ordered by the number of times they changed between the Document and Hypotheses/Evidence view.

The users' approach towards which hypotheses they validated also fell into two categories. Figure 3 shows each hypothesis as a layer whose thickness represents how much evidence is linked to it. About half of the users validated multiple hypotheses at the same time, or concurrently (figure 3 left), whereas the other half validated the hypotheses sequentially, creating links to evidence for one hypothesis at a time and never returning to it (figure 3 right). We found no connection between validating hypotheses sequentially and using discrete phases for evidence collection and hypothesis validation; e.g. users of a phased approach also worked concurrently on multiple hypotheses.

Figure 3: About half of the users validated multiple hypotheses at the same time (left), while the others validated only one hypothesis at a time, not coming back to it afterwards (right).

Most users (11 of 16) reported that they collected evidence first and formulated their hypotheses later. However, while almost all users did start with the evidence collection task, many of them formulated hypotheses very early in the task and linked evidence to them at a later time, resembling a mixed approach. Only one user reported having used a mixed approach of collecting evidence and defining hypotheses.

The behaviour of the users shows a great variety in how they develop and validate hypotheses. We also found that our current user interface does not support the phase-free approach very well. A user following the phase-free approach has to switch from the Document view to the Hypothesis/Evidence view and back to link the just created evidence annotation to a hypothesis and collect the next piece of evidence.

4.3 Agreement of the users on evidence
We followed two approaches to understand how well the users agreed on the evidence: (1) how well do the users agree on evidence for similar hypotheses, and (2) how similar are the hypotheses whose evidence shows a substantial agreement?

We calculated the agreement of a pair of hypotheses (h1, h2) by first creating two copies of the un-annotated documents, one for each hypothesis, and second, annotating in the first copy only the sentences that were annotated as evidence and linked to h1 and in the second copy the ones that were linked to h2. We then calculated Cohen's κ on these sentential annotations of the two copies.

To understand the agreement on similar hypotheses, we asked a historian to select closely related pairs of hypotheses. The agreement is visible in table 1 at the top.

In our second approach, we calculated the agreement of all pairs of hypotheses from different users. This left us with 6050 hypothesis pairs. The bottom of table 1 shows all hypothesis pairs from different users that show a substantial agreement of κ > 0.6.

Our results show that users who formulate similar hypotheses do not agree on the evidence, and that the same evidence can be used to validate vastly different hypotheses. This means that to maximise its usefulness, e.g. by avoiding suggesting uninteresting pieces of evidence, an ED method must adapt to the user.
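As a minimal sketch of the pairwise agreement computation described above (assuming scikit-learn; the sentence identifiers and example sets are illustrative):

    from sklearn.metrics import cohen_kappa_score

    def hypothesis_pair_kappa(all_sentence_ids, evidence_of_h1, evidence_of_h2):
        """Cohen's kappa between the sentence-level annotations induced by two
        hypotheses: a sentence is labelled 1 if it is linked to the hypothesis
        as evidence, and 0 otherwise."""
        labels_h1 = [1 if s in evidence_of_h1 else 0 for s in all_sentence_ids]
        labels_h2 = [1 if s in evidence_of_h2 else 0 for s in all_sentence_ids]
        return cohen_kappa_score(labels_h1, labels_h2)

    # Illustrative call: sentences are identified by (document, index) pairs and
    # evidence_of_h* are the sets of sentences linked to each hypothesis.
    kappa = hypothesis_pair_kappa(
        all_sentence_ids=[("speech_01", i) for i in range(50)],
        evidence_of_h1={("speech_01", 3), ("speech_01", 17)},
        evidence_of_h2={("speech_01", 3), ("speech_01", 40)},
    )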
Table 1: Agreement on evidence of similar hypotheses (top) and all hypothesis pairs with a substantial agreement (bottom).

Hypothesis 1 | Hypothesis 2 | Cohen's κ

Closely related pairs selected by a historian (top):
International security arrangements in the nuclear sector are necessary | Nuclear power and security: further expansion of domestic and foreign policy | 0.116
Nuclear-phaseout is not possible due to the profit motive of corporations | Profit maximisation of the economy | 0.067
Chernobyl as a reminder for the nuclear-phaseout | Chernobyl and Fukushima repeatedly related | 0.057
Nuclear phase-out should not be slowed down by individual companies | Is money and the economy put on the safety of each one? | -0.007

Pairs from different users with substantial agreement, κ > 0.6 (bottom):
Does the nuclear industry have too much power? | Criticism of Fukushima | 1.000
Security of nuclear reactors must be guaranteed | The following security measurements | 0.748
Does the nuclear industry have too much power? | If the information policy comes from one actor, there is a high probability that not all information will reach the public | 0.666
Criticism of Fukushima | If the information policy comes from one actor, there is a high probability that not all information will reach the public | 0.666

4.4 Statistics on the evidence and hypotheses created by the users
In the user study, we collected 827 evidence annotations, 114 unique hypotheses (two users formulated two identical hypotheses), and 516 links between evidence annotations and hypotheses³. Table 2 breaks the collected data down for each user into the number of sentences in the documents the user opened and the numbers of evidence annotations, hypotheses, and links between hypotheses and evidence.

³ We plan to publish EDoHa and the data together with a more detailed evaluation of ED methods. Until then, the data is available upon request.
Table 2: The number of evidence annotations, hypotheses, and links between them varied greatly between users.

User | Sentences | Evidence | Hypotheses | Links
user0 | 364 | 205 | 13 | 259
user1 | 321 | 21 | 4 | 12
user2 | 403 | 79 | 3 | 0
user3 | 479 | 85 | 6 | 0
user4 | 479 | 27 | 6 | 27
user5 | 479 | 78 | 8 | 63
user7 | 479 | 74 | 7 | 70
user12 | 479 | 29 | 2 | 29
user13 | 403 | 38 | 6 | 30
user14 | 479 | 41 | 4 | 23
user15 | 479 | 38 | 9 | 32
user16 | 441 | 41 | 8 | 28
user17 | 321 | 77 | 16 | 61
user18 | 291 | 45 | 12 | 44
user19 | 479 | 44 | 11 | 56
user20 | 328 | 38 | 12 | 21

The variability of the collected data mirrors the differences in the users' behaviour. Some users created few annotations, whereas others created many. Equally variable is the number of hypotheses and links between hypotheses and evidence. For instance, user12 created only two hypotheses, one with 11 links and the other one with 18. User18, on the other hand, created 12 hypotheses and linked them with up to three evidence annotations. However, users who created many hypotheses did not always create fewer links between evidence and hypotheses than users who created few hypotheses, as user7 demonstrates with 7 hypotheses and an average of 10 evidence links. Users 2 and 3 did not create any links between evidence and hypotheses. Interactions with the participants during the study led us to believe that user2 did not understand the purpose of the study and treated it as a usability test in which the hypotheses and evidence could not be connected. User3 may have missed the linking part of the introduction into EDoHa and may therefore have been unaware of the Drag & Drop functionality.

5 EVIDENCE DETECTION EXPERIMENTS
We treated ED as a binary classification task, evidence vs. no evidence, on the sentence level and report the standard metrics (precision, recall, and F1-score). We are especially interested in the precision on the evidence class, because suggesting pieces of evidence that the user is not interested in means additional work for corrections, thereby reducing the acceptance of the system. When reporting the results on both classes, evidence and no evidence, we calculated the macro-averaged precision and recall and computed the macro F1-score from them.

We evaluated multiple baselines, models trained on the data of individual users, pre-trained models, and combinations of pre-trained models with filters that were derived from the user-created data.

Based on our previous finding that each user requires a unique ED model, we ran the experiments for each user separately. We conducted the experiments in a leave-one-document-out fashion, i.e. in each fold we used one document for testing and the others as training documents; we ignored documents the user did not open. When evaluating a non-deterministic model, e.g. neural networks or a random baseline, we repeated the experiment five times and averaged the results.

All hyperparameter optimisations were done on a development user. We chose user7 because this user annotated much evidence, created multiple hypotheses, and validated them well; methods that would not work for this user because they require more data would also not work for all the others.

5.1 Baselines and models trained on user-created data
As baseline methods, we chose a majority classifier and a random classifier that learns the distribution of the training labels and predicts randomly according to that distribution. Additionally, we trained a Multi-Layer Perceptron (MLP) with one hidden layer of size 10 and a Naive Bayes classifier. Both models rely on a bag of words as features. The MLP and Naive Bayes classifiers were implemented using scikit-learn⁴ and stopwords were removed based on NLTK⁵.

⁴ https://scikit-learn.org/stable/
⁵ https://www.nltk.org/

We also considered the links between evidence and hypotheses as training data. This classifier (link(h, s)) was trained to predict the link between hypotheses and evidence. The negative samples for training were random links between evidence and hypotheses, and the positive samples were the user-created links. We used an MLP with three hidden layers (100, 75, and 50 nodes) to predict the binary link between evidence and hypotheses. It used averaged German word embeddings that were trained on articles from the newspaper "Die Zeit" for a GermEval 2014 task on nested named entity recognition [9]. If this classifier detected a link between a sentence and a user-defined hypothesis, it considered the sentence a piece of evidence.
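To make the setup concrete, the following is a minimal sketch of the bag-of-words baselines and the leave-one-document-out protocol described above, assuming scikit-learn and NLTK as in footnotes 4 and 5. The exact preprocessing, feature extraction, and hyperparameters of our implementation may differ.

    from nltk.corpus import stopwords                    # requires nltk.download("stopwords")
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import precision_recall_fscore_support

    def evaluate_user(docs):
        """docs: dict mapping a document id to a list of (sentence, is_evidence)
        pairs for one user. Runs a leave-one-document-out evaluation of a
        bag-of-words MLP (swap in MultinomialNB() for the Naive Bayes baseline)."""
        scores = []
        for held_out in docs:
            train = [pair for d, pairs in docs.items() if d != held_out for pair in pairs]
            test = docs[held_out]
            model = make_pipeline(
                CountVectorizer(stop_words=stopwords.words("german")),   # bag of words
                MLPClassifier(hidden_layer_sizes=(10,), max_iter=500),   # one hidden layer of size 10
            )
            model.fit([s for s, _ in train], [y for _, y in train])
            predictions = model.predict([s for s, _ in test])
            scores.append(precision_recall_fscore_support(
                [y for _, y in test], predictions, average="macro", zero_division=0))
        return scores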
5.2 Pre-trained models for argument mining
As the AM model, we selected a bidirectional Long Short-Term Memory network that uses a candidate sentence and the cosine similarity between the candidate and the topic as input. We trained it on the sentential AM corpus created by Stab et al. [14], limiting the data to the topic of nuclear energy. To adapt the model to the German language, we translated the sentences into German using an external machine translation API⁶, similar to [3]. The model reached a macro F1-score of 0.714 in a binary in-topic classification task of argument vs. no argument. In our ED task, we treated sentences which the model classified as argumentative as evidence.

⁶ We chose the Google Translate API because of the quality of the translations.

5.3 User data augmented models
To investigate whether the user-created data can be used to augment a pre-trained model, we developed three approaches that combined the user-created data with the best performing pre-trained model. We used the following methods to reduce the number of false evidence suggestions by filtering the predictions of the pre-trained model (a sketch of how these filters combine follows the list):

+cos(h, s): Cosine similarity between hypothesis and predicted evidence < 0.7.
+ignore < 60s: A heuristic that ignores all predictions on files the user did not open for at least 60s, because a user may not spend much time reading documents that are deemed irrelevant.
+link(h, s): Prediction of a link between the evidence predicted by the pre-trained model and any hypothesis the user created.
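A minimal sketch of how such filters could be applied on top of the pre-trained model's predictions. The helper callables (reading_time, similarity over averaged embeddings, and link_predictor) stand in for the components described above and are assumptions, as is our reading of the 0.7 threshold as "keep a prediction only if it is similar enough to at least one hypothesis".

    def filter_predictions(predicted_evidence, hypotheses, reading_time,
                           similarity, link_predictor,
                           use_cos=False, use_reading_time=False, use_link=False):
        """predicted_evidence: list of (document_id, sentence) pairs the pre-trained
        AM model classified as argumentative. Returns the filtered predictions."""
        kept = []
        for doc_id, sentence in predicted_evidence:
            if use_reading_time and reading_time(doc_id) < 60:
                continue  # +ignore < 60s: skip documents the user barely opened
            if use_cos and not any(similarity(h, sentence) >= 0.7 for h in hypotheses):
                continue  # +cos(h, s): no hypothesis is similar enough to the sentence
            if use_link and not any(link_predictor(h, sentence) for h in hypotheses):
                continue  # +link(h, s): the link classifier sees no link to any hypothesis
            kept.append((doc_id, sentence))
        return kept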
Table 3: Results in the ED task were averaged across users with the standard deviation in parentheses. The bottom shows combinations of the best performing model with additional user-generated data. A † indicates a statistically significant difference to the random baseline and ‡ indicates a statistically significant difference to the AM model. Both significances are calculated across users using a Wilcoxon signed-rank test with Pratt's modification with zero rank splitting and a threshold of p < 0.05. The first three columns are macro-averaged over both classes (evidence and no evidence); the last three refer to the evidence class only.

Model | Macro F1 | Macro P | Macro R | Evidence F1 | Evidence P | Evidence R
Majority | †0.462 (0.032) | †0.433 (0.049) | †0.500 (0.000) | †0.051 (0.186) | †0.040 (0.145) | †0.071 (0.258)
Random | 0.491 (0.013) | 0.491 (0.012) | 0.491 (0.013) | 0.126 (0.132) | 0.127 (0.131) | 0.126 (0.133)
MLP | †0.526 (0.037) | †0.538 (0.060) | †0.516 (0.019) | †0.132 (0.138) | †0.213 (0.158) | †0.104 (0.129)
NaiveBayes | 0.506 (0.029) | 0.506 (0.024) | 0.507 (0.035) | 0.169 (0.123) | 0.151 (0.139) | 0.202 (0.119)
cos(h, s) | †0.506 (0.143) | †0.505 (0.144) | †0.508 (0.145) | †0.217 (0.169) | †0.152 (0.145) | †0.768 (0.328)
link(h, s) | 0.441 (0.136) | 0.445 (0.140) | 0.444 (0.141) | 0.123 (0.101) | 0.091 (0.075) | 0.300 (0.238)
AM | †0.574 (0.020) | †0.548 (0.024) | †0.604 (0.026) | †0.265 (0.101) | †0.208 (0.154) | †0.511 (0.059)
AM+cos(h, s) | ‡0.489 (0.139) | ‡0.476 (0.133) | ‡0.502 (0.146) | ‡0.206 (0.119) | ‡0.146 (0.134) | ‡0.516 (0.158)
AM+ignore < 60s | 0.585 (0.037) | 0.560 (0.041) | 0.613 (0.040) | 0.282 (0.113) | 0.230 (0.164) | 0.490 (0.077)
AM+link(h, s) | ‡0.450 (0.129) | ‡0.454 (0.129) | ‡0.446 (0.131) | ‡0.182 (0.129) | ‡0.124 (0.119) | ‡0.492 (0.154)

5.4 Results
Table 3 shows that the AM model performs best among the baselines. However, the differences between it and the Random baseline were generally not statistically significant.
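The significance statements and the † and ‡ markers in table 3 come from a paired test across users; a minimal sketch of such a comparison, assuming SciPy (the per-user scores below are illustrative, not the values behind table 3):

    from scipy.stats import wilcoxon

    # Per-user macro F1 scores of two models, one value per user (illustrative).
    am_scores     = [0.58, 0.55, 0.60, 0.57, 0.61, 0.54]
    random_scores = [0.49, 0.50, 0.48, 0.51, 0.49, 0.50]

    # Wilcoxon signed-rank test on the paired differences; zero_method="pratt"
    # keeps zero differences in the ranking (Pratt's treatment of zeros; SciPy
    # also offers "zsplit" for zero-rank splitting).
    statistic, p_value = wilcoxon(am_scores, random_scores, zero_method="pratt")
    significant = p_value < 0.05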
Among the models trained on the user data, the MLP outperformed all others with respect to precision on the evidence class, followed by the Naive Bayes and cos(h, s). The cos(h, s) model reached the overall highest recall on the evidence class, but did so at the cost of predicting many false positives.

The link(h, s) model performed unexpectedly low, which may be due to the nature of the training data. Because the training data consisted of positive links between evidence and hypotheses, the negative samples were drawn from existing evidence, just paired with a random hypothesis. The training data therefore did not contain any sentence that the user did not annotate as evidence, leading the model to predict greetings as evidence.

Contrary to our initial assumption, some users did create evidence annotations in documents that they opened for less than 60s. This resulted in the drop in recall between the AM and the AM+ignore < 60s model. Nevertheless, ignoring the files that the user did not open for more than 60s improved the performance significantly in every measure except the recall on the evidence class.

Overall, no method achieved sufficient results, meaning that their integration into an ED tool is not yet feasible. Especially the low precision on the evidence class would discourage any adoption. However, the performance of the pre-trained AM model is promising with regard to further training to adapt an ED model to individual users.

6 CONCLUSION
In this paper we have presented the first prototype of an evidence detection and hypothesis validation tool developed with humanities researchers as users in mind. We conducted a user study with bachelor students to understand how they develop and validate their hypotheses in history and found that users vary greatly when collecting evidence and validating hypotheses. We also found that even though all participants were given the same task, each of them created unique hypotheses and evidence annotations. Furthermore, similar hypotheses were not supported with the same evidence and the same evidence was used to support different hypotheses. Given that pre-training an ED model for each user is infeasible, we conclude that in ED for humanities researchers the model has to be trained interactively by the user. When applying a state-of-the-art AM model to the task of ED, we found that it performed better than the models trained on the users' data; we therefore conclude that a pre-trained AM model can serve as a starting point for adapting an ED model to individual users.

In the future, we intend to improve the ED methods, use a more realistic setup than leave-one-document-out, e.g. by predicting the annotations the user is going to make next, and collect evidence and hypotheses from users working with EDoHa for longer than one hour.

ACKNOWLEDGMENTS
This work has been supported by the German Research Foundation (DFG) as part of the Research Training Group KRITIS No. GRK 2222/1, by the German Federal Ministry of Education and Research (BMBF) under the promotional reference 03VP02540 (ArgumenText), and by the German Federal Ministry of Education and Research under the promotional reference 01UG1816B (CEDIFOR).

REFERENCES
[1] Aseel Addawood and Masooda Bashir. 2016. "What Is Your Evidence?" A Study of Controversial Topics on Social Media. In Proceedings of the Third Workshop on Argument Mining (ArgMining2016). Association for Computational Linguistics, Berlin, Germany, 1–11. https://doi.org/10.18653/v1/W16-2801
[2] Richard Eckart de Castilho, Eva Mujdricza-Maydt, Seid Muhie Yimam, Silvana Hartmann, Iryna Gurevych, Anette Frank, and Chris Biemann. 2016. A Web-Based Tool for the Integrated Annotation of Semantic and Syntactic Structures. In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH). 76–84.
[3] Steffen Eger, Johannes Daxenberger, Christian Stab, and Iryna Gurevych. 2018. Cross-Lingual Argumentation Mining: Machine Translation (and a Bit of Projection) Is All You Need!. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018). To appear.
[4] Stefan Feyer, Sophie Siebert, Bela Gipp, Akiko Aizawa, and Joeran Beel. 2017. Integration of the Scientific Recommender System Mr. DLib into the Reference Manager JabRef. In Advances in Information Retrieval (Lecture Notes in Computer Science). Springer, Cham, Aberdeen, Scotland UK, 770–774. https://doi.org/10.1007/978-3-319-56608-5_80
[5] Matthias Hagen, Anna Beyer, Tim Gollub, Kristof Komlossy, and Benno Stein. 2016. Supporting Scholarly Search with Keyqueries. In Advances in Information Retrieval (Lecture Notes in Computer Science). Springer, Cham, Padua, Italy, 507–520. https://doi.org/10.1007/978-3-319-30671-1_37
[6] Xinyu Hua and Lu Wang. 2017. Understanding and Detecting Supporting Arguments of Diverse Types. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Vancouver, Canada, 203–208.
[7] Mio Kobayashi, Ai Ishii, Chikara Hoshino, Hiroshi Miyashita, and Takuya Matsuzaki. 2017. Automated Historical Fact-Checking by Passage Retrieval, Word Statistics, and Virtual Question-Answering. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Asian Federation of Natural Language Processing, Taipei, Taiwan, 967–975.
[8] Ran Levy, Yonatan Bilu, Daniel Hershcovich, Ehud Aharoni, and Noam Slonim. 2014. Context Dependent Claim Detection. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. Dublin City University and Association for Computational Linguistics, Dublin, Ireland, 1489–1500.
[9] Nils Reimers, Judith Eckle-Kohler, Carsten Schnober, Jungi Kim, and Iryna Gurevych. 2014. GermEval-2014: Nested Named Entity Recognition with Neural Networks. In Workshop Proceedings of the 12th Edition of the KONVENS Conference, Gertrud Faaß and Josef Ruppenhofer (Eds.). Universitätsverlag Hildesheim, Hildesheim, Germany, 117–120.
[10] Ruty Rinott, Lena Dankin, Carlos Alzate Perez, Mitesh M. Khapra, Ehud Aharoni, and Noam Slonim. 2015. Show Me Your Evidence - an Automatic Method for Context Dependent Evidence Detection. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, 440–450.
[11] Carsten Schnober and Iryna Gurevych. 2015. Combining Topic Models for Corpus Exploration: Applying LDA for Complex Corpus Research Tasks in a Digital Humanities Project. In Proceedings of the 2015 Workshop on Topic Models: Post-Processing and Applications (TM '15). ACM, New York, NY, USA, 11–20. https://doi.org/10.1145/2809936.2809939
[12] Amin Sorkhei, Kalle Ilves, and Dorota Glowacka. 2017. Exploring Scientific Literature Search Through Topic Models. In Proceedings of the 2017 ACM Workshop on Exploratory Search and Interactive Data Analytics (ESIDA '17). ACM, Limassol, Cyprus, 65–68. https://doi.org/10.1145/3038462.3038464
[13] Christian Stab and Iryna Gurevych. 2017. Parsing Argumentation Structures in Persuasive Essays. Computational Linguistics 43, 3 (Sept. 2017), 619–659.
[14] Christian Stab, Tristan Miller, and Iryna Gurevych. 2018. Cross-Topic Argument Mining from Heterogeneous Sources Using Attention-Based Neural Networks. arXiv:1802.05758 [cs] (Feb. 2018).
[15] Andreas Vlachos and Sebastian Riedel. 2014. Fact Checking: Task Definition and Dataset Construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science. Association for Computational Linguistics, Baltimore, MD, USA, 18–22. https://doi.org/10.3115/v1/W14-2508