Time-centric Exploration of Court Documents

Philip Hausner
Institute of Computer Science, Heidelberg University, Germany
hausner@stud.uni-heidelberg.de

Dennis Aumiller
Institute of Computer Science, Heidelberg University, Germany
aumiller@informatik.uni-heidelberg.de

Michael Gertz
Institute of Computer Science, Heidelberg University, Germany
gertz@informatik.uni-heidelberg.de

Abstract

Getting an overview of a complex phenomenon that is described in numerous documents poses a major challenge in many application domains, among which the legal domain is of particular societal interest. In this paper, we outline a framework that is based on constructing term co-occurrence networks from documents and that allows users to explore a collection of court documents in a time-centric fashion, thus providing insights into a case's chronology and the entities involved.

1 Introduction

Lawyers and judges often face complex court cases that comprise hundreds of documents covering charges, expert opinions, witness accounts, and the like. Prominent examples are the Enron scandal [Wik20b], the Panama Papers [Wik20d], the Cum-Ex-Files [Wik20a], and the National Socialist Underground (NSU) trial [Wik20c]. Even though many of the documents are available in electronic form (mostly as PDFs), getting an overview of the case in terms of applicable statutory violations, relevant statutes, people and organizations involved, as well as the temporal development of the case under consideration, is a crucial element in the daily investigative business of a jurist. While typical Natural Language Processing tasks such as Named Entity Recognition already provide valuable information when applied to court documents, organizing the extracted concepts to present a jurist with an overview and a starting point for further analyses and focused reading is still a challenge.
This is particularly problematic as there is no default order by which documents and texts can be arranged to provide for a comprehensive reading, a problem that forensic search and e-Discovery systems used in the legal domain also face.

In this paper, we outline a time-centric approach that aims to arrange key information from court documents using timelines in a flexible manner. The key idea is to construct weighted term/entity co-occurrence networks around temporal expressions detected in the texts. For the weighting, we introduce a TF-inverse timestamp frequency metric to score the relevance of temporal expressions, exploiting the natural time hierarchy (days, months, years). The constructed networks can be arranged along a timeline and allow for different exploration tasks, including the investigation of named entities at different points in time as well as temporally-centered zoom operations.

In the following section, we briefly outline related work. In Section 3, we detail the time-centric network model, followed by experimental results based on documents from the above-mentioned NSU trial in Section 4.

Copyright © by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: R. Campos, A. Jorge, A. Jatowt, S. Bhatia (eds.): Proceedings of the Text2Story'20 Workshop, Lisbon, Portugal, 14-April-2020, published at http://ceur-ws.org

2 Related Work

Temporal information is inherent in many documents and, due to its wide variety of applications, is an important research subject. In information retrieval, for example, it is crucial for temporal clustering of documents or temporal question answering [ASBYG11]. All these approaches depend on the accurate extraction and normalization of temporal expressions from textual data using state-of-the-art temporal taggers like HeidelTime, which is domain-sensitive and applicable to a wide variety of languages [SG13, SG15].
Furthermore, temporal information can be utilized for timeline summarization to give a compact overview of a topic. For example, Steen and Markert introduced an abstractive timeline summarization model that computes timelines in a completely unsupervised manner using multi-sentence compression [SM19]. However, timelines are not widely used for exploratory tasks, as can be seen, for example, in the survey of Campos et al. [CDJJ14]. Alonso et al. employed a timeline visualization for the exploration of search results [ABYG07], and Tuan et al. constructed timelines from Wikipedia articles and employed extracted contexts to summarize the events associated with an entity. Furthermore, Prytkova et al. introduced a graph model similar to the one employed in this paper, although they did not formalize their approach in the form of a timeline [PSW12].

In the legal domain, Knight et al. were among the first to consider temporal information [KMN98], and Lagos et al. discuss the value of timelines for legal case building [LSCO10]. Nonetheless, to the best of our knowledge, not much research on time-based data has been done in the legal domain yet.

Probably most similar to our work is the model of Spitz et al., who provide a weighted bipartite graph model that is partitioned into dates and other (non-date) terms, and that can be used for temporal analysis [SSBG15]. However, in their model only the relation between dates and other terms can be observed, while oftentimes the relation between terms around a timestamp is of relevance. The model proposed in this paper aims to capture this relation by introducing a separate graph for each point in time.

3 Time-centric Graph Representation

In this section, we establish a model that allows for the description of dates with the help of graphs by representing each date by its own network, employing node weights to express the importance of a term for a date.
Ultimately, these graphs are utilized to construct the timeline visualizing the contents of a given document collection.

3.1 Time-Centric Graph Model

Let $P$ be a collection of documents (or pages). Each document $p \in P$ consists of a set of sentences $s \in p$, and we denote the set of all sentences with $S = \bigcup_{p \in P} \{s \mid s \in p\}$. A sentence in this model is treated as a bag of words, and while two sentences may contain identical words, they are treated as separate in this model. Additionally, some words carry temporal information, which can be extracted as dates $d$. The set of all dates present in the data set is denoted as $D$, and two dates are considered equal if they describe the same date (e.g., a year or day). Furthermore, to account for the differences in the granularity of dates, we partition $D$ into $D = D_y \cup D_m \cup D_d$, where the indices denote years, months, and days, respectively. For this partitioning a hierarchy can be formulated, i.e., for each day there exists a month in which it is included, and the same relation holds between months and years.

3.1.1 Time-centric Co-occurrence Graph

Given a set of dates $D$, a time-centric co-occurrence graph is a weighted graph $G_d = (N_d, L_d)$ with nodes $N_d$ being the terms extracted in a window of $x$ sentences around timestamp $d \in D$, and links $L_d$ that represent the co-occurrences between terms in the same context around timestamp $d$. Since all terms in the context window around one instance of a timestamp $d$ co-occur in this model, each subgraph extracted around a specific occurrence of a timestamp has to be fully connected. We denote the set of all time-centric co-occurrence graphs of $D$ with $G_D = \{G_d \mid d \in D\}$.
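The construction defined above can be illustrated with a minimal sketch. It is not the authors' implementation: the input format (one list of terms and one list of normalized dates per sentence), the function name `build_graphs`, and the reading of "a window of x sentences" as x sentences before and after the sentence containing the date are all assumptions made for illustration, and the terms and the date in the toy input are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_graphs(sentences, x=1):
    """Build one co-occurrence graph per date (hypothetical helper).

    sentences: list of (terms, dates) tuples in document order.
    x:         assumed to mean "x sentences before and after" the
               sentence in which a date occurs.
    """
    graphs = defaultdict(lambda: {"nodes": Counter(), "edges": Counter()})
    for i, (_, dates) in enumerate(sentences):
        for d in dates:
            lo, hi = max(0, i - x), min(len(sentences), i + x + 1)
            # Distinct terms in the context window of this occurrence of d.
            context = sorted({t for terms, _ in sentences[lo:hi] for t in terms})
            g = graphs[d]  # all occurrences of d share a single graph
            g["nodes"].update(context)
            # Every pair of context terms co-occurs, so the subgraph
            # contributed by one occurrence is fully connected.
            g["edges"].update(combinations(context, 2))
    return dict(graphs)


# Toy input: four sentences, two of which mention the same (made-up) date.
sentences = [
    (["e1", "court"], []),
    (["e1", "e2"], ["2000-06-01"]),
    (["car"], []),
    (["e2", "court"], ["2000-06-01"]),
]
graphs = build_graphs(sentences, x=1)
g = graphs["2000-06-01"]
print(g["edges"][("car", "court")])  # pair seen in both context windows
```

Edges are stored as alphabetically ordered term pairs so that the same pair always maps to the same key; the counters hold raw counts that a weighting scheme can normalize afterwards.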
For each date $d \in D$, there exists only one graph representing the date; this means in particular that there is not a separate graph for each occurrence of $d$, but co-occurrences around different instances of $d$ are aggregated in the same graph $G_d$. Taking into account the partitioning hierarchy described above, every edge $e$ contained in the graph $G_{d_1}$ of a given day $d_1$ also exists in the graph $G_{m_1}$ of the month $m_1$ containing $d_1$; the same holds for months and years.

Figure 1: Timeline employing time-centric co-occurrence graphs for three different points in time (green, red, blue) on a time axis ranging from 2000 to 2010. Each timestamp has a graph assigned that is visualized in its corresponding box, indicated by matching colour. In each graph the central node represents the respective date; the remaining nodes are co-occurring terms.

We additionally define a function $\mathrm{sent}: L_d \to \mathcal{P}(S \times S)$ that assigns to each edge all pairs of sentences from which it was created, i.e., the pairs of (not necessarily distinct) sentences in which the two nodes co-occurred. The function sent enables exploration of the document collection by offering the user a way to inspect the relation between two terms as well as its origin, that is, their mutual co-occurrences in the document collection.

3.1.2 Node and Edge Weightings

Nodes as well as edges of a time-centric graph are assigned a weight. The weight of an edge is the number of times both terms co-occurred divided by the maximum number of co-occurrences of any edge in the graph.
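The edge weighting and the function sent can be sketched together. The following is an illustrative assumption rather than the paper's code: each edge stores the sentence pairs it was created from, and the number of stored pairs is taken as the co-occurrence count. The names `add_cooccurrence` and `edge_weights`, the integer sentence identifiers, and the terms are hypothetical.

```python
from collections import defaultdict


def add_cooccurrence(graph, term_a, term_b, sent_a, sent_b):
    """Record that term_a (in sentence sent_a) co-occurred with term_b (in sent_b)."""
    # Keep a canonical term order and swap the sentences along with the terms.
    if term_b < term_a:
        term_a, term_b, sent_a, sent_b = term_b, term_a, sent_b, sent_a
    graph[(term_a, term_b)].append((sent_a, sent_b))


def edge_weights(graph):
    """Divide each edge's co-occurrence count by the largest count in the graph."""
    if not graph:
        return {}
    max_count = max(len(pairs) for pairs in graph.values())
    return {edge: len(pairs) / max_count for edge, pairs in graph.items()}


# One graph G_d, represented as edge -> list of sentence pairs (the role of sent).
graph = defaultdict(list)
add_cooccurrence(graph, "e1", "e2", 3, 3)     # both terms in sentence 3
add_cooccurrence(graph, "e2", "e1", 4, 3)     # e2 in sentence 4, e1 in sentence 3
add_cooccurrence(graph, "e1", "court", 3, 2)  # e1 in sentence 3, court in sentence 2

weights = edge_weights(graph)
print(weights[("e1", "e2")], weights[("court", "e1")])
print(graph[("e1", "e2")])  # sentence pairs behind the edge
```

Storing the sentence pairs on the edge means the count and the provenance cannot drift apart, at the cost of more memory than a plain counter.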
Node weights are computed by an adaptation of the tf-idf weighting scheme that we call term frequency - inverse timestamp frequency (tf-itf), defined as

\[ \text{tf-itf}(n, d, D) := \mathrm{tf}(n, d) \cdot \mathrm{itf}(n, D), \tag{1} \]

where $\mathrm{tf}(n, d)$ is the number of times term $n$ occurs in the context window around timestamp $d \in D$, normalized by the total number of words occurring in the context windows, and itf is defined as

\[ \mathrm{itf}(n, D) = \log\left( \frac{M}{1 + |\{d \in D : \mathrm{tf}(n, d) > 0\}|} \right), \tag{2} \]

with $M$ being the number of unique timestamps in the document collection. A term has tf-itf rank $m$ with regard to a time-centric co-occurrence graph if it is the term with the $m$-th highest tf-itf.

3.2 Timeline Representations

The resulting time-centric co-occurrence networks can be arranged on an appropriate timeline as indicated in Figure 1. Such a timeline can be effectively utilized for a variety of exploration scenarios. In the following, we discuss two prominent use cases: entity-centric timelines and zooming operations. An entity in this context is a named entity, i.e., a person, a location, and the like.

3.2.1 Entity-centric Timelines

For entity-centric timelines, we do not take into account all timestamps for which a time-centric graph $G_d$ is constructed, but only those time-centric graphs that exhibit a desirable property associated with the presence of one or more entities. While such a property can be highly complex, for the data exploration tasks presented here it is sufficient to stick with one of the two following criteria:

1. One or multiple entities $E$ need to occur in the associated graph $G_d$, i.e., they are represented as nodes in the network.
2. One or more edges $e$ between certain entities have to exist in $G_d$, i.e., the entities have to be directly connected for the timestamp to be part of the timeline.

An example can be constructed with the help of Figure 1: Establishing a timeline using the first criterion and requiring entities $e_1$ and $e_2$ to be in the networks yields the left (green) and right (blue) graph as a result, while the middle (red) graph is discarded, since $e_2$ is not contained in it. Utilizing the second criterion and demanding that an edge exists between $e_1$ and $e_2$ yields only the left graph, since it is the only one in which both entities occur and are directly connected.

By utilizing entity-centric timelines, the focus is laid on certain entities, or on relationships between entities. The first criterion creates a timeline that includes only those points in time the entity co-occurred with; the second one creates a timeline that shows points in time where two (or more) entities occurred, and where they possibly interacted with each other. With the help of the function sent these interactions can be analyzed further, since it is possible to display all textual co-occurrences of two entities around a certain date in the document collection.

Figure 2: Properties of the NSU court trial. (a) Log-log plot of the occurrence distribution over entities, and (b) over timestamps. (c) Year occurrence distribution considering the years 1950 to 2050 with a logarithmic scaling on the y-axis.

3.2.2 Zooming

For zooming, the partitioning of the dates $D$ into different granularities is utilized. Since there exists a distinct hierarchy for these dates, time-centric graphs for timestamps of finer granularities are necessarily subgraphs of all time-centric graphs of coarser temporal resolution that contain them (if edge and node weightings are disregarded). For zooming, the user can start from an arbitrary network $G$, identifying relevant relationships and entities. Zooming can then be divided into the two scenarios of zooming in and zooming out: On the one hand, by employing a zooming out operation, the respective coarser network is displayed, which is a supergraph of $G$.
In this supergraph, the respective subgraph $G$ can be highlighted, but also a broader context can be explored by observing how certain relationships are embedded in the bigger picture. On the other hand, by zooming in, and given the same network $G$, the user can select a network of finer granularity that lies within the temporal interval of $G$. For example, given the graph associated with June 2000, the user can select one of the days from June 2000 for which a graph exists, investigating the origin of specific relations and identifying crucial parts of the documents by utilizing the function sent.

4 Experimental Results

In this section, we describe the data set used for evaluation and present results that demonstrate the usefulness of the approach.

4.1 Description of Data Set

For evaluation, we utilize a German document collection containing juridical protocols of the NSU (National Socialist Underground, or Nationalsozialistischer Untergrund in German) trial extracted from NSU Watch¹, which also gives an introduction to the case. The NSU data set comprises 387 documents with a total of 180,887 sentences and 974,892 words; each document represents one of the 437 trial days. Protocols for fifty trial days are missing because they are not available from NSU Watch. Most of the missing documents cover the last 100 days of the trial. Additionally, we preprocess the documents by removing stop words and out-of-vocabulary tokens.

¹ https://www.nsu-watch.info/2013/05/sitzungstermine/; accessed 3 January 2020. The data set used is a cleaned version of the extracted data, and all results presented here refer to this cleaned version.

Timestamp generation is done by temporal tagging employing HeidelTime [SG13], resulting in a total of 15,104 date instances. After the extraction of time-centric co-occurrence graphs around these individual dates, utilizing a window size of 4 sentences, our method yields a total of 1,072 networks: 859 with day granularity, 191 with month granularity, and 22 with year granularity. Further text processing, e.g., sentence splitting and named entity recognition, is done with the help of spaCy [HM], employing the de_core_news_md model. For evaluation purposes, we remove the node of the associated date from each graph, since its co-occurrence with all terms in the network is trivial.

Figure 2 depicts the occurrence distributions over entities (mostly persons and locations) and over timestamps in the data set, as well as a year occurrence distribution. The latter shows that most of the extracted timestamps lie between 1990 and 2020, which coincides with the period of time most relevant to the activities of the NSU and the trial. It can also be observed that a few dates lie in the future, which is mostly due to errors in HeidelTime's tagging.

Figure 3: Illustration of a typical exploration process. (a) Excerpt of the constructed timeline, showing different time granularities. By selecting one of the five most frequently occurring words in a year, all networks that contain that term are marked. (b) Clicking on April 6 shows the associated time-centric graph reduced to the 10 nodes with the highest tf-itf score, ignoring date nodes. The size of nodes and edges depends on the assigned weights. (c) Clicking on an edge allows the user to browse term co-occurrences in the vicinity of temporal tags (red highlights) in documents.

4.2 Timeline Exploration

The focus of this work lies on the exploration of document collections; hence, we present a typical scenario of how our model is applied. Figure 3 illustrates a timeline constructed using the data described above. By searching for certain keywords, one can highlight the time periods, or points in time, whose networks contain the keyword, and thus limit a search to the networks one is interested in.
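Selecting networks by keyword corresponds to the two criteria of Section 3.2.1. The sketch below is a minimal illustration under assumed data structures (node sets and sets of undirected edges per date); the function names and the three dates are hypothetical, and the toy graphs merely mirror the situation described for Figure 1.

```python
def dates_with_entities(graphs, entities):
    """Criterion 1: every requested entity is a node of the graph."""
    wanted = set(entities)
    return sorted(d for d, g in graphs.items() if wanted <= g["nodes"])


def dates_with_edge(graphs, a, b):
    """Criterion 2: the two entities are directly connected in the graph."""
    edge = frozenset((a, b))
    return sorted(d for d, g in graphs.items() if edge in g["edges"])


# Three toy graphs in the spirit of Figure 1 (the dates are made up).
graphs = {
    "2001": {"nodes": {"e1", "e2", "t1"},
             "edges": {frozenset(("e1", "e2")), frozenset(("e1", "t1"))}},
    "2005": {"nodes": {"e1", "t2"},
             "edges": {frozenset(("e1", "t2"))}},
    "2009": {"nodes": {"e1", "e2", "t3"},
             "edges": {frozenset(("e1", "t3")), frozenset(("e2", "t3"))}},
}

print(dates_with_entities(graphs, ["e1", "e2"]))  # graphs containing both entities
print(dates_with_edge(graphs, "e1", "e2"))        # graphs where they are adjacent
```

With these inputs the first criterion keeps the first and last graph, while the second keeps only the first, matching the example given for Figure 1.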
These networks can then be analyzed manually, potentially identifying other entities relevant to the topic under investigation, or finding relations associated with a certain date. These relations can be further examined utilizing the function sent, such that the co-occurrences of two terms in the data set are illustrated and highlighted. Note that the networks also serve as an index to the documents and sentences in which (co-occurring) dates and terms occur.

Table 1: The dates, victims and cities associated with the murders of the NSU as well as the total number of occurrences of the entity in the document collection. Only Yozgat is among the 100 most-occurring terms in the text. The tf-itf rank always refers to the rank of the term in the associated time-centric co-occurrence graph.

| Date | Victim | #Occs | tf-itf rank | City | #Occs | tf-itf rank |
| September 9, 2000 | Şimşek | 132 | 5 | Nuremberg | 239 | 6 |
| June 13, 2001 | Özüdoğru | 91 | 1 | Nuremberg | 239 | 2 |
| June 27, 2001 | Taşköprü | 79 | 1 | Hamburg | 88 | 8 |
| August 29, 2001 | Kılıç | 112 | 1 | Munich | 213 | 4 |
| February 25, 2004 | Turgut | 84 | 2 | Rostock | 81 | 1 |
| June 9, 2005 | Yaşar | 107 | 1 | Nuremberg | 239 | 2 |
| June 15, 2005 | Boulgarides | 82 | 1 | Munich | 213 | 2 |
| April 4, 2006 | Kubaşık | 165 | 1 | Dortmund | 218 | 2 |
| April 6, 2006 | Yozgat | 395 | 2 | Kassel | 291 | 3 |
| April 25, 2007 | Kiesewetter | 143 | 4 | Heilbronn | 149 | 1 |

Table 2: The three dates with the highest tf-itf values for the two cities (a) Kassel and (b) Nuremberg.

(a) Kassel
| Date | tf-itf value |
| April 6, 2006 | 0.000806 |
| March 18, 2006 | 0.000303 |
| April 4, 2006 | 0.000120 |

(b) Nuremberg
| Date | tf-itf value |
| June 13, 2001 | 0.000287 |
| June 9, 2005 | 0.000221 |
| September 9, 2000 | 0.000175 |

4.3 Day-centric Evaluation

For evaluation purposes, we investigate the results of the tf-itf ranking for certain key events of the NSU crimes. Table 1 gives an overview of the victims and places of the 10 murders committed by the NSU, also stating the number of occurrences and the respective tf-itf rank in the associated time-centric network.
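How such a rank follows from Equations (1) and (2) can be sketched as follows. This is an illustrative reading, not the authors' code: the natural logarithm is assumed, tf is assumed to be normalized by the total number of term occurrences in the context windows of the same date, and the function names and toy counts are hypothetical.

```python
import math


def tf_itf(counts):
    """counts: {date: {term: occurrences in the context windows of that date}}."""
    M = len(counts)  # number of unique timestamps
    df = {}          # number of timestamps with tf(n, d) > 0 for each term n
    for terms in counts.values():
        for term, c in terms.items():
            if c > 0:
                df[term] = df.get(term, 0) + 1
    scores = {}
    for d, terms in counts.items():
        total = sum(terms.values())  # assumed normalizer for tf
        scores[d] = {
            term: (c / total) * math.log(M / (1 + df[term]))
            for term, c in terms.items() if c > 0
        }
    return scores


def tf_itf_rank(scores_for_date, term):
    """1-based position of term when sorting by descending tf-itf."""
    ordered = sorted(scores_for_date, key=scores_for_date.get, reverse=True)
    return ordered.index(term) + 1


# Toy counts for four made-up timestamps d1..d4 and terms a, b, c.
counts = {
    "d1": {"a": 3, "b": 1},
    "d2": {"b": 2, "c": 2},
    "d3": {"b": 1, "c": 3},
    "d4": {"c": 4},
}
scores = tf_itf(counts)
print(scores["d1"]["a"], tf_itf_rank(scores["d1"], "a"))
```

Because of the 1 in the denominator of Equation (2), a term that occurs around every timestamp receives a slightly negative itf, and a term that occurs around all but one timestamp receives an itf of zero, as term b does for d1 here.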
It should be expected that such key persons and locations are well represented in the constructed time-centric graphs. And indeed, one can observe that for all murder dates the name of the victim has a tf-itf rank of 5 or better, and in seven of the ten cases it ranks first. While not as predominant as the names of the victims, the respective locations of the murders also rank very high with regard to their tf-itf scores, all of them being among the top eight terms. Hence, one can expect that the constructed time-centric co-occurrence graphs adequately represent the events discussed during the trial.

Table 2 shows the dates for which the two cities Kassel and Nuremberg have the highest tf-itf scores. Comparison with Table 1 indicates that the three major dates for Nuremberg are all associated with a murder on the respective day. For Kassel, the by far most prominent date is the date of the murder of Halit Yozgat. The two other dates are shortly before the incident, with March 18, 2006, being the day of a right-wing extremist concert discussed during the trial.

5 Conclusion and Ongoing Work

In this paper, we introduced time-centric co-occurrence networks and presented a framework based on these networks that enables users to explore document collections using a timeline. We also introduced two applications of the proposed model, entity-centric timelines and zooming operations. The method was then applied to a collection of court protocols of the NSU trial, and we demonstrated the usefulness of our approach by showing that persons and cities relevant to the trial are well represented in our model. As future work, we aim to refine the employed edge weighting technique, e.g., by taking into account the distance between two words when extracting co-occurrences, thereby extending the possibilities for more complex analyses of entity relationships.

References

[ABYG07] Omar Alonso, Ricardo Baeza-Yates, and Michael Gertz. Exploratory Search using Timelines. In SIGCHI 2007 Workshop on Exploratory Search and HCI Workshop, volume 1, pages 1–4, 2007.

[ASBYG11] Omar Alonso, Jannik Strötgen, Ricardo A Baeza-Yates, and Michael Gertz. Temporal Information Retrieval: Challenges and Opportunities. Temporal Web Analytics Workshop, 11:1–8, 2011.

[CDJJ14] Ricardo Campos, Gaël Dias, Alípio M Jorge, and Adam Jatowt. Survey of temporal information retrieval and related applications. ACM Computing Surveys (CSUR), 47(2):1–41, 2014.

[HM] Matthew Honnibal and Ines Montani. Spacy: Industrial-Strength Natural Language Processing, version 2.1.8, https://spacy.io/, accessed 17 March 2020.

[KMN98] B Knight, J Ma, and E Nissan. Representing Temporal Knowledge in Legal Discourse. Information and Communications Technology Law, 7(3):199–211, 1998.

[LSCO10] Nikolaos Lagos, Frederique Segond, Stefania Castellani, and Jacki O'Neill. Event Extraction for Legal Case Building and Reasoning. In International Conference on Intelligent Information Processing, pages 92–101. Springer, 2010.

[PSW12] Natalia Prytkova, Marc Spaniol, and Gerhard Weikum. Predicting the evolution of taxonomy restructuring in collective web catalogues. In WebDB, pages 49–54, 2012.

[SG13] Jannik Strötgen and Michael Gertz. Multilingual and Cross-domain Temporal Tagging. Language Resources and Evaluation, 47(2):269–298, 2013.

[SG15] Jannik Strötgen and Michael Gertz. A Baseline Temporal Tagger for all Languages. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 541–547, 2015.

[SM19] Julius Steen and Katja Markert. Abstractive Timeline Summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 21–31, 2019.

[SSBG15] Andreas Spitz, Jannik Strötgen, Thomas Bögel, and Michael Gertz. Terms in Time and Times in Context: A Graph-based Term-Time Ranking Model. In Proceedings of the 24th International Conference on World Wide Web Companion, WWW 2015, Companion Volume, pages 1375–1380, 2015.
[Wik20a] Wikipedia contributors. Cumex-files - Wikipedia, the free encyclopedia, 2020. [Online; accessed 26-January-2020].

[Wik20b] Wikipedia contributors. Enron scandal - Wikipedia, the free encyclopedia, 2020. [Online; accessed 26-January-2020].

[Wik20c] Wikipedia contributors. National socialist underground - Wikipedia, the free encyclopedia, 2020. [Online; accessed 26-January-2020].

[Wik20d] Wikipedia contributors. Panama papers - Wikipedia, the free encyclopedia, 2020. [Online; accessed 26-January-2020].