Towards Interactive Summarization of Large Document Collections

Benjamin Hättasch
TU Darmstadt, Germany
benjamin.haettasch@cs.tu-darmstadt.de

ABSTRACT
We present a new system for custom summarization of large text corpora at interactive speed. The task of producing textual summaries is an important step towards understanding collections of topic-related documents and has many real-world applications in journalism, medicine, and other fields. Our system consists of a sampling component that ranks and selects sentences from a given corpus and an integer linear program (ILP) that produces the summary. Both components are called multiple times to iteratively improve the quality of the summarization. The human is brought into the loop to gather feedback in every iteration about which aspects of the intermediate summaries satisfy their individual information needs. That way, our system can provide a quality level similar to an ILP approach working on the full corpus, but with a constant runtime independent of the corpus size.

1 INTRODUCTION
Users like journalists or lawyers confronted with a large collection of unknown documents need to find the overall relations and event structure of those documents. An important step in this understanding process is to produce a concise textual summary that captures the information most relevant to a user's aims (e.g., the degree of detail or the covered topics). While many automatic text summarization approaches have been suggested, only a few produce different summaries targeted at the individual user. One of those is the system recently proposed by P.V.S. and Meyer [1]. A major limiting factor, however, is that their system does not scale to large corpora: the runtime of their approach, which uses an ILP solver at its core, grows exponentially with the number of sentences and may take hours per iteration. This hinders the user from performing an adequate number of feedback rounds to reach a suitable level of quality and customization.

Therefore, in our work we build upon their system but introduce a ranking-based sampling component. This results in a constant computation time per iteration that depends on the sample size instead of the corpus size. With this new approximate summarization model, we can guarantee interactive speeds even for large text collections and keep the user engaged in the process. The original system consists of a web-based interface that allows the user to provide feedback and a backend that computes the summaries using an ILP.¹ The user requests a summary and can then annotate the concepts of the summary, i.e., mark them as important or unimportant for their current goal. This process is repeated iteratively until the user is satisfied with the quality.

¹ To get a better overview of how the system works, we recommend watching this video: http://vimeo.com/257601765

2 OVERVIEW
The main idea of this work is to enable the original system to achieve interactive response times on arbitrarily large document collections while retaining a similar quality of the resulting summary. A study [2] has shown that even small delays (more than 500 ms) significantly decrease a user's activity level, dataset coverage, and insight discovery rate; hence, one should aim for low runtimes.

Instead of looking at the complete document collection in every iteration, our approach only considers a sample of the documents per iteration and thus trades summary quality for performance. Two factors play an important role in creating the sample: the first is the sample size (i.e., the number of sentences in the sample), which determines the runtime of the summarization method; the second is the sampling procedure, which determines which sentences are part of the sample.

To decide on the sample size, we need to be able to estimate the runtime for solving the ILP, which mainly depends on its complexity (measured in the number of constraints). To do so, we devised a cost function that maps the number of constraints to an estimated runtime. We use this function to derive the maximum sample size k such that the runtime stays below a chosen interactivity threshold.

To decide which sentences should be contained in the sample, we developed a novel heuristic called information density that is computed for each sentence. It ranks the sentences by the weight density of the concepts they contain, normalized by sentence length. We then select only the top-k sentences based on this heuristic. The intuition is that sentences with a higher information density (containing more concepts rated as important) are more relevant to the user. With this sampling strategy, we are already able to achieve a quality similar to the original system at a fraction of the runtime.

Our future work includes developing more advanced sampling strategies that can further improve the quality and increase the amount of feedback on different concepts. One direction would be to devise stratified sampling strategies using additional importance measures for (groups of) sentences. Furthermore, in addition to the current oracle-based evaluation approach, which gives feedback according to reference summaries, we plan a user study.

REFERENCES
[1] P. V. S. Avinesh and C. M. Meyer. Joint optimization of user-desired content in multi-document summaries by learning from user feedback. In ACL, pages 1353–1363. ACL, 2017.
[2] Z. Liu and J. Heer. The effects of interactive latency on exploratory visual analysis. IEEE Transactions on Visualization and Computer Graphics, 20:2122–2131, 2014.
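To make the sampling step in Section 2 concrete, the following Python sketch shows one way the pipeline could fit together: derive the maximum sample size k from a cost function and an interactivity threshold, rank sentences by information density (concept weight mass normalized by sentence length), and keep the top-k. The paper does not give the exact cost function or concept representation, so the linear cost model, the dictionary-based sentence format, and the toy data below are assumptions for illustration only.

```python
# Illustrative sketch of the sampling component; helper names, the linear
# cost model, and the toy corpus are assumptions, not the paper's exact code.

def information_density(concepts, length, concept_weights):
    """Weight mass of a sentence's concepts, normalized by sentence length."""
    if length == 0:
        return 0.0
    return sum(concept_weights.get(c, 0.0) for c in concepts) / length

def max_sample_size(cost_fn, threshold_ms, upper=100_000):
    """Largest sample size k whose estimated ILP runtime stays below the threshold."""
    k = 0
    while k < upper and cost_fn(k + 1) < threshold_ms:
        k += 1
    return k

def sample_sentences(sentences, concept_weights, cost_fn, threshold_ms=500.0):
    """Rank all sentences by information density and keep only the top-k."""
    k = max_sample_size(cost_fn, threshold_ms)
    ranked = sorted(
        sentences,
        key=lambda s: information_density(
            s["concepts"], len(s["text"].split()), concept_weights
        ),
        reverse=True,
    )
    return ranked[:k]

# Assumed linear cost model: estimated runtime (ms) as a function of the
# number of ILP constraints, taken here to be proportional to sample size k.
linear_cost = lambda k: 0.8 * k + 20.0

corpus = [
    {"text": "The election budget was approved.",
     "concepts": ["election", "budget"]},
    {"text": "It rained briefly.", "concepts": ["weather"]},
    {"text": "Polls suggest the election budget dominates.",
     "concepts": ["election", "poll", "budget"]},
]
# Concept weights as they might look after a round of user feedback.
weights = {"election": 1.0, "budget": 0.6, "poll": 0.4}
sample = sample_sentences(corpus, weights, linear_cost)
```

With the 500 ms threshold from the latency study [2], the assumed cost model permits several hundred sentences per iteration, so the toy corpus survives intact; on a real corpus, only the k densest sentences would be handed to the ILP.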
DESIRES 2018, August 2018, Bertinoro, Italy
© 2018 Copyright held by the author(s).
This work has been supported by the German Research Foundation as part of the Research Training Group AIPHES under grant No. GRK 1994/1.