Towards Interactive Summarization of Large Document Collections

Benjamin Hättasch
TU Darmstadt, Germany
benjamin.haettasch@cs.tu-darmstadt.de

ABSTRACT
We present a new system for custom summarization of large text corpora at interactive speed. The task of producing textual summaries is an important step towards understanding collections of topic-related documents and has many real-world applications in journalism, medicine, and other fields. Our system consists of a sampling component that ranks and selects sentences from a given corpus and an integer linear program (ILP) that produces the summary. Both components are called multiple times to iteratively improve the quality of the summarization. The human is brought into the loop to gather feedback in every iteration about which aspects of the intermediate summaries satisfy their individual information needs. That way, our system can provide a quality level similar to an ILP approach working on the full corpus, but with a constant runtime independent of the corpus size.

1 INTRODUCTION
Users like journalists or lawyers confronted with a large collection of unknown documents need to find the overall relations and event structure of those documents. An important step in this understanding process is to produce a concise textual summary that captures the information most relevant to a user's aims (e.g., the degree of detail or the covered topics). While many automatic text summarization approaches have been suggested, only a few produce different summaries targeted at the individual user. One of those is the system recently proposed by P.V.S. and Meyer [1]. A major limiting factor, however, is that their system does not scale to large corpora: the runtime of their approach, which uses an ILP solver at its core, grows exponentially with the number of sentences and may take hours per iteration. This hinders the user from performing an adequate number of feedback rounds to reach a suitable level of quality and customization.

Therefore, in our work we build upon their system but introduce a ranking-based sampling component. This results in a constant computation time per iteration that depends on the sample size instead of the corpus size. With this new approximate summarization model, we can guarantee interactive speeds even for large text collections and keep the user engaged in the process. The original system consists of a web-based interface that allows the user to provide feedback and a backend that computes the summaries using an ILP.¹ The user requests a summary and can then annotate the concepts of the summary, i.e., mark them as important or unimportant for their current goal. This process is repeated iteratively until the user is satisfied with the quality.

¹ To get a better overview of how the system works, we recommend watching this video: http://vimeo.com/257601765

2 OVERVIEW
The main idea of this work is to enable the original system to achieve interactive response times on arbitrarily large document collections while retaining a similar quality of the resulting summary. A study [2] has shown that even small delays (more than 500 ms) significantly decrease a user's activity level, dataset coverage, and insight discovery rate; hence, one should aim for low runtimes.

Instead of looking at the complete document collection in every iteration, our approach only considers a sample of the documents per iteration and thus trades summary quality for performance. Two factors play an important role in creating the sample: the first is the sample size (i.e., the number of sentences in the sample), which determines the runtime of the summarization method; the second is the sampling procedure, which determines which sentences are part of the sample.

To decide on the sample size, we need to be able to estimate the runtime for solving the ILP, which mainly depends on its complexity (measured in the number of constraints). To do so, we devised a cost function that maps the number of constraints to an estimated runtime. We use this function to derive the maximum sample size k such that the runtime stays below a chosen interactivity threshold.

To decide which sentences should be contained in the sample, we developed a novel heuristic called information density that is computed for each sentence. It ranks the sentences by the weight density of the concepts they contain, normalized by sentence length. We then select only the top-k sentences based on this heuristic. The intuition is that sentences with a higher information density (containing more concepts rated as important) are more relevant to the user. With this sampling strategy, we are already able to achieve a quality similar to the original system at a fraction of the runtime.

Our future work includes developing more advanced sampling strategies that can further improve the quality and increase the amount of feedback on different concepts. One direction would be to devise stratified sampling strategies using additional importance measures for (groups of) sentences. Furthermore, in addition to the current oracle-based evaluation approach, which gives feedback according to reference summaries, we plan a user study.

REFERENCES
[1] P. V. S. Avinesh and C. M. Meyer. Joint optimization of user-desired content in multi-document summaries by learning from user feedback. In ACL, pages 1353–1363. ACL, 2017.
[2] Z. Liu and J. Heer. The effects of interactive latency on exploratory visual analysis. IEEE Transactions on Visualization and Computer Graphics, 20:2122–2131, 2014.
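To make the sampling step in Section 2 concrete, the following Python sketch shows one way the pipeline could fit together: derive the maximum sample size k from a cost function and an interactivity threshold, rank sentences by information density (concept weight mass normalized by sentence length), and keep the top-k. The paper does not give the exact cost function or concept representation, so the linear cost model, the dictionary-based sentence format, and the toy data below are assumptions for illustration only.

```python
# Illustrative sketch of the sampling component; helper names, the linear
# cost model, and the toy corpus are assumptions, not the paper's exact code.

def information_density(concepts, length, concept_weights):
    """Weight mass of a sentence's concepts, normalized by sentence length."""
    if length == 0:
        return 0.0
    return sum(concept_weights.get(c, 0.0) for c in concepts) / length

def max_sample_size(cost_fn, threshold_ms, upper=100_000):
    """Largest sample size k whose estimated ILP runtime stays below the threshold."""
    k = 0
    while k < upper and cost_fn(k + 1) < threshold_ms:
        k += 1
    return k

def sample_sentences(sentences, concept_weights, cost_fn, threshold_ms=500.0):
    """Rank all sentences by information density and keep only the top-k."""
    k = max_sample_size(cost_fn, threshold_ms)
    ranked = sorted(
        sentences,
        key=lambda s: information_density(
            s["concepts"], len(s["text"].split()), concept_weights
        ),
        reverse=True,
    )
    return ranked[:k]

# Assumed linear cost model: estimated runtime (ms) as a function of the
# number of ILP constraints, taken here to be proportional to sample size k.
linear_cost = lambda k: 0.8 * k + 20.0

corpus = [
    {"text": "The election budget was approved.",
     "concepts": ["election", "budget"]},
    {"text": "It rained briefly.", "concepts": ["weather"]},
    {"text": "Polls suggest the election budget dominates.",
     "concepts": ["election", "poll", "budget"]},
]
# Concept weights as they might look after a round of user feedback.
weights = {"election": 1.0, "budget": 0.6, "poll": 0.4}
sample = sample_sentences(corpus, weights, linear_cost)
```

With the 500 ms threshold from the latency study [2], the assumed cost model permits several hundred sentences per iteration, so the toy corpus survives intact; on a real corpus, only the k densest sentences would be handed to the ILP.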
DESIRES 2018, August 2018, Bertinoro, Italy
© 2018 Copyright held by the author(s).
This work has been supported by the German Research Foundation as part of the Research Training Group AIPHES under grant No. GRK 1994/1.