=Paper=
{{Paper
|id=Vol-2167/short6
|storemode=property
|title=Towards Interactive Summarization of Large Document Collections
|pdfUrl=https://ceur-ws.org/Vol-2167/short6.pdf
|volume=Vol-2167
|authors=Benjamin Hättasch
|dblpUrl=https://dblp.org/rec/conf/desires/Hattasch18
}}
==Towards Interactive Summarization of Large Document Collections==
Benjamin Hättasch
TU Darmstadt, Germany
benjamin.haettasch@cs.tu-darmstadt.de
ABSTRACT

We present a new system for custom summarization of large text corpora at interactive speed. Producing textual summaries is an important step towards understanding collections of topic-related documents and has many real-world applications in journalism, medicine, and other fields. Our system consists of a sampling component that ranks and selects sentences from a given corpus, and uses an integer linear program (ILP) to produce the summary. Both components are called multiple times to iteratively improve the quality of the summary. The human is brought into the loop to gather feedback in every iteration about which aspects of the intermediate summaries satisfy their individual information needs. That way, our system can provide a quality level similar to an ILP approach working on the full corpus, but with a constant runtime independent of the corpus size.

1 INTRODUCTION

Users like journalists or lawyers confronted with a large collection of unknown documents need to find the overall relation and event structure of those documents. An important step in this understanding process is to produce a concise textual summary that captures the information most relevant to a user's aims (e.g., the degree of detail or the covered topics). While many automatic text summarization approaches have been suggested, only a few produce different summaries targeted at the individual user. One of those is the system recently proposed by P.V.S. and Meyer [1]. A major limiting factor, however, is that their system does not scale to large corpora: the runtime of their approach, which uses an ILP solver at its core, grows exponentially with the number of sentences and may take hours for each iteration. This hinders the user from performing an adequate number of feedback rounds to reach a suitable level of quality and customization.

Therefore, in our work we build upon their system but introduce a ranking-based sampling component. This results in a constant computation time per iteration that depends on the sample size instead of the corpus size. With this new approximate summarization model, we can guarantee interactive speeds even for large text collections and thus keep the user engaged in the process. The original system consists of a web-based interface that allows the user to provide feedback and a backend that computes the summaries using an ILP. The user requests a summary and can then annotate the concepts of the summary, i.e., mark them as important or unimportant for their current goal. This process is repeated iteratively until the user is satisfied with the quality. (For a better overview of how the system works, we recommend watching this video: http://vimeo.com/257601765)

2 OVERVIEW

The main idea of this work is to enable the original system to achieve interactive response times on arbitrarily large document collections while retaining a similar quality of the resulting summary. A study [2] has shown that even small delays (more than 500 ms) significantly decrease a user's activity level, dataset coverage, and insight discovery rate; hence, one should aim for lower runtimes.

Instead of looking at the complete document collection in every iteration, our approach only considers a sample of the documents per iteration and thus trades some summary quality for performance. Two factors play a role in creating the sample: the first is the sample size (i.e., the number of sentences in the sample), which determines the runtime of the summarization method; the second is the sampling procedure, which determines which sentences become part of the sample.

For deciding on the sample size, we need to estimate the runtime for solving the ILP, which mainly depends on its complexity (in number of constraints). To do so, we devised a cost function that maps the number of constraints to an estimated runtime. We use this function to derive the maximum sample size k such that the runtime stays below a chosen interactivity threshold.

For deciding which sentences should be contained in the sample, we developed a novel heuristic called information density that is computed for each sentence. It ranks the sentences by the weight density of the concepts they contain, normalized by sentence length. We then select only the top-k sentences according to this heuristic. The intuition is that sentences with a higher information density (containing more concepts rated as important) are more relevant to the user. With this sampling strategy, we are already able to achieve a quality similar to that of the original system at a fraction of the runtime.

Our future work includes developing more advanced sampling strategies that further improve the quality and increase the amount of feedback on different concepts. One direction would be to devise stratified sampling strategies using additional importance measures for (groups of) sentences. Furthermore, in addition to the current oracle-based evaluation, which gives feedback according to reference summaries, we plan a user study.

This work has been supported by the German Research Foundation as part of the Research Training Group AIPHES under grant No. GRK 1994/1.

DESIRES 2018, August 2018, Bertinoro, Italy. © 2018 Copyright held by the author(s).

REFERENCES

[1] P. V. S. Avinesh and C. M. Meyer. Joint optimization of user-desired content in multi-document summaries by learning from user feedback. In ACL, pages 1353–1363. ACL, 2017.
[2] Z. Liu and J. Heer. The effects of interactive latency on exploratory visual analysis. IEEE Transactions on Visualization and Computer Graphics, 20:2122–2131, 2014.
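To illustrate the sample-size decision from Section 2: the sketch below assumes a hypothetical fitted cost function (the paper does not state its concrete form or coefficients) and a fixed number of ILP constraints per sampled sentence, both of which are illustrative placeholders. Given those, it searches for the largest sample size k whose estimated solve time stays below the interactivity threshold.

```python
# Illustrative sketch only: the cost-model coefficients and the
# constraints-per-sentence factor are hypothetical placeholders,
# not values from the paper.

def estimated_runtime_ms(num_constraints: int) -> float:
    """Hypothetical fitted cost function: #constraints -> runtime (ms)."""
    a, b = 1e-4, 0.05  # assumed coefficients of a quadratic fit
    return a * num_constraints ** 2 + b * num_constraints

def max_sample_size(threshold_ms: float = 500.0,
                    constraints_per_sentence: int = 23) -> int:
    """Largest sample size k whose estimated solve time fits the budget.

    Binary search is valid because the cost function is monotonically
    increasing in the number of constraints.
    """
    lo, hi = 0, 1_000_000
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if estimated_runtime_ms(mid * constraints_per_sentence) <= threshold_ms:
            lo = mid  # mid still fits the budget
        else:
            hi = mid - 1  # mid is too expensive
    return lo
```

The 500 ms default mirrors the interactivity threshold motivated by the latency study [2]; relaxing the budget monotonically increases the admissible sample size.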
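The information-density ranking from Section 2 can be sketched in a similarly simplified form. Here, concepts are approximated by lowercased tokens and their feedback weights by a plain dictionary; the paper's actual concept model and weight updates follow [1] and are more elaborate.

```python
# Simplified stand-in: concepts = lowercased tokens, weights = user feedback.

def information_density(sentence: str, concept_weights: dict) -> float:
    """Sum of weights of concepts in the sentence, normalized by its length."""
    tokens = [tok.strip(".,;:!?").lower() for tok in sentence.split()]
    if not tokens:
        return 0.0
    weight_sum = sum(concept_weights.get(tok, 0.0) for tok in tokens)
    return weight_sum / len(tokens)

def sample_sentences(sentences, concept_weights, k):
    """The sampling step: keep only the top-k sentences by density."""
    ranked = sorted(sentences,
                    key=lambda s: information_density(s, concept_weights),
                    reverse=True)
    return ranked[:k]
```

For example, with weights {"summarization": 1.0, "ilp": 0.8, "runtime": 0.5}, a short sentence mentioning both the ILP and summarization outranks a longer sentence containing no weighted concepts, which is exactly the intuition behind preferring dense sentences.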