Discourse Processing for Text Analysis:
         Recent successes, current challenges

                                 Bonnie Webber

    School of Informatics, University of Edinburgh, UK bonnie.webber@ed.ac.uk


      Abstract. Computational discourse processing has come a long way in
      the 10 years since I spoke at ACL’2009 on Discourse: Early problems,
      current successes, future challenges. Much of this progress can be at-
      tributed to the vast amounts of textual data that have become available
      and to a concomitant weakening of theoretical commitments, so as to be
      able to use the data in information extraction, sentiment analysis, question
      answering, etc. Along with weakened commitments to the demands of
      particular theories has come a greater willingness to consider what
      can be learned from textual data and from various forms of annotation,
      in English and in other languages as well.
      This paper briefly summarizes (1) changing assumptions about discourse
      structure; (2) recent work on lexico-syntactic grounding of low-level dis-
      course structure and frameworks for higher-level discourse structure that
      recognize differences in genre; and (3) suggestions for addressing some of
      the challenges still facing us. For more detail, the reader is encouraged
      to go to the references themselves.

      Keywords: discourse processing · discourse structure · discourse rela-
      tions.


1    Introduction

Discourse poses many challenges to text processing systems, beyond those posed
by isolated clauses. Firstly, since one can refer to anything mentioned in previ-
ous clauses, even things mentioned only implicitly, resolving referring expressions
becomes a challenge. Secondly, since there is information embodied in relations
that hold between clauses or sentences or larger spans of text — relations that
may be signalled explicitly in the discourse or left to inference — they need to
be detected, so that the information in the relation can be extracted. Thirdly,
since the reason for some piece of text being included in a discourse may re-
flect communicative goals that are specific to a particular genre, such goals also
need to be modelled, recognized and applied to whatever information has been
extracted.
    My concern here is with discourse structure and discourse relations — what
was assumed prior to 2009, how that has changed in the intervening years, and
where we are now. A reader who would like a general introduction to discourse
processing is referred to Stede’s 2012 monograph [30]. A reader interested in
discourse structure and its use in language technology prior to 2012 is referred
to [35]. Finally, for examples illustrating points made in this brief paper, the
reader is referred to the slides of this keynote available on the BIRNDL 2019
website.


2     Early assumptions about discourse and discourse
      processing
Two early computational assumptions about discourse were (1) that it has a
simple computational structure, specifiable in the form of a regular expression
or context-free grammar (CFG), and (2) that that structure covered the entire
text. This could be seen in McKeown’s schemas [17], in work on topic segmenta-
tion of texts [3, 6, 11, 12, 15], in the tree-structured analyses that followed from
Rhetorical Structure Theory (RST) [16] or from seeing a task as comprising a
sequence of sub-tasks and a text describing how to carry it out as being simi-
larly composed of a sequence of sub-texts [7]. In the resulting tree structure, the
left-to-right order of the children of a non-terminal node would correspond to
the temporal ordering of sub-tasks and immediate dominance between a parent
and its children would correspond to sub-task inclusion. The text itself would
correspond to simple top-down, L-R tree traversal.
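    The correspondence between task structure and text can be sketched in a
few lines of Python. This is only an illustration of the general idea, not a
reconstruction of any cited system: the TaskNode class, the linearize function
and the tea-making task are all hypothetical names invented here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskNode:
    """A task (or sub-task) paired with the sentence that describes it."""
    sentence: str
    subtasks: List["TaskNode"] = field(default_factory=list)


def linearize(node: TaskNode) -> List[str]:
    """Top-down, left-to-right traversal: the parent's sentence comes first,
    followed by the sentences of its children in their stored order."""
    sentences = [node.sentence]
    for child in node.subtasks:
        sentences.extend(linearize(child))
    return sentences


# Hypothetical task tree, invented for illustration.
recipe = TaskNode("Make tea.", [
    TaskNode("Boil the water.", [
        TaskNode("Fill the kettle."),
        TaskNode("Switch it on."),
    ]),
    TaskNode("Pour the water over the tea leaves."),
])

text = " ".join(linearize(recipe))
```

Here child order stands in for the temporal order of sub-tasks and nesting
stands in for sub-task inclusion, so the text falls out of the traversal.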
    These two assumptions were held so widely that any work on text structure
that didn’t conform to them was ignored. This was true of work by Sibun [28],
which modelled spoken descriptions of complex structures such as house layouts
as a linear traversal of a complex graph, which required both marking which
nodes had already been visited (since they could be reached in multiple ways)
and consulting a decision process when more than one node could be visited
next. Although Sibun’s view of text as a structure that systematically reflected
the structure of the world was no different from Dale’s [7], Sibun’s work was ignored
as not conforming to the view of discourse structure as a tree.
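
    The two requirements mentioned above (marking visited nodes and consulting
a decision process when several nodes are available) can be illustrated with a
small graph traversal. This is a sketch under stated assumptions, not Sibun's
algorithm: the describe function, the choose callback, the backing-up strategy
and the house layout are all hypothetical.

```python
from typing import Callable, Dict, List, Set

Graph = Dict[str, List[str]]


def describe(graph: Graph, start: str,
             choose: Callable[[str, List[str]], str]) -> List[str]:
    """Return the order in which nodes are mentioned in the description."""
    visited: Set[str] = {start}      # nodes can be reached in multiple ways
    order = [start]
    path = [start]                   # nodes we may still continue from
    while path:
        current = path[-1]
        candidates = [n for n in graph[current] if n not in visited]
        if not candidates:
            path.pop()               # nothing new here: back up (an assumption)
            continue
        # Consult the decision process only when there is a real choice.
        nxt = candidates[0] if len(candidates) == 1 else choose(current, candidates)
        visited.add(nxt)
        order.append(nxt)
        path.append(nxt)
    return order


def alphabetical(current: str, candidates: List[str]) -> str:
    """Placeholder decision process: pick the alphabetically first candidate."""
    return min(candidates)


# Hypothetical house layout; the living room is reachable from two rooms.
house: Graph = {
    "hall": ["kitchen", "living room"],
    "kitchen": ["hall", "living room"],
    "living room": ["hall", "kitchen", "bedroom"],
    "bedroom": ["living room"],
}

order = describe(house, "hall", alphabetical)
# order == ["hall", "kitchen", "living room", "bedroom"]
```

The output is a linear sequence even though the underlying structure is a
graph with cycles, which is the point of contrast with the tree-based view.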


3     Issues at play since then
Subsequent to this early work, computational researchers began to acknowledge
(1) that text structure varies with genre such that, for example, persuasive texts
differ in structure from instructions, which both differ in structure from descrip-
tive texts; and (2) that simplicity in text structure, whatever the genre, is just
a useful simplification that will be violated when needed or else completely dis-
carded. Instead, researchers have accepted different kinds of discourse structure
and, adopting a more empirical perspective, have turned to looking at what pro-
vides evidence for discourse structure – in particular, lexico-syntactic evidence.

3.1    Multiple kinds of discourse structure
The earliest claim to the existence of multiple kinds of discourse structure was
made by Grosz and Sidner [10], who posited a linguistic structure (signalled
by discourse cues), an intentional structure (modelling how the purpose of one
segment contributed to that of another), and an attentional structure (in the
form of a stack, reflecting its origin in tree structures for discourse).
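    The link between a stack and a tree-structured discourse can be made
concrete with a much simplified sketch. It assumes that segments open and close
in properly nested fashion and that the stack holds only segment names; the
event format, the segment names DS0 to DS2 and the function name are
hypothetical and are not taken from [10].

```python
from typing import List, Tuple

# Each event is ("open", segment_name) or ("close", segment_name).
Event = Tuple[str, str]


def focus_stack_trace(events: List[Event]) -> List[List[str]]:
    """Return a snapshot of the stack after each open/close event."""
    stack: List[str] = []
    trace: List[List[str]] = []
    for kind, segment in events:
        if kind == "open":
            stack.append(segment)
        elif kind == "close":
            # Only the segment currently on top may be closed.
            if not stack or stack[-1] != segment:
                raise ValueError(f"segment {segment!r} is not on top of the stack")
            stack.pop()
        else:
            raise ValueError(f"unknown event kind {kind!r}")
        trace.append(list(stack))
    return trace


# A segment DS0 with two embedded sub-segments, DS1 followed by DS2.
events: List[Event] = [
    ("open", "DS0"), ("open", "DS1"), ("close", "DS1"),
    ("open", "DS2"), ("close", "DS2"), ("close", "DS0"),
]
trace = focus_stack_trace(events)
```

Reading the snapshots in order retraces a top-down, left-to-right walk of the
segment tree, which is the sense in which a stack reflects a tree.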
    While Grosz and Sidner were primarily focussed on accounting for and mod-
elling intentional structure, Moore and Pollack [19] wanted to break out of the
requirement in RST [16] of only one sense relation holding between any two
(adjacent) discourse segments. This often forced annotators to choose either a
relation between the information conveyed in consecutive elements of a coherent
discourse (informational relations) or a relation reflecting the aim of discourse to
effect changes in the mental state of the discourse participants, through a textual
plan whose consecutive elements relate in terms of their roles in the plan (inten-
tional relations). Instead, Moore and Pollack proposed one discourse structure
that reflected informational relations between the elements, and a separate one
that reflected intentional relations between those same elements. Importantly
they pointed out that these structures may not be isomorphic, even though they
cover the same text. This could mean that a text segment that was prominent
in one structure could be less so in the other.
    Genre is clearly at play in the structures proposed for discourse. While the
texts considered by Moore and Pollack [19] were persuasive texts, Knott and his
colleagues [14] aimed to generate descriptions of objects in museum displays that
were appropriate in the context of other objects that had already been described
to the visitor and other objects in the same display. Although subscribing to
RST [16], Knott and his colleagues had to face the problem that the texts they
were modelling violated RST’s assumption that the spans linked by a discourse
relation had to be adjacent, or if interrupted by another span, had to be linked
to the initial span by a relation of the same type.
    However, after noticing that all violations of this assumption involved RST’s
object-attribute elaboration relation (where one segment presents an object,
and the next presents one of its attributes), Knott et al. [14] proposed a
hybrid structure for discourse, taking it to be structured as a sequence of RST
trees (minus elaboration), supplemented by an entity-based model of focus
structure. That is, these descriptive texts were structured as a sequence of RST
trees, each of whose top nodes focussed on some entity that had been mentioned
previously and was then further described in the rest of its tree-structured
segment.
    More recently, researchers concerned with argumentation such as Stede and
his colleagues [31], Stab and Gurevych [29] and others have been exploring how
argumentation structure can be grounded in an RST-based coherence structure.
This is currently a very active area of research, so links between other forms of
discourse and dialogue structures are being explored as well.


3.2   Empirical bases for discourse structure

In the early 90s, researchers were exploring the idea that sentence-level syntactic
structure was projected from structures associated with lexical items. This was
true of both lexicalized Tree-Adjoining Grammar [37] and Combinatory Cat-
egorial Grammar [33]. This encouraged researchers to ask whether the same
could hold of discourse and to build corpora based on the notion that low-level
discourse structure was signalled either by explicit lexico-syntactic phrases or
constructions or by adjacency that would lead readers to infer a relation
between the adjacent units [1, 2, 20–22, 36, 39–43]. This in turn led to researchers
developing lexicons of discourse connectives such as [8, 18, 26, 32] and even a re-
annotation of the RST Corpus to identify the likely linguistic signals for the
annotated relations [9].
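    As a rough illustration of how a connective lexicon might be used to spot
explicit signals in text, the sketch below matches a toy lexicon against a
sentence, preferring longer connectives over shorter ones. The lexicon entries,
the simplified sense labels and the find_connectives function are assumptions
made for this example; they are not drawn from any of the lexicons cited above.

```python
import re
from typing import Dict, List, Tuple

# Toy lexicon invented for illustration; sense labels are simplified.
LEXICON: Dict[str, str] = {
    "because": "Cause",
    "although": "Concession",
    "but": "Contrast",
    "but then": "Concession",
}


def find_connectives(text: str,
                     lexicon: Dict[str, str] = LEXICON) -> List[Tuple[int, str, str]]:
    """Return (character offset, connective, sense) for each lexicon match."""
    # Try longer alternatives first so that "but then" wins over "but".
    alternatives = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in alternatives) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for match in pattern.finditer(text):
        connective = match.group(1).lower()
        hits.append((match.start(), connective, lexicon[connective]))
    return hits


hits = find_connectives("It rained, but then it often does.")
# hits == [(11, "but then", "Concession")]
```

String matching of this kind over-generates, since many of these words also
have non-discourse uses and a connective may have several senses, so a real
system would need to disambiguate each match in context.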


4   Current challenges
To my mind, there are at least two areas in which initial efforts require more
investment, in order to see a pay-off: (1) More general acceptance of segments
contributing their semantics and pragmatics to multiple discourse relations; and
(2) exploring the stance/sentiment associated with discourse connectives and
discourse relations, so as to more accurately describe speakers’ and writers’ at-
titudes towards their subjects.
    With respect to segments simultaneously linked by multiple sense relations,
despite there being only one (or possibly even no) explicit discourse connective
between them, evidence comes both from experiments using crowdsourcing [23–
25] and from corpus annotation [22, 36]. Other evidence comes from cross-lingual
parallel texts, which often differ in their signalling of discourse relations [27].
    With respect to exploring the stance/sentiment associated with discourse
connectives and discourse relations, there are some obvious examples, such as
the preposition thanks to. While it clearly indicates that one clause is seen as
the Reason for the other clause holding, as in
    Operations are running smoothly thanks to decentralizing the company’s
    computer system before the quake
thanks to also indicates the speaker’s positive attitude to what is expressed in
the latter clause, which would not be evident if the phrase as a result of had been
used instead.
    Another example is the connective but then (also phrased but then again).
While it signals a Concession relation, with one clause denying an expectation
raised by the conceded clause, as in
    To many, it was a ceremony more befitting a king than a rural judge
    seated in the isolated foothills of the southern Allegheny Mountains. But then
    Judge O’Kicki often behaved like a man who would be king – and, some
    say, an arrogant and abusive one.
but then (again) also indicates the speaker’s attitude that the listener shouldn’t
be surprised.
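    One way to record such observations is to store a stance alongside the
sense of each connective. The sketch below encodes only the three examples
just discussed; the ConnectiveEntry class, the field names and the wording of
the stance descriptions are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ConnectiveEntry:
    sense: str
    stance: Optional[str]  # None where no particular attitude is noted


# Entries restate the examples in the text; nothing else is claimed.
STANCE_LEXICON: Dict[str, ConnectiveEntry] = {
    "thanks to": ConnectiveEntry("Reason", "positive attitude to the outcome"),
    "as a result of": ConnectiveEntry("Reason", None),
    "but then": ConnectiveEntry("Concession", "listener should not be surprised"),
    "but then again": ConnectiveEntry("Concession", "listener should not be surprised"),
}


def same_sense_different_stance(a: str, b: str,
                                lexicon: Dict[str, ConnectiveEntry] = STANCE_LEXICON) -> bool:
    """True when two connectives share a sense but differ in recorded stance."""
    entry_a, entry_b = lexicon[a], lexicon[b]
    return entry_a.sense == entry_b.sense and entry_a.stance != entry_b.stance
```

With these entries, thanks to and as a result of come out as sharing a sense
while differing in stance, which is the kind of contrast that an aggregated
cross-lingual description would make available.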
    While papers have been written about the contribution of discourse relations
to the expression of sentiment (e.g., [4, 5, 13, 34]), this should be complemented
by aggregating descriptions of the range of sentiments (stances) conveyed by
discourse connectives and in discourse relations across multiple languages.
    With the appearance of new discourse annotated corpora such as the TED-
MDB [41] and the expanded Penn Discourse TreeBank [36] and with new discourse-
related Shared Tasks [38], discourse-based information should become more in-
tegral to Natural Language Processing and hence more available for use by any
technologies such as Information Retrieval and Question Answering that use and
thereby add value to text.


References
 1. Al-Saif, A., Markert, K.: The Leeds Arabic Discourse Treebank: Annotating dis-
    course connectives for Arabic. In: Proceedings, 7th International Conference on
    Language Resources and Evaluation (LREC 2010) (2010)
 2. Al-Saif, A., Markert, K.: Modelling discourse relations for Arabic. In: Proceedings,
    Empirical Methods in Natural Language Processing. pp. 736–747 (2011)
 3. Barzilay, R., Lee, L.: Catching the Drift: Probabilistic content models, with appli-
    cations to generation and summarization. In: Proceedings of the 2nd Human Lan-
    guage Technology Conference and Annual Meeting of the North American Chapter,
    Association for Computational Linguistics. pp. 113–120 (2004)
 4. Bhatia, P., Ji, Y., Eisenstein, J.: Better document-level sentiment analysis from
    RST discourse parsing. In: Proceedings, Empirical Methods in Natural Language
    Processing (EMNLP). pp. 2212–2218 (2015)
 5. Chardon, B., Benamara, F., Mathieu, Y., Popescu, V., Asher, N.: Measuring the
    effect of discourse structure on sentiment analysis. In: Proceedings, 14th Interna-
    tional Conference on Intelligent Text Processing and Computational Linguistics
    (CICLing 2013) (2013)
 6. Chung, G.: Sentence retrieval for abstracts of randomized controlled trials. BMC
    Medical Informatics and Decision Making 10(9) (February 2009)
 7. Dale, R.: Generating Referring Expressions. MIT Press, Cambridge MA (1992)
 8. Das, D., Scheffler, T., Bourgonje, P., Stede, M.: Constructing a lexicon of English
    discourse connectives. In: Proceedings of the 56th Annual Meeting of the ACL
    (August 2018)
 9. Das, D., Taboada, M.: RST signalling corpus: a corpus of signals of coherence
    relations. Language Resources and Evaluation 52, 149–184 (2018)
10. Grosz, B., Sidner, C.: Attention, intention and the structure of discourse. Compu-
    tational Linguistics 12(3), 175–204 (1986)
11. Hearst, M.: TextTiling: Segmenting text into multi-paragraph subtopic passages.
    Computational Linguistics 23(1), 33–64 (1997)
12. Hirohata, K., Okazaki, N., Ananiadou, S., Ishizuka, M.: Identifying sections in
    scientific abstracts using conditional random fields. In: Proceedings of the 3rd In-
    ternational Joint Conference on Natural Language Processing. pp. 381–388 (2008)
13. Ji, Y., Smith, N.: Neural discourse structure for text categorization. In: Proceed-
    ings, Association for Computational Linguistics (ACL). pp. 996–1005 (2017)
14. Knott, A., Oberlander, J., O’Donnell, M., Mellish, C.: Beyond elaboration: The
    interaction of relations and focus in coherent text. In: Sanders, T., Schilperoord,
    J., Spooren, W. (eds.) Text Representation: Linguistic and psycholinguistic aspects,
    pp. 181–196. John Benjamins Publishing (2001)
15. Malioutov, I., Barzilay, R.: Minimum cut model for spoken lecture segmentation.
    In: Proceedings of the 21st International Conference on Computational Linguistics
    and the 44th annual meeting of the Association for Computational Linguistics
    (2006)
16. Mann, W., Thompson, S.: Rhetorical Structure Theory: Toward a functional theory
    of text organization. Text 8(3), 243–281 (1988)
17. McKeown, K.: Text Generation: Using Discourse Strategies and Focus Constraints
    to Generate Natural Language Texts. Cambridge University Press, Cambridge,
    England (1985)
18. Mírovský, J., Synková, P., Rysová, M., Poláková, L.: CzeDLex: A lexicon of Czech
    discourse connectives. In: Prague Bulletin of Mathematical Linguistics. vol. 109,
    pp. 61–91 (2017)
19. Moore, J., Pollack, M.: A problem for RST: The need for multi-level discourse
    analysis. Computational Linguistics 18(4), 537–544 (1992)
20. Oza, U., Prasad, R., Kolachina, S., Sharma, D.M., Joshi, A.: The Hindi discourse
    relation bank. In: Proc. 3rd ACL Language Annotation Workshop (LAW III).
    Singapore (August 2009)
21. Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., Webber, B.:
    The Penn Discourse TreeBank 2.0. In: Proceedings, 6th International Conference
    on Language Resources and Evaluation. pp. 2961–2968. Marrakech, Morocco (2008)
22. Prasad, R., Webber, B., Joshi, A.: Reflections on the Penn Discourse Treebank,
    comparable corpora and complementary annotation. Computational Linguistics
    40(4), 921–950 (2014)
23. Rohde, H., Dickinson, A., Schneider, N., Clark, C., Louis, A., Webber, B.: Fill-
    ing in the blanks in understanding discourse adverbials: Consistency, conflict,
    and context-dependence in a crowdsourced elicitation task. In: Proceedings of
    the Tenth Linguistic Annotation Workshop (LAW-X). pp. 49–58. Berlin (2016),
    http://www.aclweb.org/anthology/W16-1707
24. Rohde, H., Dickinson, A., Schneider, N., Clark, C., Louis, A., Webber, B.: Ex-
    ploring substitutability through discourse adverbials and multiple judgments. In:
    Proceedings, 12th International Conference on Computational Semantics (IWCS
    2017). Montpellier, France (2017)
25. Rohde, H., Johnson, A., Schneider, N., Webber, B.: Discourse coherence: Concur-
    rent explicit and implicit relations. In: Proceedings of the 56th Annual Meeting of
    the ACL (August 2018)
26. Roze, C., Danlos, L., Muller, P.: LEXCONN: A French lexicon of discourse connectives.
    Discours 10 (2012)
27. Shi, W., Yung, F., Demberg, V.: Acquiring annotated data with cross-lingual ex-
    plicitation for implicit discourse relation classification. In: Proceedings, Workshop
    on Discourse Relation Parsing and Treebanking (DISRPT) (2019)
28. Sibun, P.: Generating text without trees. Computational Intelligence 8(1), 102–122
    (1992)
29. Stab, C., Gurevych, I.: Parsing argumentation structures in persuasive essays.
    Computational Linguistics 43, 619–659 (2017)
30. Stede, M.: Discourse Processing. Morgan & Claypool Publishers (2012)
31. Stede, M., Afantenos, S., Peldszus, A., Asher, N., Perret, J.: Parallel discourse
    annotations on a corpus of short texts. In: Proceedings of the 10th International
    Conference on Language Resources and Evaluation (LREC) (2016)
32. Stede, M., Umbach, C.: DiMLex: A lexicon of discourse markers for text generation
    and understanding. In: Proceedings, 36th Annual Meeting of the ACL (1998)
33. Steedman, M.: Surface Structure and Interpretation. Linguistic Inquiry Monograph
    30, MIT Press, Cambridge MA (1996)
34. Taboada, M., Voll, K., Brooke, J.: Extracting sentiment as a function of discourse
    structure and topicality. Tech. Rep. 2008-20, School of Computing Science, Simon
    Fraser University (2008)
35. Webber, B., Egg, M., Kordoni, V.: Discourse structure and language technology.
    Natural Language Engineering 18(4), 437–490 (2012)
36. Webber, B., Prasad, R., Lee, A., Joshi, A.: The Penn Discourse Treebank
    3.0 Annotation Manual. Tech. rep., University of Pennsylvania (2019), available at
    https://catalog.ldc.upenn.edu/docs/LDC2019T05/PDTB3-Annotation-Manual.pdf
37. XTAG-Group, T.: A Lexicalized Tree Adjoining Grammar for En-
    glish. Tech. Rep. IRCS 01-03, University of Pennsylvania (2001), see
    ftp://ftp.cis.upenn.edu/pub/ircs/technical-reports/01-03
38. Zeldes, A., Das, D., Maziero, E.G., Antonio, J., Iruskieta, M.: The DISRPT 2019
    shared task on elementary discourse unit segmentation and connective detection.
    In: Proceedings, Workshop on Discourse Relation Parsing and Treebanking 2019.
    pp. 97–104. Minneapolis, MN (2019)
39. Zeyrek, D., Demirşahin, I., Sevdik-Çallı, A., Ögel Balaban, H., Yalçınkaya, İ.,
    Turan, U.D.: The annotation scheme of the Turkish Discourse Bank and an
    evaluation of inconsistent annotations. In: Proceedings of the 4th Linguistic
    Annotation Workshop (LAW IV) (2010)
40. Zeyrek, D., Kurfalı, M.: An assessment of explicit inter- and intra-sentential
    discourse connectives in Turkish Discourse Bank. In: Proceedings of the 11th Language
    Resources and Evaluation Conference. European Language Resource Association,
    Miyazaki, Japan (May 2018), https://www.aclweb.org/anthology/L18-1634
41. Zeyrek, D., Mendes, A., Grishina, Y., Kurfalı, M., Gibbon, S., Ogrodniczuk, M.:
    TED Multilingual Discourse Bank (TED-MDB): a parallel corpus annotated
    in the PDTB style. Language Resources and Evaluation (April 2019).
    https://doi.org/10.1007/s10579-019-09445-9
42. Zhou, Y., Xue, N.: PDTB-style discourse annotation of Chinese text. In: Proc. 50th
    Annual Meeting of the ACL. Jeju Island, Korea (2012)
43. Zhou, Y., Xue, N.: The Chinese Discourse TreeBank: A Chinese corpus annotated
    with discourse relations. Journal of Language Resources and Evaluation 49, 397–
    431 (2015)