A Description of Turkish Discourse Bank 1.2 and an
Examination of Common Dependencies in Turkish
Discourse
Deniz Zeyrek1, Mustafa Erolcan Er1
1 Middle East Technical University, Graduate School of Informatics, Cognitive Science Department, Dumlupınar
Boulevard, No:1, 06800, Ankara, Turkey


Abstract
We describe Turkish Discourse Bank 1.2, the latest version of a discourse corpus annotated for explicitly
or implicitly conveyed discourse relations, their constitutive units, and senses in the Penn Discourse
Treebank style. We present an evaluation of the recently added tokens and examine three commonly
occurring dependency patterns that hold among the constitutive units of a pair of adjacent discourse
relations, namely, shared arguments, full embedding and partial containment of a discourse relation.
We present three major findings: (a) implicitly conveyed relations occur more often than explicitly
conveyed relations in the data; (b) it is much more common for two adjacent implicit discourse relations
to share an argument than for two adjacent explicit relations to do so; (c) both full embedding and partial
containment of discourse relations are pervasive in the corpus, which can be partly due to subordinator
connectives whose preposed subordinate clause tends to be selected together with the matrix clause
rather than being selected alone. Finally, we briefly discuss the implications of our findings for Turkish
discourse parsing.

Keywords
Turkish, discourse connectives, converbial suffixal connectives, postpositions, dependencies in discourse


1. Introduction
Turkish is a language of more than 80M speakers and belongs to the Turkic sub-family of
the Altaic language family. It is a free word-order, agglutinating language with a complex
morphology, where suffixation is a major tool of both derivation and inflection.
   The existing Natural Language Processing (NLP) methods for Turkish have been developed
primarily targeting its morphology and syntax, lately extending to semantics [1], [2]. But there
is also a need for discourse processing research, i.e. NLP beyond the boundaries of the sentence,
which would inform systems such as information retrieval, dialogue systems, and summarization.
The first annotated discourse corpus of Turkish, Turkish Discourse Bank, or TDB [3] has been
developed to fill the gap in the discourse processing of Turkish and is expected to support
language technology applications that need information at the discourse level. It is a manually

The International Conference and Workshop on Agglutinative Language Technologies as a challenge of Natural
Language Processing, ALTNLP’22, June 7-8, Koper, Slovenia
Email: dezeyrek@metu.edu.tr (D. Zeyrek); erolcan.er@metu.edu.tr (M. E. Er)
URL: http://users.metu.edu.tr/dezeyrek/ (D. Zeyrek); https://github.com/erolcan-er (M. E. Er)
ORCID: 0000-0001-9248-0141 (D. Zeyrek); 0000-0002-3009-4517 (M. E. Er)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
annotated corpus of modern Turkish that follows the rules and principles of the Penn Discourse
Treebank (PDTB) [4], annotating discourse relations over texts from various genres (fiction,
biography, newspaper editorials, popular magazines, etc.).1
   While the PDTB still remains the largest resource, the creation of PDTB-style discourse
corpora in languages such as Turkish, Hindi, Arabic and Chinese (see [5] and the references
therein) has been significant for empirical purposes and for discourse processing studies on
those languages. The empirical value of new resources is high because they underscore both
the variability and similarity of discourse-related phenomena across languages and enable
researchers to reach a better understanding of discourse structure.
   The goal of the current paper is twofold: (a) to describe the latest version of TDB, namely
TDB 1.2, a 40,000-word corpus, and evaluate the recently added tokens; (b) to highlight three
commonly occurring discourse dependencies found in TDB 1.2, i.e., shared arguments, full
embedding and proper containment of a discourse relation, and discuss the issues revolving
around these dependencies from the viewpoint of a morphologically rich language.
   The layout of the paper is as follows. We start with an overview of the notions that underlie
TDB, describe major annotation categories, and offer an evaluation of the new discourse relation
tokens (§2). In §3 we present the most common dependencies in the corpus and discuss the
linguistic issues surrounding them, and in §4, we conclude the paper.


2. Turkish Discourse Bank 1.2
2.1. What is discourse and what are discourse relations?
Discourse is the level of language above the sentence and can be found even within a sentence.
The assumption in discourse research is that a stretch of text is not an arbitrary sequence
of sentences but a structured, coherent unit that has a meaning more than the sum of its
parts. Discourse structure can be discovered by examining the patterns in multi-sentence or
multi-clausal texts and by finding the constitutive units of these patterns. This is essential for
correctly interpreting the text [6] and for the first step of discourse processing, i.e. discourse
segmentation, known as discourse parsing. One of the key aspects of discourse structure is
discourse relations (DRs), which denote the semantic relatedness of two text pieces at the local
level, such as contrast, additive, condition.2
   Following the PDTB’s lexicalized approach to discourse relations, it is assumed that there is
lexico-syntactic evidence for the existence of discourse relations. Thus, connectives are seen as
a primary source of evidence for the occurrence of a discourse relation. These are expressions
such as conjunctions and adverbs (or, although, moreover) linking clauses that have an abstract
object interpretation (propositions or eventualities) [7]. They are referred to as (explicit) discourse
connectives (DCs) signalling the presence of discourse relations (see example (7) in Appendix A).


    1 The earliest version, TDB 1.0, is a ∼400,000-word corpus available at
https://github.com/disrpt/sharedtask2019/tree/master/data/tur.pdtb.tdb. TDB 1.1, a 40,000-word version with fewer
annotations, is available at https://github.com/disrpt/sharedtask2021/tree/main/data/tur.pdtb.tdb.
    2 Although discourse relations can also express the pragmatic relatedness of discourse units (e.g. claim-evidence),
such relations are not annotated in TDB.
   But readers do not necessarily need discourse connectives, because they can easily infer the
relation from the adjacency of textual units, lexical relations, anaphoric links, etc. These have
been known as implicit relations. Furthermore, readers can add a discourse connective to an
implicitly conveyed relation to make it salient – called “implicit discourse connectives” [4] – and
can specify the textual parts of an implicit relation (see example (8) in Appendix A). Finally,
implicit relations may be realized by other means, namely, through Alternative Lexicalization
(AltLex), or as Hypophora, Entity Relation, as well as No Relation (more explanation and
examples are provided for each relation type in Appendix A).

2.2. What is annotated in TDB?
Based on the notions described in §2.1, three major aspects of discourse are annotated in TDB 1.2:
(a) discourse relations conveyed explicitly or implicitly as well as by other means, (b) constitutive
units of discourse relations, which are known as arguments, (c) the sense of explicitly and
implicitly conveyed relations and AltLexes. There are always two textual units that constitute a
relation. The textual unit syntactically hosting the discourse connective is called Argument 2;
the other argument is called Argument 1.3
   Although all languages have elements that function as discourse connectives, the syntactic
class to which they belong may differ. For example, Turkish not only has lexical connectives
(and, but, so) as most languages do but also converbial and postpositional connectives, grouped
as subordinators. These connectives relate a non-finite subordinate adverbial clause to the
matrix clause. In converbial structures, the marker of the relation is merely a suffix, called
suffixal connectives here, which generally correspond to subordinating conjunctions in English.
In postpositional structures, the marker of the relation has two parts, a postposition and
a nominalization suffix on the subordinate verb. Converbial suffixes and postpositions are
annotated as explicit discourse connectives in TDB.
   Importantly, the neutral order of arguments to subordinators is Argument2-Argument1 (i.e.
the argument that hosts the connective, which is the second argument, is normally preposed).
Both subordinator types are typically translated to English with a postposed subordinate
clause. Example (1) presents a suffixal connective, -ince ‘when’, while (2) illustrates a
postposition, -diği gibi ‘as’, used as a discourse connective. Both connectives relate a
preposed non-finite subordinate clause to the matrix clause. In the examples throughout the
paper, the discourse connective is underlined, the inferred implicit discourse connective is both
underlined and put between parentheses. Argument 1 is shown in italic fonts, Argument 2 in
bold fonts. Each Turkish example is translated into English and shown between single quotation
marks.

(1)       Öğrenciler gel-ince aşağı indi.
          ‘He came down when the students arrived’.


      3 At least two annotators, who were graduate students at Middle East Technical University, Cognitive Science
Department, were involved in each annotation cycle. The annotations were regularly checked and adjudicated by
the research team.
(2)     Ali’nin göster-diği gibi resim yaptım.
        ‘I drew as Ali showed’.


In the rest of the paper, the patterns that involve subordinator connectives will be in focus as
their syntactic behaviour is peculiar to Turkish and their analysis could highlight the differences
between Turkish and other languages annotated in the same style.

2.3. Evaluation and the finalization of TDB 1.2
TDB 1.2 currently has a total of 3870 relations, surpassing TDB 1.1 by 2014 relations (see
Appendix B for the tokens recently added to the corpus). Since earlier versions of TDB 1.1
have already been evaluated, it appeared meaningful to evaluate the recently added tokens. A
group of three expert annotators worked on a randomly chosen ∼ 42% of the new relations
(849 tokens in total) annotated since [8]. They were told to accept the annotations, revise them
where needed, or reject them, suggesting a new relation token where possible. All decisions
were made unanimously by them independently of the annotators who created and adjudicated
the recent tokens. In calculating inter-annotator agreement (IAA) statistics, we considered the
already adjudicated tokens as created by Annotator1, and the unanimously revised tokens as
created by Annotator2. Thus IAA was measured between two annotators. We measured various
types of IAA as described below and obtained a high degree of agreement in each case.
      • Agreement on the DRs’ type of realization: This is defined as the number of common
        discourse relations (pairs of clauses specified as a discourse relation by both annotators)
        over the number of unique relations, where all relations have the same type of realization
        [8, 9]. We used the exact match criterion [10] and present the results of this analysis in
        Table 1 in Appendix C.
      • Agreement on senses: The PDTB introduces a hierarchically organized semantic cate-
        gorization used to tag the sense(s) of Explicit and Implicit relations and AltLexes. The
        sense hierarchy has four Level-1 senses (Expansion, Contingency, Comparison, Tempo-
        ral), which are further refined by Level-2 senses. A third level specifies the semantic
        contribution of each argument [4]. Thus, a temporal relation anchored by then would be
        annotated as Temporal:Asynchronous:Precedence, while a temporal relation expressed
        by after would be annotated as Temporal:Asynchronous:Succession. Following [9], we
        calculated sense agreement on all three sense levels of the PDTB 3.0 sense hierarchy
        among common discourse relations using the exact match criterion. The results are listed
        in Table 2 in Appendix C.
      • Agreement on argument spans: TDB 1.2 asks the annotators to observe the PDTB’s mini-
        mality principle, which states that an argument to a discourse connective should span
        only as much text as the sense of the relation requires. The annotators are not
        encouraged to select distant arguments to a discourse connective, and they should
        leave out certain expressions specified in the annotation manual (e.g. attribution phrases
        such as he said should be excluded).
        To evaluate the stability of the argument span annotations, we measured IAA using
        Cohen’s Kappa [11].
      The first step involves determining the boundaries of arguments, both Argument1 and
      Argument2. This is known as unitization of the data ([12, 13]). In earlier work on TDB
      1.0, the data was unitized with respect to words [3]. In the current work, we unitized the
      data with respect to characters by encoding each of them as 1 or 0 (selected/excluded);
      that is, we recorded the number of judgements a character receives for each category
      and calculated agreement over the data unitized in this manner. This encoding method
      has been considered more advantageous than the word-based encoding as it suits the
      agglutinating nature of Turkish better, enabling for example, the calculation of the agree-
      ment on argument spans to suffixal connectives. The agreement on each argument was
      measured separately. The results are given in Table 3 in Appendix C.
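
The three agreement measures above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the script used for TDB 1.2: the function names, the representation of a relation as a hashable tuple, the colon-separated sense strings, and the (start, end) character offsets are all hypothetical. In particular, "common relations over unique relations" is read here as the size of the intersection divided by the size of the union of the two annotators' relation sets, which is one possible reading of the definition.

```python
from typing import Hashable, Iterable, List, Sequence, Tuple

Span = Tuple[int, int]  # hypothetical (start, end) character offsets, end exclusive


def exact_match_agreement(ann1: Iterable[Hashable], ann2: Iterable[Hashable]) -> float:
    """Relations found by both annotators over all distinct relations (assumed reading)."""
    a, b = set(ann1), set(ann2)
    unique = a | b
    return len(a & b) / len(unique) if unique else 1.0


def sense_agreement(pairs: Sequence[Tuple[str, str]], level: int) -> float:
    """Share of common relations whose senses match exactly down to the given level."""
    if not pairs:
        raise ValueError("no common relations to compare")
    hits = sum(s1.split(":")[:level] == s2.split(":")[:level] for s1, s2 in pairs)
    return hits / len(pairs)


def unitize(text_length: int, spans: Iterable[Span]) -> List[int]:
    """Encode every character as 1 (selected as part of the argument) or 0 (excluded)."""
    units = [0] * text_length
    for start, end in spans:
        for i in range(start, end):
            units[i] = 1
    return units


def cohen_kappa(labels1: Sequence[int], labels2: Sequence[int]) -> float:
    """Cohen's kappa for two annotators labelling the same sequence of units."""
    if len(labels1) != len(labels2) or not labels1:
        raise ValueError("label sequences must be non-empty and of equal length")
    l1, l2 = list(labels1), list(labels2)
    n = len(l1)
    observed = sum(x == y for x, y in zip(l1, l2)) / n
    # Chance agreement: product of the two annotators' marginal proportions per category.
    expected = sum((l1.count(c) / n) * (l2.count(c) / n) for c in set(l1) | set(l2))
    if expected == 1.0:
        return 1.0  # both annotators used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Toy example: two annotators mark Argument 1 in a 20-character text.
annotator1 = unitize(20, [(0, 10)])
annotator2 = unitize(20, [(0, 8)])
print(round(cohen_kappa(annotator1, annotator2), 2))
```

Because the units are characters rather than words, a disagreement about whether a suffix belongs to an argument changes only a few units, which is what makes this encoding suitable for suffixal connectives.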

   All disagreements were resolved by the research team and the remaining discourse relation
tokens checked and updated where needed. The results were recorded in the data and TDB 1.2
was created. Recently added tokens yielded a corpus with the annotation categories distributed
as shown in Table 4 in Appendix C. The table reveals that the majority of the relations are
implicit, amounting to 62.09% of the total number of annotated tokens, whereas explicit
relations constitute 37.91% of the data.


3. Common dependencies in TDB 1.2
TDB annotation style reflects the incremental interpretation of texts by humans. The annotators
are asked to read the text sentence by sentence and annotate different realizations of discourse
relations as they appear in the text, also tagging the constituents of discourse relations along
with the relations’ senses. Although they are not required to annotate any dependencies
among discourse units, by examining the annotation files produced by this annotation style,
certain dependencies can be detected, which in turn would inform us about discourse structure,
ultimately supporting discourse parsers and other language technology applications. Discourse-
level dependencies have been examined in PDTB 2.0 for English [14], over TDB 1.0 for Turkish
[15, 16], and recently for Czech [17]. In this paper, we continue this line of research started
by Lee et al. [14]. Examining TDB 1.2 with a Python script, we investigate the dependencies
among three discourse units belonging to two consecutive discourse relations related by explicit
or implicit discourse connectives (other discourse relations are out of scope of our analysis).
   The object of our investigation can be represented as: 𝐷𝑈1 - DC1 - 𝐷𝑈2 - DC2 - 𝐷𝑈3 . That
is, we deal with the dependencies among three linearly ordered discourse units (DUs), where
DU means any text span selected as an argument by one or both of the discourse connectives.
The major dependency types that we find are listed in Table 5 in Appendix C together with the
number of times each type occurs in the data.
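
The pairing of consecutive relations into patterns such as Implicit-Implicit or Implicit-Explicit can be sketched as below. This is only an illustration, not the script used for the analysis: the `Relation` class, its field names, and the use of a connective position to establish linear order are assumptions, since the paper does not describe the format of the annotation files.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Tuple

Span = Tuple[int, int]  # hypothetical (start, end) character offsets


@dataclass(frozen=True)
class Relation:
    rel_type: str   # e.g. "Explicit", "Implicit", "AltLex", "EntRel"
    arg1: Span
    arg2: Span
    conn_pos: int   # assumed: offset of the (explicit or inferred) connective


IN_SCOPE = {"Explicit", "Implicit"}


def adjacent_patterns(relations: Iterable[Relation]) -> Counter:
    """Count DC1-DC2 patterns over linearly consecutive relations.

    A pair is counted only if both relations are Explicit or Implicit;
    pairs involving other relation types are out of scope and skipped.
    """
    ordered = sorted(relations, key=lambda r: r.conn_pos)
    counts: Counter = Counter()
    for left, right in zip(ordered, ordered[1:]):
        if left.rel_type in IN_SCOPE and right.rel_type in IN_SCOPE:
            counts[f"{left.rel_type}-{right.rel_type}"] += 1
    return counts
```

Each counted pair corresponds to one DU1 - DC1 - DU2 - DC2 - DU3 configuration, with the left relation supplying DC1 and the right relation supplying DC2.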

3.1. Shared arguments
Shared arguments refer to multiple parenthood, a kind of dependency where 𝐷𝑈2 is shared by
the right side and the left side discourse connectives without any part of the argument span
being excluded (in the examples, the shared argument is shown in a double-lined frame box
to distinguish it from other DUs, which are placed in a frame box). Table 5 shows that 632
tokens (72.48% of the total number of shared arguments) are an argument to an implicit 𝐷𝐶1
shared by an implicit 𝐷𝐶2 (the Implicit-Implicit pattern in Table 5). Given the high number
of implicit discourse relations in the corpus, the common occurrence of shared arguments in
the Implicit-Implicit pattern is not unexpected. Also, recall that TDB is a multi-genre corpus
including works of fiction, where few discourse connectives tend to occur. So, the inclusion of
fiction in our corpus could be one of the reasons why implicit relations occur more frequently
than explicit ones, eventually leading to the frequent occurrence of arguments shared by implicit
relations.
   Example (3) illustrates an Implicit-Implicit dependency structure, where 𝐷𝑈2 is shared by
two implicit relations and the shared argument is syntactically a finite clause just like other DUs
in the example. Each DU of this example is a main clause expressing an independent eventuality
that can take the discourse forward. This appears to be a valid reason to make them available
for reselection.

(3)     Bu ben değildim , (çünkü) ben yere bakmazdım , (bilakis) gözüne gözüne bakardım insanların .

        This was not me (because) I would not look down , (rather) I would look into people’s eyes .

  Given the saliency of main clauses in discourse [18], their reselection is no surprise, but
are subordinate clauses shared? As already mentioned, in Turkish, postpositional and suffixal
connectives anchor non-finite (preposed) subordinate clauses. Are such clauses shared or not?
We found that such subordinate clauses can be shared, though very rarely. For example, we
found only 6 instances where the subordinate clause of a postposition is shared. Sentence (4)
presents a causal postpositional connective için (DC2), and its subordinate clause (görüşmeyi
kabul ettiği ‘accepting to meet us’) reselected without its matrix clause. Although a detailed
analysis is needed to reveal the conditions under which a preposed subordinate clause (𝐷𝑈2 ) is
shared, it appears that in (4), annotators have interpreted the eventuality described in 𝐷𝑈2 as
semantically independent, possibly co-occurring with the event described in the matrix clause
(DU3). This could have triggered the reselection of the subordinate clause.

(4)     Bizi aray- -arak görüşmeyi kabul ettiği için çok teşekkür ediyoruz .

       ‘(By) Calling us   he accepted to meet with us , it is for this reason that we are thankful to him .’


   To summarize, our analysis shows that while it is common for two adjacent implicit discourse
relations to share an argument, it is much less common for two adjacent explicit relations to
share an argument, and subordinate clauses of subordinators are shared on rare occasions.

3.2. Full embedding
Full embedding refers to cases where an entire discourse relation is realized as an argument to
another connective. It is similar to embedding in syntax and is expected to occur in TDB 1.2, too.
  Indeed, it is common in the corpus, as Table 5 (Appendix C) reveals.
  Most of the fully embedded discourse relations appear in patterns where 𝐷𝐶2 is an explicit
discourse connective, either lexical or suffixal. The Implicit-Explicit pattern, for example, occurs
in 59.77% of all fully embedded instances in Table 5. This is where the second argument to an
implicit 𝐷𝐶1 is a fully embedded relation anchored by an explicit 𝐷𝐶2 .
   Example (5) is chosen from the Explicit-Explicit pattern. It presents a suffixal discourse
connective -ip ‘after’ and its binary arguments being fully embedded as an argument to a
suffixal connective on the left side, -arak ‘once’. In other words, the subordinate clause of -ip
(anneannesinin yanına gel- ‘move to her grandmother’s) is selected together with the matrix
clause, as the translation also shows.

(5)     Hukuk Fakültesini yarım bırak -arak anneannesinin yanına gel -ip Ankara’ya yerleşmesinin
       nedeni ...
       ‘the reason why after moving to her grandmother’s she settled in Ankara

       once she quitted the Law School ’ ...


    Different from example (4), this subordinate clause is not selected alone and a shared argument
structure does not arise. The selection of the entire discourse relation seems due to a semantic
reason: rather than being an independent eventuality, the event in the subordinate clause is in
a sense dependent on the event described in the matrix clause: it brings about the event in the
matrix clause. The preposed position of the subordinate clause and possibly its non-finiteness
coupled with its semantics appears to block its selection alone as an argument. Although our
annotation guidelines do not have rules regarding such subtle issues, the annotators opted to
select most of the preposed non-finite subordinate clauses together with their matrix clauses
(i.e. the entire discourse relation) as an argument, leading to fully embedded clauses or properly
contained discourse relations, which is the next topic below.

3.3. Properly contained discourse relations
Properly contained discourse relations are a subtype of fully embedded ones, except that some
material is left out (shown with three dots in Ex. (6)); the examination of the excluded part
is left for further research. Similar to fully embedded relations, properly contained relations
tend to occur in the patterns where 𝐷𝐶2 is an explicit discourse connective. For example, the
Implicit-Explicit pattern comprises 55.25% of all properly contained relations.

(6)     çarşaflarla geceden giderek terasa saklandı (sonra) ... çarşafları giy -erek terastan indi .

        he hid at the terrace with the hijab (then) ... after wearing the hijab he came down .

In Ex. (6), chosen from the Implicit-Explicit pattern, the preposed subordinate clause (𝐷𝑈2 )
and its matrix clause (𝐷𝑈3 ) are selected entirely as the second argument to 𝐷𝐶1 rather than
the subordinate clause being selected alone, which would have resulted in a shared argument
structure. Once again, this seems to be due to the position of the subordinate clause as well as
its semantics: the event described by the preposed subordinate clause çarşafları giy- ‘wear the
hijab’ engenders the main event terastan indi ‘he came down’; the man wears the hijab and only
then, he comes down from where he is hiding (otherwise, he would be noticed by the women, as
the narrative describes). These events are not inferred as independently (co-)occurring, which
seems a good reason why we find a properly contained dependency structure.
   In short, preposed (non-finite) subordinate clauses in Turkish seem to trigger full embedding
or proper containment structures, which could be explained not only by the position and
non-finiteness of the subordinate clauses but also by their semantics in relation to the matrix
clauses.
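
The three dependency types discussed in §3.1 to §3.3 can be told apart by comparing argument spans, as in the sketch below. It is a rough illustration under several assumptions and not the authors' script: the `Rel` class and the (start, end) offsets are hypothetical, and every argument is treated as a single contiguous span. Since the paper does not spell out how left-out material is identified, the sketch labels as proper containment any case where an argument covers both arguments of the other relation without coinciding exactly with that relation's extent, which is only one possible reading.

```python
from typing import NamedTuple, Tuple

Span = Tuple[int, int]  # hypothetical (start, end) character offsets, end exclusive


class Rel(NamedTuple):
    arg1: Span
    arg2: Span


def extent(rel: Rel) -> Span:
    """Smallest contiguous span covering both arguments (and any connective between them)."""
    return (min(rel.arg1[0], rel.arg2[0]), max(rel.arg1[1], rel.arg2[1]))


def covers(outer: Span, inner: Span) -> bool:
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def classify(r1: Rel, r2: Rel) -> str:
    """Label the dependency between two adjacent relations (simplified, assumed criteria)."""
    candidates = [(arg, inner) for outer, inner in ((r1, r2), (r2, r1)) for arg in outer]
    # Full embedding: an argument of one relation is exactly the whole other relation.
    if any(arg == extent(inner) for arg, inner in candidates):
        return "full embedding"
    # Shared argument: the same span is an argument of both relations.
    if set(r1) & set(r2):
        return "shared argument"
    # Proper containment (assumed reading): an argument covers both arguments of the
    # other relation but does not coincide with that relation's extent.
    if any(covers(arg, inner.arg1) and covers(arg, inner.arg2) for arg, inner in candidates):
        return "proper containment"
    return "other"
```

For a preposed subordinate clause, `arg2` of the embedded relation precedes its `arg1`, which is why `extent` takes the minimum start and maximum end rather than assuming Argument 1 comes first.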


4. Summary and conclusion
We introduced TDB 1.2, a corpus that annotates different realizations of discourse relations,
their arguments and senses in the PDTB style, and found that the corpus contains more implicit
relations than explicit ones. Then, we zoomed in on three types of dependency, which revealed
an asymmetry between the occurrence patterns of shared arguments on one hand and fully
embedded and properly contained discourse relations on the other. Our analyses showed that
arguments are shared frequently by two adjacent implicit discourse relations, but much less so
by two adjacent explicit discourse relations. Instead, discourse relations conveyed by explicit
connectives such as suffixal ones or postpositions tend to be selected totally as an argument to
another discourse relation, mostly an implicit one.
   Our findings have implications both for discourse parsing and for the theoretical understanding
of Turkish, paving the way for comparisons with other languages towards a better understanding
of discourse. While there is room for more research on both sides, the findings minimally show
that the implicit discourse relation recognition task can be improved by considering shared
arguments, which demonstrate, among other things, that a sequence of three adjacent implicit
discourse relations is highly likely in Turkish discourse. Also, automatic argument span detection
can be improved by considering the availability of an entire discourse relation anchored by
postpositions or suffixal connectives as an argument, as fully embedded and properly contained
dependency patterns reveal.
   What we have not examined in this paper is whether there are other factors involved in the
formation of the dependency structures described, e.g. the sense of 𝐷𝐶1 and/or 𝐷𝐶2 . The
investigation of such factors is left for further research.


Acknowledgments
We acknowledge the partial support of Middle East Technical University (BAP-07-04-2017-001)
and thank Salih Fırat Canpolat, Deniz Dilek Bilgiç, Ozan Deniz, Ali Can Serhan Yılmaz, Zeynep
Başer, Özgür Şen Bartan, Aytaç Çeltek and Murathan Kurfalı for their assistance at various
stages of the development of TDB 1.2. Any remaining errors are our own.


References
 [1] G. Eryiğit, J. Nivre, K. Oflazer, Dependency Parsing of Turkish, Computational Linguistics
     34 (2008) 357–389. doi:10.1162/coli.2008.34.4.627.
 [2] R. Çakıcı, M. Steedman, C. Bozşahin, Wide-coverage parsing, semantics, and morphology,
     in: Turkish Natural Language Processing, Springer, 2018, pp. 153–174. doi:10.1007/
     978-3-319-90165-7_8.
 [3] D. Zeyrek, I. Demirşahin, A. B. S. Çallı, Turkish Discourse Bank: Porting a discourse
     annotation style to a morphologically rich language, Dialogue & Discourse 4 (2013)
     174–184.
 [4] R. Prasad, N. Dinesh, A. Lee, E. Miltsakaki, L. Robaldo, A. K. Joshi, B. L. Webber, The
     Penn Discourse TreeBank 2.0., in: LREC, 2008. URL: https://www.aclweb.org/anthology/
     L08-1093/.
 [5] R. Prasad, B. Webber, A. Joshi, Reflections on the Penn Discourse TreeBank, comparable
     corpora, and complementary annotation, Computational Linguistics (2014). doi:10.1162/
     COLI_a_00204.
 [6] B. Webber, M. Egg, V. Kordoni, Discourse structure and language technology, Natural
     Language Engineering 18 (2011) 437–490. doi:10.1017/S1351324911000337.
 [7] N. Asher, Reference to Abstract Objects in Discourse, Kluwer, Dordrecht, 1993.
 [8] D. Zeyrek, M. Kurfalı, TDB 1.1: Extensions on Turkish Discourse Bank, in: Proceedings of
     the 11th Linguistic Annotation Workshop, 2017, pp. 76–81. doi:10.18653/v1/W17-0809.
 [9] K. Forbes-Riley, F. Zhang, D. Litman, Extracting PDTB Discourse Relations from Student
     Essays, in: Proc. of the SIGDIAL, 2016, pp. 117–127.
[10] E. Miltsakaki, R. Prasad, A. K. Joshi, B. L. Webber, The Penn Discourse TreeBank., in:
     LREC, 2004.
[11] J. Cohen, A Coefficient of Agreement for Nominal Scales, Educational and psychological
     measurement 20 (1960) 37–46.
[12] R. Artstein, M. Poesio, Inter-coder agreement for computational linguistics, Computational
     Linguistics 34 (2008) 555–596.
[13] Ş. İ. Yalçınkaya, An inter-annotator agreement measurement methodology for the Turkish
     Discourse Bank (TDB), Master’s thesis, Middle East Technical University, 2010.
[14] A. Lee, R. Prasad, A. Joshi, N. Dinesh, B. Webber, Complexity of dependencies in discourse:
     Are dependencies in discourse more complex than in syntax, in: Proceedings of the 5th
     International Workshop on Treebanks and Linguistic Theories, Citeseer, 2006, pp. 12–23.
[15] B. Aktaş, C. Bozsahin, D. Zeyrek, Discourse Relation Configurations in Turkish and an
     Annotation Environment, in: Proc. of the 4th Linguistic Annotation Workshop, ACL, 2010,
     pp. 202–206. URL: https://www.aclweb.org/anthology/W10-1832.pdf.
[16] I. Demirsahin, A. Ozturel, C. Bozsahin, D. Zeyrek, Applicative Structures and Immediate
     Discourse in the Turkish Discourse Bank, in: Proceedings of the 7th Linguistic Annotation
     Workshop and Interoperability with Discourse, 2013, pp. 122–130. URL: https://www.
     aclweb.org/anthology/W13-2315.pdf.
[17] L. Poláková, J. Mírovskỳ, Š. Zikánová, E. Hajičová, Discourse Relations and Connectives
     in Higher Text Structure, Dialogue & Discourse 12 (2021) 1–37.
[18] W. C. Mann, S. A. Thompson, Rhetorical Structure Theory: Toward a functional theory
     of text organization, Text-Interdisciplinary Journal for the Study of Discourse 8 (1988)
     243–281.

A. Appendix: Major annotation categories and examples in TDB 1.2
TDB 1.2 annotates implicitly and explicitly conveyed discourse relations that hold between ad-
jacent verb phrases, clauses, and sentences. This section illustrates major annotation categories
together with examples.
   Explicit relations - An explicit discourse relation holds when the relation is encoded through
an overt discourse connective.
(7)    Ali uzun boylu ama kız kardeşi kısa boyludur.
       ‘Ali is tall, but his sister is short.’

Implicit relations - In cases where an overt discourse connective is absent, an implicit discourse
relation is inferred and shown by inserting a discourse connective in the relation.
(8)    Yol kaygandı, (Imp=o yüzden) Ali arabayı dikkatli kullandı.
       ‘The road was slippery, (Imp=due to that) Ali was driving carefully.’

Alternative Lexicalization (AltLex) - When a discourse relation is alternatively lexicalized
through linguistic expressions such as despite this, because of this, the reason is, the relation is
called an AltLex.
(9)    Ali Latince öğrendi. Bundan sonra Fransızca kitap okumak çok kolay oldu.
       ‘Ali learnt Latin. After that, reading books in French has been so easy.’

Entity Relation (EntRel) - This is where the text spans express a relation with an entity.
(10)    Dr. Ahmet bey yeni bir hastahanede işe başladı. Rahmetli Dr. Ali bey’in yerini aldı.
        ‘Dr. Ahmet Beg has started to work in a new hospital. He succeeds the late Dr. Ali Beg.’

Hypophora - These are questions and meaningful answers given to the questions.
(11)    Fıkra hoşuna gitti mi? Evet bayıldım.
        ‘Did you like the joke? Yes I loved it.’

No Relation (NoRel) - A NoRel involves cases where no relation can be inferred between
adjacent text spans.
(12)    Okul yakında tatile girecek. Öğretmenler okula gönderilmeyen öğrencilerle uğraşamaz.
        ‘Children will have a break soon. Teachers can’t deal with students not sent to school.’

  Explicit and Implicit relations and AltLexes are annotated both within and across sentences,
while Hypophora tokens, EntRels, and NoRels are annotated only between adjacent sentences.


B. Appendix: Tokens recently added to TDB 1.2
The most recent additions to the corpus involve implicit verb phrase conjunctions (Ex. (13)) and
multiple relations (Ex. (14)-(16)).
(13)      Çabuk değişen (Imp=ve) yaşlanan bir nüfusumuz var.
          ‘We have a population that rapidly changes (Imp=and) ages.’

Multiple relations comprise:

       • the implicit senses of explicitly conveyed verb phrase conjunctions (only the senses of
         relations marked by the conjunction ve ‘and’ were considered) (Ex. (14)).
       • multiple relations between the same argument spans conveyed by co-occurring explicit
         connectives, such as ve böylece ‘and hence’ (Ex. (15)).
       • multiple relations between the same argument spans conveyed by an explicit connective
         and an AltLex, such as ve buna rağmen ‘and despite this’ (Ex. (16)).4
(14)      Okulu bıraktı ve (Imp=sonra) evlendi.
          ‘She left school and (Imp=then) got married.’
(15)      Ayşe sevdiğiyle evlendi ve böylece dünyanın en mutlu kızı oldu.
          ‘Ayşe married her beloved one and so she became the happiest woman in the world.’
(16)      Ali okuldan nefret etti ve buna rağmen liseden mezun olmayı başardı.
          ‘Ali hated school and despite this he managed to finish high school.’

Multiple relations were annotated separately on each token as in the PDTB, then linked with
the same index value in their link fields.
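
The linking step can be illustrated with a minimal sketch that groups relation tokens sharing a link index. The record layout and field names (id, type, connective, link) and the index value 7 are hypothetical and chosen only for illustration; they are not the actual TDB 1.2 file format.

```python
from collections import defaultdict

# Hypothetical, simplified relation records. Field names and values are
# assumptions for illustration, not the actual TDB 1.2 annotation format.
relations = [
    {"id": 1, "type": "Explicit", "connective": "ve", "link": 7},
    {"id": 2, "type": "Explicit", "connective": "böylece", "link": 7},
    {"id": 3, "type": "Implicit", "connective": "o yüzden", "link": None},
]


def group_by_link(records):
    """Return {link index: [records]} for indices shared by 2+ relations."""
    groups = defaultdict(list)
    for rec in records:
        if rec["link"] is not None:  # unlinked relations carry no index
            groups[rec["link"]].append(rec)
    # A multiple relation needs at least two tokens with the same index.
    return {idx: recs for idx, recs in groups.items() if len(recs) > 1}


linked = group_by_link(relations)
for idx, recs in linked.items():
    print(idx, [r["connective"] for r in recs])
```

Under these assumptions, the two tokens for ve böylece in Ex. (15) would be retrieved together, while the unlinked implicit relation is left out.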


C. Appendix: Summarization tables

Table 1
IAA results for agreement on DRs’ type of realization
                                      Realization Type         Agreement
                                      Implicit                 0.97
                                      Explicit                 0.99
                                      AltLex                   0.98
                                      EntRel                   0.95
                                      Hypophora                1.00
                                      NoRel                    0.97


Table 2
IAA results for sense agreement
                                  Sense        Explicit (%) Implicit (%) AltLex (%)
                                  Level-1      99.02        99.75        100
                                  Level-2      98.66        99.75        100
                                  Level-3      80.11        79.44        79.83

4 PDTB 3.0 annotates multiple senses for explicit or implicit relations if annotators infer more than one sense as
holding between a pair of spans. In TDB 1.2, multiple senses were not annotated systematically.

Table 3
IAA results for argument span selection
                                      Arg Type       Cohen’s 𝜅
                                      Argument1      0.90
                                      Argument2      0.85
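
For reference, Cohen’s 𝜅 corrects the observed agreement p_o for the agreement p_e expected by chance: 𝜅 = (p_o - p_e) / (1 - p_e). The sketch below computes it for two annotators’ categorical labels. How span selection was mapped onto categorical decisions is not specified here, so the token-level in/out labels are an illustrative assumption, and the toy data are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label; conventions vary,
        # here this degenerate case is reported as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Invented toy data: is each token inside ("in") or outside ("out") the
# argument span? This labelling scheme is an assumption for illustration.
annotator_1 = ["in", "in", "out", "out", "in", "out", "in", "in", "out", "out"]
annotator_2 = ["in", "in", "out", "out", "in", "out", "in", "in", "out", "in"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

In this toy case the annotators agree on 9 of 10 tokens (p_o = 0.9) and chance agreement is p_e = 0.5, giving 𝜅 = 0.8.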


Table 4
Number of different realizations of discourse relations and their Level-1 sense tags in TDB 1.2
                 Expansion     Temporal      Comparison       Contingency      DRs with no sense tag      Total
 Implicit        1090          158           162              333              0                          1743
 Explicit        540           400           259              268              0                          1467
 AltLex          33            32            14               67               0                          146
 EntRel          0             0             0                0                233                        233
 Hypophora       0             0             0                0                78                         78
 NoRel           0             0             0                0                203                        203
 Total           1663          590           435              668              514                        3870
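
The marginals of Table 4 can be recomputed from the cell counts. The sketch below is only a convenience check; the counts are copied from the table, while the variable names and the "NoSense" column label are our own shorthand.

```python
# Cell counts copied from Table 4; "NoSense" abbreviates "DRs with no sense tag".
SENSES = ["Expansion", "Temporal", "Comparison", "Contingency", "NoSense"]
counts = {
    "Implicit":  [1090, 158, 162, 333, 0],
    "Explicit":  [540, 400, 259, 268, 0],
    "AltLex":    [33, 32, 14, 67, 0],
    "EntRel":    [0, 0, 0, 0, 233],
    "Hypophora": [0, 0, 0, 0, 78],
    "NoRel":     [0, 0, 0, 0, 203],
}

# Row totals per realization type and column totals per Level-1 sense.
row_totals = {rel: sum(cells) for rel, cells in counts.items()}
col_totals = dict(zip(SENSES, (sum(col) for col in zip(*counts.values()))))
grand_total = sum(row_totals.values())

# Share of each realization type among all annotated relations.
share = {rel: total / grand_total for rel, total in row_totals.items()}

for rel in ("Implicit", "Explicit"):
    print(rel, row_totals[rel], f"{share[rel]:.1%}")
```

The recomputed totals match the table (1743 implicit and 1467 explicit relations out of 3870), that is, roughly 45% and 38% of all annotated relations.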


Table 5
Number of common dependencies in TDB 1.2
 DC1                            Explicit    Explicit    Implicit    Sub Total      Implicit       Total
 DC2                            Explicit    Implicit    Explicit                   Implicit
 Shared Arguments               41          105         96          240            632            872
 Fully embedded DRs             117         85          471         673            115            788
 Properly Contained DRs         145         82          521         748            195            943
 Total                          303         272         1088        1663           942            2605