=Paper= {{Paper |id=Vol-2155/valunaite |storemode=property |title=Observations on the Annotation of Discourse Relational Devices in TED Talk Transcripts in Lithuanian |pdfUrl=https://ceur-ws.org/Vol-2155/valunaite.pdf |volume=Vol-2155 |authors=Giedrė Valūnaitė Oleškevičienė,Deniz Zeyrek,Viktorija Mažeikienė,Murathan Kurfalı }} ==Observations on the Annotation of Discourse Relational Devices in TED Talk Transcripts in Lithuanian== https://ceur-ws.org/Vol-2155/valunaite.pdf
  Observations on the Annotation of Discourse Relational Devices in TED Talk
                         Transcripts in Lithuanian

Giedrė Valūnaitė Oleškevičienė1, Deniz Zeyrek2, Viktorija Mažeikienė1, Murathan Kurfalı3
1 Institute of Humanities, Mykolas Romeris University, Vilnius
2 Informatics Institute, Middle East Technical University, Ankara
3 Stockholm University, Stockholm and Middle East Technical University, Ankara
{gvalunaite, vmazeikiene}@mruni.eu
dezeyrek@metu.edu.tr, murathan.kurfali@ling.su.se

                                                                Abstract
Lithuanian researchers are working on enriching the existing corpora; they are also looking for ways to make the corpora interoperable
and co-searchable through the annotation of discourse relations. One of the goals of the present research is to annotate discourse
relations in TED talk transcripts translated into Lithuanian and thereby expand the set of available resources in the Lithuanian
language. A second goal is to compare the annotated texts cross-linguistically with a view to finding translation tendencies in
rendering discourse relations in Lithuanian. This, we believe, will open up a new research path in digital humanities
leading to an understanding of translation tendencies in TED talk transcripts across languages. According to our research results,
noteworthy translation tendencies include explicitation (a tendency to use more explicitly marked discourse relations in Lithuanian
than in the original transcripts), verbatim translation of discourse connectives, and a tendency to use fewer alternative lexicalizations
(a type of discourse-relational device).

Keywords: discourse, parallel, multilingual corpus, Lithuanian, annotation


1. Introduction

Lithuanian researchers are working on enriching the existing corpora and are also looking for ways to make the corpora interoperable and co-searchable through the annotation of discourse relations. One of the aims of the current research is to extend the available resources and lexicons of discourse-relational devices in Lithuanian in cooperation with the international team of researchers brought together by the European COST Project TextLink (http://www.textlink.ii.metu.edu.tr/). This aim is partially achieved by adding Lithuanian annotated texts to the existing TED Multilingual Discourse Bank, or TED-MDB, a parallel corpus annotated at the discourse level following the goals and principles of the Penn Discourse Treebank (PDTB) (Zeyrek et al., 2018). The second aim is to compare the discourse-annotated Lithuanian texts with the English annotations with a view to understanding translation tendencies. Our ultimate goal is to perform cross-linguistic analysis and transfer this information into the domain of digital humanities. In the rest of this paper, we describe the addition of Lithuanian annotations to TED-MDB and discuss our first results as far as discourse relations are concerned. This, we believe, will serve as the basis for our ultimate aim.

2. Research background

This section provides some general insights on Lithuanian, describes discourse connectives (DCs), and briefly outlines the PDTB annotation scheme. It also describes the data and presents some observations about the data.

2.1. Lithuanian

Lithuanian is a very old Indo-European language. It is a Baltic language which has conservative morphology, e.g. it has preserved morphological aspects of the proto-language, such as the word declensions. It is spoken by about 2,900,000 native speakers in Lithuania and about 200,000 abroad.

There are two main resources for modern Lithuanian: (a) The 9-million-word Corpus Academicum Lithuanicum, CorALit (http://coralit.lt), compiled by Vilnius University. It contains academic texts from the fields of biomedical sciences, humanities, physical sciences, social sciences, and technological sciences. (b) The 102-million-word online Corpus of the Contemporary Lithuanian Language (http://tekstynas.vdu.lt), which is of general character and includes publicist texts, fiction, non-fiction, administrative literature and spoken language. However, parallel corpora involving Lithuanian are still insufficient; currently only one parallel two-directional corpus exists, comprising an English - Lithuanian section (70,813 parallel sentences) and a Lithuanian - English section (1,614 parallel sentences) (http://tekstynas.vdu.lt). Furthermore, the corpus is not discourse-annotated. Such scarcity of corpus resources is an obvious barrier for machine translation (Šveikauskienė and Telksnys, 2014). Thus, for example, the English phrase calling him a liar is translated into Lithuanian as skambinti jam melagis (to phone him a liar) in the Google Translate application. Addressing such issues clearly requires corpus development, annotation and research.

2.2. Discourse Connectives and an Outline of the Annotation Scheme

Discourse connectives signal the way the writer or speaker would like the reader or listener to relate the ideas that are about to be said to the ideas that have been said before. According to Baker (2011), DCs could be used to signal different relations and the relations could be expressed in many ways; for example, in English, causality might be expressed through verbs such as cause, lead to or through DCs signaling the causality relation. Languages vary in terms of the type of connectives preferred as well as their frequency. Since the DCs signal the relations between pieces of information, they are related to the structuring of information and provide an insight into the whole logic of discourse (Smith and Frawley, 1983).

The literature suggests that some languages tend to express discourse relations (DRs) through complex structures while others prefer to use simpler structures and mark discourse relations explicitly, as the difference between English and Arabic illustrates (Holes, 1984). The author finds that while English prefers to present information in smaller pieces and signals the relations between them, Arabic prefers to group information into large discourse chunks. So the question arises how translators deal with DRs when faced with a multitude of explicit DCs in the source text or, conversely, how they render DRs when there is a limited number of connectives in the source text. Given that connectives deal with the logic of the text and are related to text interpretation, aligning the patterns of DCs with target language specifics and the text type of the target language is a complicated process. Translators have two choices: for the sake of a smooth and clear translation, they could insert additional DCs even when they are not used in the original text, i.e. resort to explicitation, or they could choose to translate the explicit DCs of the original text verbatim, though the resulting translation might sound foreign in the target language. In practice, translators choose something in between or use a bit of both techniques (Baker, 2011).

The PDTB is a 2-million-word corpus manually annotated for discourse-level information (Prasad et al., 2014). The annotation scheme mainly includes explicit and implicit DCs, alternative lexicalizations, entity relations, no relations, and their binary arguments, called Arg1 and Arg2. Senses are assigned to all DRs except entity relations and no relations. PDTB’s annotation approach is theory-neutral and lexically grounded. The theory-neutral approach means that the annotation is not based on a specific discourse theory. Lexical grounding implies that annotator judgments are effectively elicited both for explicit DRs and implicit DRs, i.e. even for cases where there are no explicit markers of the relation.

2.3. The Data

Our data comprise the Lithuanian TED talk transcripts corresponding to the original English texts included in TED-MDB (Table 1). TED-MDB is created on the basis of the PDTB 3.0 relation hierarchy (Webber et al., 2016). The PDTB is chosen mainly because it has been used reliably to annotate discourse in other languages, e.g. Turkish (Zeyrek et al., 2013), Arabic (Al-Saif and Markert, 2010), Chinese (Zhou and Xue, 2012), and Hindi (Oza et al., 2009). The corpus already includes transcripts in 6 languages: Turkish, English, Polish, German, Russian and Portuguese. As in the TED-MDB project, Lithuanian transcripts are retrieved from the WIT3 website (Cettolo et al., 2012) and annotated for DRs. The annotations are saved into annotation files corresponding to the raw texts. They are simple text files where each token is stored as a series of fields, such as sense, type, argument spans, delimited by the pipe symbol (|), as explained in Lee et al. (2016).

Both the TED website and the WIT3 website are open resources, which is attractive for research as they present numerous advantages, e.g. subtitles are available in a substantial number of languages, and the topics cover a wide span of knowledge fields, making the data applicable in multiple domains (Cettolo et al., 2012). However, the data also have certain disadvantages. Firstly, the talks are translated by (named) volunteers, which does not necessarily ensure a high-quality translation. The data are also limited concerning the use of parallel transcripts for DC research and for translation. For example, the collection of TED Talks is unidirectional, thus it cannot be used for exemplifying the differences between translation directions. There are also other issues to deal with, such as subtitling, which is a specific type of translation (Lefer and Grabar, 2015), and the genre of TED talks, which is a mix of spoken and written language. Finally, the variety of TED talk speakers (native and non-native speakers or speakers of various regional varieties of English) might be another issue to consider. Despite such issues, given the scarcity of parallel texts involving Lithuanian and the limited research on Lithuanian DCs, we chose to annotate the TED talk transcripts for DRs and examine the translation issues involved.

3. Annotation Procedures in Lithuanian

In Lithuanian, explicit DCs include expressions from four grammatical classes: subordinating conjunctions, e.g. kai, kol, nes, kadangi (when, while, because, since); coordinating conjunctions, e.g. ir, bei, o, tačiau (and, but, or, however); sentential relatives, e.g. tam kad, tuo metu kai (so that, at the time when); and discourse adverbials, e.g. faktiškai, galiausiai (actually, eventually). The main task is to identify whether the words and phrases function as explicit DCs, as they can have other functions. As in the PDTB, five types of relations are identified and annotated: explicit relations, implicit relations, alternative lexicalizations, entity relations, and no relations. The argument annotation of explicit DCs and alternative lexicalizations follows the rule that the argument which appears as syntactically bound to the DC is marked as Arg2; the other argument is annotated as Arg1. As in TED-MDB, adverbials called “discourse markers” (Hirschberg and Litman, 1987) are not annotated as they signal the organizational structure of the discourse rather than relating two arguments semantically. For example, Lithuanian dabar and its English equivalent (now) in the examples below serve to signal discourse organizational structure, so such cases were not annotated.

1. Dabar kaip matote įtampa apie kurią girdėjome San Fransiske apie susirūpinimą dėl būsto kainų ir gyventojų išstūmimo ir technologijų kompanijų, kurios atneša daug turto ir įsikuria, yra tikra.

2. Now you can see, though, that the tensions that we’ve heard about in San Francisco in terms of people being concerned about gentrification and all the new tech
   companies that are bringing new wealth and settlement into the city are real.

Talk ID   Title (Speaker)                                                              Word count: English (Lithuanian)
1927      The investment logic for sustainability (Chris McKnett)                      1,614 (1,345)
1978      Embrace the near win (Sarah Lewis)                                           1,772 (1,362)
2009      A glimpse of life on the road (Kitra Cahana)                                 694 (512)
2150      Social maps that reveal a city’s intersections and separations (Dave Troy)   1,053 (678)
TOTAL                                                                                  5,133 (3,897)

Table 1: The English and the Lithuanian sections of the corpus included in the study

According to PDTB annotation guidelines, in annotating implicit DRs, the annotator has to insert a DC that best expresses the inferred relation between two adjacent sentences. This procedure is adopted here, as in Lithuanian example 3 and its English equivalent in 4. In all the examples, Arg1 is shown in italics, Arg2 is shown in boldface.

3. Ji tokie sudėtingi ir gali atrodyti mums tolimi, kad galime būti linkę daryti štai ką: slėpti galvą smėlyje ir negalvoti apie tai. [Implicit=Bet] Jei tik galite, priešinkitės tam. (Implicit) (Comparison: Contrast)

4. ...bury our heads in the sand and not think about it. [Implicit=But] Resist this, if you can. (Implicit) (Comparison: Contrast)

Alternative lexicalization (AltLex) includes cases of inferred DRs between adjacent clauses, where redundancy appears if an explicit DC is inserted. The reason for this is that the relation is already expressed by some alternatively lexicalized non-connective expression, e.g.

5. Sėkmė mus motyvuoja, bet beveik pasiekta pergalė skatina mus leistis į nuolatinius ieškojimus. [Vieną iš ryškiausių to pavyzdžių pastebime], kai žvelgiame į skirtumą tarp olimpinio sidabro laimėtojų ir bronzos laimėtojų rungtynėms pasibaigus. (AltLex) (Expansion: Instantiation)

6. Success motivates us, but a near win can propel us in an ongoing quest. [One of the most vivid examples of this comes] when we look at the difference between Olympic silver medalists and bronze medalists after a competition. (AltLex) (Expansion: Instantiation)

Entity relations (EntRel) are annotated between adjacent sentences when an entity in one argument is described further in the other argument, as in 7 and its English version in 8.

7. Jie turėtų įvertinti ir tuos efektyvumo rodiklius, kuriuos vadiname ASV: aplinkosauga, socialiniai klausimai ir valdymas. Aplinkosauga apima energijos vartojimą, prieigą prie vandens, atliekų tvarkymą ir taršą ir ekonomišką išteklių naudojimą. (EntRel)

8. Investors should also look at performance metrics in what we call ESG: environment, social and governance. Environment includes energy consumption, water availability, waste and pollution, just making efficient uses of resource. (EntRel)

No relation (NoRel) is annotated when there is no DR inferred by the reader between the adjacent sentences:

9. Tai 4 milijardai viduriniosios klasės žmonių, kuriems reikia maisto, energijos ir vandens. Dabar jūs turbūt klausiate savęs: gal tai tik pavieniai atvejai. (NoRel)

10. That’s four billion middle class people demanding food, energy and water. Now, you may be asking yourself, are these just isolated cases. (NoRel)

TED-MDB adds a new top-level category to the PDTB 3.0 relation hierarchy, called hypophora. This category aims to capture rhetorical question-response pairs, where the question is asked and answered by the speaker. TED-MDB annotates hypophora as a case of AltLex anchored by the question word. Where possible, the additional sense of the Q/R pair may be added.

As in TED-MDB, in Lithuanian, we annotate the question as Arg2, the answer as Arg1. We consider the question as Arg2 because the AltLex is part of the question. The question word (either the wh-word or ar, a specific question particle used in Yes/No questions, which can also serve as an explicit DC in Lithuanian) is selected as AltLex since it marks the DR holding between the question and the answer, as in example 11 and its equivalent in 12:

11. Niekas nepasikeis, [ar] mes bandysime pakeisti, [ar] tu nieko nebandysi (Explicit) (Expansion: Disjunction)

12. Nothing is going to change [either] we try to change something [or] you don’t try anything. (Explicit) (Expansion: Disjunction)

In the following pairs of examples, we provide more cases of how hypophora is annotated in Lithuanian and English. Lithuanian Q/R pairs are annotated for a primary sense, and tagged as hypophora as the secondary sense.

13. [Ar] įmonės, atsižvelgiančios į tvarumą, išties finansiškai sėkmingos? galintis nustebinti atsakymas yra “taip” (Explicit) (AltLex: Ar; Expansion: Level-of-detail: Arg1-as-detail; Hypophora).

14. [Do] companies that take sustainability into account really do well financially? The answer that may surprise you is yes. (AltLex: Do) (Hypophora)

15. [Kodėl] kas nors apskritai rinktųsi tokį gyvenimą - Atsakymas į šį klausimą gali skirtis, kaip skiriasi ir žmonės sutinkami kelyje, bet keliautojai dažnai atsako vienu žodžiu: laisvė. (Explicit) (AltLex: Kodėl; Contingency: Cause: Reason; Hypophora).
16. [Why] anyone would choose a life like this, under the thumb of discriminatory laws, eating out of trash cans, sleeping under bridges, picking up seasonal jobs here and there. The answer to such a question is as varied as the people that take to the road, but travelers often respond with a single word: freedom. (AltLex: Why) (Hypophora)

4. Intra- and Inter-Annotator Agreement

The stability of the annotation scheme is evaluated both by intra- and inter-annotator agreement. One transcript (Talk ID 1978), which comprises approximately 25% of the Lithuanian section of the data, is reannotated by the primary annotator about 2 months after the first annotation, and it is annotated independently by the secondary annotator (cf. Table 2 for the distribution of the annotated, reannotated and independently annotated DR types).1 We measured the F1 score, which evaluates agreement between the annotators regarding the existence of a DR between the same discourse units. To measure agreement on the types and senses of these DRs, we calculated Cohen’s Kappa (Cohen, 1960), which is known to be a robust method to evaluate agreement on categorical items as it takes chance agreement into account. In this preliminary evaluation exercise, we reached very high scores on both measures: the F1 scores for intra- and inter-annotator agreement are 0.933 and 0.944, respectively. The Kappa values for intra- and inter-annotator type agreement are 0.974 and 0.991, respectively; the Kappa values for intra- and inter-annotator sense agreement are 0.967 and 0.989, respectively.

1 The primary and the secondary annotators are the first and the third authors of the study.

                 Primary annotator                    Secondary annotator
Relation Type    1st annotation   2nd annotation
AltLex                 -                2                      -
NoRel                 15               15                     13
Explicit             105              107                    101
Implicit              48               53                     44
EntRel                28               30                     27

Table 2: Frequencies of annotated, reannotated and independently annotated DR types in one Lithuanian transcript

5. Research Findings

In this section, we focus on the whole set of annotated texts in English and Lithuanian and present the frequencies of annotated DR types (Table 3) as well as the frequencies of the annotated top-level senses (Table 4). We then discuss the results.

Relation Type    English   Lithuanian
AltLex              33          7
NoRel               38         24
Explicit           225        297
Implicit           132        177
EntRel              43         44

Table 3: Frequencies of annotated relation types in 4 transcripts in English and Lithuanian

Top-level Sense   English   Lithuanian
Temporal             24         25
Comparison           57         66
Hypophora             9         13
Expansion           213        262
Contingency          94        127

Table 4: Frequencies of annotated top-level senses of the PDTB scheme including Hypophora in 4 transcripts in English and Lithuanian

In Table 3, the low frequency of AltLex annotations in Lithuanian could reveal a tendency in the translators’ choices while translating the DCs: it appears that the translators tended to render DCs by the variants provided by dictionaries rather than using AltLexes, e.g. kai (when), kol (while), nes (because), kadangi (since), etc. This resonates with Baker’s (2011) observation that translators might choose to align the patterns of DCs with the target language.

Another interesting feature observed is that there are more explicit DRs in the Lithuanian transcripts than in the English versions. This might be explained by the translators’ effort to render the implicit DRs of the English texts explicitly in Lithuanian, which is in line with explicitation as observed by Baker (1996). For example:

17. ... that’s okay, right. [Implicit=But] We want more. (Implicit) (Comparison: Concession: Arg2_as_denier)

18. Neblogai, tiesa. [Bet] mes norim daugiau. (Explicit) (Comparison: Concession: Arg2_as_denier)

However, there are also cases when the explicit DCs are rendered implicitly, which might lead to the loss of the sense annotated in the original text. For example:

19. ... only looking at race doesn’t really contribute to our development of diversity. [So] if we’re trying to use diversity as a way to tackle some of our more intractable problems, we need to start to think about diversity in a new way. (Explicit) (Contingency: Cause: Result)

20. ... žiūrėti tik į rasę nepadeda bandant prisidėti prie įvairumo vystymo. [Implicit=Taigi] Bandome įvairumą naudoti sprendžiant kai kurias sudėtingesnes problemas, turime pradėti kitaip galvoti apie įvairumą. (Implicit) (Contingency: Cause: Result)

21. [If] we’re trying to use diversity as a way to tackle some of our more intractable problems, we need to start to think about diversity in a new way. (Explicit) (Contingency: Condition: Arg2_as_condition)

22. Bandome įvairumą naudoti sprendžiant kai kurias sudėtingesnes problemas, [Implicit=todėl] turime pradėti kitaip galvoti apie įvairumą. (Implicit) (Contingency: Cause: Result)
Examples 19-20 and 21-22 show that the translator chose                of Scientists, other Researchers and Students through Prac-
not to render the explicit DCs so and if. However, even                tical Research Activities” measure. For training in anno-
though the sense of ‘result’ could be felt implicitly in 20, in        tation and generating ideas for research, we acknowledge
22, we observe a meaning loss, where the sense of ‘condi-              the support of the STSM grants by TextLink COST action
tion’ is totally lost.                                                 IS1312.
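
Because the Lithuanian transcripts contain more annotated relations overall than the English ones, the raw counts in Tables 3 and 4 are easier to compare as shares of each language's total. The following is a minimal Python sketch of that arithmetic and is not part of the TED-MDB tooling: it uses only the counts reported in the two tables, and the names TABLE_3, TABLE_4 and shares are illustrative assumptions.

```python
# Counts copied from Table 3 (relation types) and Table 4 (top-level senses).
# Each value is (English count, Lithuanian count).
# All names below are illustrative; they do not come from the TED-MDB tools.
TABLE_3 = {
    "AltLex": (33, 7),
    "NoRel": (38, 24),
    "Explicit": (225, 297),
    "Implicit": (132, 177),
    "EntRel": (43, 44),
}

TABLE_4 = {
    "Temporal": (24, 25),
    "Comparison": (57, 66),
    "Hypophora": (9, 13),
    "Expansion": (213, 262),
    "Contingency": (94, 127),
}


def shares(table):
    """Return {label: (english_pct, lithuanian_pct)}, rounded to one decimal.

    Each percentage is the label's count divided by the column total
    of the same language within the given table.
    """
    total_en = sum(en for en, _ in table.values())
    total_lt = sum(lt for _, lt in table.values())
    return {
        label: (round(100 * en / total_en, 1), round(100 * lt / total_lt, 1))
        for label, (en, lt) in table.items()
    }


if __name__ == "__main__":
    for name, table in (("Table 3", TABLE_3), ("Table 4", TABLE_4)):
        print(name)
        for label, (en, lt) in shares(table).items():
            print(f"  {label:<12} English {en:5.1f}%   Lithuanian {lt:5.1f}%")
```

On the Table 3 counts, explicit relations come to roughly 47.8% of the English and 54.1% of the Lithuanian relations, and AltLex to roughly 7.0% and 1.3%; on the Table 4 counts, Expansion accounts for roughly 53.7% and 53.1%. These shares only restate the reported counts and are not a significance test.
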
Finally, the annotation of EntRels also revealed some inter-
esting cases. We observed that in some Lithuanian transla-                                   References
tions, the EntRel is present in two loosely related sentences as in 23, while in the source English
text there is just one sentence lacking two separate arguments (see 24):

23. Tad pasakysiu kai ką, kas gali jus nustebinti: galios balansas, galintis išties paveikti
    tvarumą, yra institucinių investuotojų rankose. Tai tokie didieji investuotojai kaip pensijų
    fondai, kiti fondai ir labdaros fondai. (EntRel)

24. And here’s something that may surprise you: the balance of power to really influence
    sustainability rests with institutional investors, the large investors like pension funds,
    foundations and endowment.

The frequencies of the top-level senses of DRs appear to be distributed approximately equally in
both languages, as indicated in Table 4.

   6.    Summary, Conclusions and Outlook
The research findings presented here represent our initial observations and reveal certain
tendencies in rendering the discourse of English TED talks in Lithuanian. Our focus has been on
how DRs are expressed in Lithuanian transcripts. We observed that there are more explicit DRs in
the Lithuanian transcripts, which might be explained by the translators’ efforts to render the
implicit DRs explicitly; this is in line with the observations of Baker (2011). On the other hand,
we noticed that rendering explicit DCs implicitly might lead to the loss of the sense annotated in
the original text. Such choices of the translator could obscure the meaning of the original, and
could be explained by the requirements of synchronization during transcript translation. These
might be the effect of the issues discussed by Lefer and Grabar (2015), who identify subtitling as
a specific type of translation. The annotation of entity relations also reveals interesting cases,
such as the translation of a single English sentence into two loosely related arguments in the
Lithuanian EntRel version. Finally, it should be kept in mind that translators may have stylistic
preferences, e.g. some translators might want to use more explicit connectives, some fewer. The
investigation of individual translators’ choices could be a specific topic for further research.
In the future, by annotating more of the Lithuanian transcripts of the English texts in TED-MDB,
we hope to reveal and specify more translation tendencies. Also, by exploring the transcripts
further, we expect to find out which translation strategies (direct translation, transposition,
etc.) the translators prefer and what this may add to the research field of digital humanities.

               7.    Acknowledgements
This research is funded by the European Social Fund under the No 09.3.3-LMT-K-712 “Development of Competences

Al-Saif, A. and Markert, K. (2010). The Leeds Arabic Discourse Treebank: Annotating discourse connectives for Arabic. In LREC.
Baker, M. (1996). Corpus-based translation studies: The challenges that lie ahead. Benjamins Translation Library, 18:175–186.
Baker, M. (2011). In Other Words: A Coursebook on Translation. Routledge.
Cettolo, M., Girardi, C., and Federico, M. (2012). WIT3: Web Inventory of Transcribed and Translated Talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), volume 261, page 268.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.
Hirschberg, J. and Litman, D. (1987). Now let’s talk about now: Identifying cue phrases intonationally. In Proceedings of the 25th Annual Meeting on Association for Computational Linguistics, pages 163–171. Association for Computational Linguistics.
Holes, C. (1984). Textual approximation in the teaching of academic writing to Arab students: A contrastive approach. English for Specific Purposes in the Arab World, pages 228–242.
Lee, A., Prasad, R., Webber, B. L., and Joshi, A. K. (2016). Annotating discourse relations with the PDTB Annotator. In COLING (Demos), pages 121–125.
Lefer, M.-A. and Grabar, N. (2015). Super-creative and over-bureaucratic: A cross-genre corpus-based study on the use and translation of evaluative prefixation in TED talks and EU parliamentary debates. Across Languages and Cultures, 16(2):187–208.
Oza, U., Prasad, R., Kolachina, S., Sharma, D. M., and Joshi, A. (2009). The Hindi Discourse Relation Bank. In Proc. of the 3rd Linguistic Annotation Workshop, pages 158–161. Association for Computational Linguistics.
Prasad, R., Webber, B., and Joshi, A. (2014). Reflections on the Penn Discourse Treebank, comparable corpora, and complementary annotation. Computational Linguistics.
Smith, R. N. and Frawley, W. J. (1983). Conjunctive cohesion in four English genres. Text-Interdisciplinary Journal for the Study of Discourse, 3(4):347–374.
Šveikauskienė, D. and Telksnys, L. (2014). Accuracy of the parsing of Lithuanian simple sentences. Information Technology and Control, 43(4):402–413.

Webber, B., Prasad, R., Lee, A., and Joshi, A. (2016). A discourse-annotated corpus of conjoined VPs. In Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016), pages 22–31.
Zeyrek, D., Demirşahin, I., Sevdik-Çallı, A., and Çakıcı, R. (2013). Turkish Discourse Bank: Porting a discourse annotation style to a morphologically rich language. Dialogue and Discourse, 4(2):174–184.
Zeyrek, D., Mendes, A., and Kurfalı, M. (2018). Multilingual extension of PDTB-style annotation: The case of TED Multilingual Discourse Bank. In LREC 2018.
Zhou, Y. and Xue, N. (2012). PDTB-style discourse annotation of Chinese text. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 69–77. Association for Computational Linguistics.



