Quotations, Coreference Resolution, and Sentiment Annotations in Croatian News Articles: An Exploratory Study

Jelena Sarajlić [0000-0003-0986-3972], Gaurish Thakkar [0000-0002-8119-5078], Diego Alves [0000-0001-8311-2240], and Nives Mikelic Preradović [0000-0001-9087-0074]

Faculty of Humanities and Social Sciences, University of Zagreb, Zagreb 10000, Croatia
jelenasarajlic2@gmail.com, {dfvalio, nmikelic}@ffzg.hr, gthakkar@m.ffzg.hr

Abstract. This paper presents a corpus annotated for the task of direct-speech extraction in Croatian. The paper focuses on the annotation of quotations, co-reference resolution, and sentiment in the Croatian SETimes news corpus, and on the analysis of its language-specific differences compared to English. From this, a list of the phenomena that require special attention when performing these annotations is derived. The generated corpus with quotation feature annotations can be used for multiple tasks in the field of Natural Language Processing.

Keywords: reported-speech · linguistic-phenomenon · resource-creation.

1 Introduction

Quotes are an essential part of news articles and stories produced by the media, individuals, or other organisations. The reproduction of spoken text conveys public opinion, which expresses personal and subjective information about events in the world around us. Political analysts and researchers have a major interest in analysing quotations [4], as this allows a better understanding of the political dynamics between entities as well as the identification of the correct source of claims and assertions [16].

Although recent advances in Machine Learning and Natural Language Processing provide ample support for extracting quotes and identifying the speakers in English texts, very little research has been done for other languages.
Therefore, this paper proposes an original approach for the creation of a corpus of news texts with quote annotations for the Croatian language, with an in-depth analysis of how the corpus was created. This news corpus is tagged with quotations, verb-cues, and speaker identification. It also includes co-reference resolution in cases where pronouns are involved in the quotations. Finally, the quotes were tagged for the sentiment (positive, negative, or neutral) of the spoken text at the sentence level.

This paper is organized into 6 sections. Section 2 summarizes the related work, and Section 3 describes the dataset and the annotation methodology. Section 4 presents statistics on the generated corpus, followed by Section 5, which illustrates the various phenomena that were encountered. Finally, Section 6 presents the main conclusions of this study.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Overview and Related Work

The following sentences are examples of the most common syntactic arrangements of direct quotations that can be found in news articles in English.

1. Michelle Mcgrath said, “We stand ready to support you in every way”.
2. “We stand ready to support you in every way” Blair said.
3. Tony Mcgrath visited Iraq... He said, “We stand ready to support you in every way”.
4. Tony Mcgrath visited Iraq... “We stand ready to support you in every way” the Prime Minister said.
5. “I’m really happy for Fabio” Materazzi told the Apcom news agency Friday. “I feel part of this distinction because I think that all the Azzurri helped a great champion like Cannavaro win an important prize”.

The main task of quote extraction is composed of the following sub-tasks [12]: Spoken-text extraction, Speaker identification, and Verb-cue classifier.
Spoken-text extraction deals with extracting the quote’s content from the text. Speaker identification is the task of identifying the correct speaker of the extracted quote and attributing the quoted content to him/her. The verb-cue classifier must recognize the verb introducing the quote. This verb is often referred to as the ‘verb-cue’ (for example, “say”, “state”, and other quoting verbs). Additionally, pronouns, such as the one in case 3, must be resolved. Hence, a co-reference chain connecting the antecedent to the pronoun is needed. Furthermore, to enrich the corpus with subjective information, the text is also tagged at the sentence level for sentiment (positive, negative, or neutral).

Automatic extraction of quote-speaker pairs from news sources has been well researched for English [12,7]. A sieve-based system for literary text, along with the dataset “QuoteLi3”, was presented by [10]. Sentiment analysis based on quotations has been extensively studied for English [2,1,3], but no previous research has been done for Croatian.

3 Corpus

3.1 SETIMES

The SETimes Croatian corpus was used as the basis for the annotation process. SETimes is a parallel corpus (CC-BY-SA license) [13] based on the content published on the SETimes.com news portal, concerning “news and views from Southeast Europe” and covering, in total, ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian, and Turkish. The Croatian corpus is composed of 2.7 million words and 197,559 sentences.¹

¹ The corrected version of the SETimes corpus, in which diacritics and the encoding have been corrected, is available from nlp.ffzg.hr. This version is used in this paper.

3.2 Data Pre-processing

We merge all the sentences belonging to a single article into one document by concatenating them with a space delimiter. From the whole recomposed corpus, a sub-corpus of 140 random documents was selected for the annotation task.

3.3 Annotation

The INCEpTION tool [6] was chosen for performing all annotations. Three custom layers were employed:

– Quote Fine - For tagging the various quotation components: speaker/source, verb-cue, and spoken-text. This is a span-based annotation.
– Quote Simple - For tagging the quoted text in terms of its sentiment. This is a span-based annotation which takes multiple sentences and tags them separately with the corresponding sentiment, namely positive, negative, or neutral.
– Quote Co-reference - For tagging the 3 different possible types of relations identified in quotations.
   • Anaphoric - The connection of a pronoun present either in the quoted text and/or as the speaker via a co-reference chain.
   • Uses-verb - Connects the speaker and verb-cue as a chain.
   • Verb-spoken-text - Connects the verb-cue to the spoken-text span as a chain.

The annotations of the Croatian corpus were performed by a native speaker of Croatian. These annotations serve as a preliminary step to assess the problems that could arise during the annotation process and the features of quoting in Croatian that should be noted when building a tool for automatic processing of quotes.

4 Dataset Statistics

This section describes the overall statistics of the created dataset. We have a total of 2497 annotations concerning speaker and quote identification and relation types. In total, 469 quotes were found in 140 different documents, around 3 quotes per article. In terms of sentiment tagging, we have annotated 875 individual sentences.

Annotation Class             No. of Annotations
Speakers                     446
Spoken-text                  469
Verb-cue                     468
Anaphoric Relation           13
Uses-verb Relation           431
Verb-spoken-text Relation    515
Total                        2497

Table 1. Number of annotations in the dataset, except for sentiment tags.
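One way to picture how the three layers from Section 3.3 fit together is as a small data model: spans for Quote Fine and Quote Simple, and relations between spans for Quote Co-reference. The Python sketch below is only an illustration under stated assumptions. The class names, the use of character offsets, the direction of the relations, and the choice to include the quotation marks in the spoken-text span are all hypothetical and do not describe the INCEpTION export format or the released corpus. The sentence is the one shown in Example 15.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    start: int   # character offset where the span begins (inclusive)
    end: int     # character offset where the span ends (exclusive)
    label: str   # e.g. "speaker", "verb-cue", "spoken-text", or a sentiment


@dataclass
class Relation:
    source: Span
    target: Span
    label: str   # "anaphoric", "uses-verb", or "verb-spoken-text"


@dataclass
class AnnotatedDocument:
    text: str
    quote_fine: List[Span] = field(default_factory=list)
    quote_simple: List[Span] = field(default_factory=list)  # sentiment spans
    quote_coreference: List[Relation] = field(default_factory=list)

    def covered_text(self, span: Span) -> str:
        """Return the stretch of document text that a span points at."""
        return self.text[span.start:span.end]


def make_span(text: str, fragment: str, label: str) -> Span:
    """Hypothetical helper: locate the first occurrence of a fragment."""
    start = text.index(fragment)
    return Span(start, start + len(fragment), label)


text = "“QUOTE”, kazalo je dvoje čelnika u priopćenju nakon sastanka."
quote = make_span(text, "“QUOTE”", "spoken-text")
cue = make_span(text, "kazalo je", "verb-cue")
speaker = make_span(text, "dvoje čelnika", "speaker")

doc = AnnotatedDocument(
    text=text,
    quote_fine=[speaker, cue, quote],
    # Relation direction (speaker -> verb-cue -> spoken-text) is an assumption.
    quote_coreference=[
        Relation(speaker, cue, "uses-verb"),
        Relation(cue, quote, "verb-spoken-text"),
    ],
)
```

Keeping the relations separate from the spans mirrors the layer split described above: a quote can be annotated in Quote Fine even when no chain is added for it in Quote Co-reference.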
Class       Sentences    Distribution (share of total)
Positive    284          0.32
Neutral     313          0.35
Negative    278          0.31
Total       875

Table 2. Sentiment distribution in the dataset.

5 Specific Croatian Quotation Linguistic Phenomena

Quotes can be grouped into different categories, depending on which feature one would like to emphasize. In this work, the categorization given in O’Keefe et al. [11] was used. This means differentiating between three quote types:

– Indirect quotes - The quoted content is not inside quotation marks and does not follow the speaker’s words precisely, i.e. it is in some way changed or paraphrased.
– Direct quotes - The quoted content in its entirety is inside quotation marks and portrays the speaker’s words verbatim.
– Mixed quotes - The quoted content has both direct and indirect parts, meaning that some of the speaker’s words are precisely portrayed, while others are paraphrased.

The work described in this paper is focused only on direct quotes. ‘Quote’ and ‘quotation’ are used interchangeably, and both can indicate the whole quote (speaker+verb-cue+spoken-text) or only the quote content (‘spoken-text’), depending on the context. The ideal quotation would be one where the speaker, verb-cue, and spoken-text are positioned consecutively in a sentence and not broken apart in any way. Often this is not the case in the corpus (nor in real life), as appositions, time and date details, and other sentence parts are positioned between the speaker, verb-cue, and spoken-text. To capture all potential issues for automatic detection, attribution, and extraction of quotes, everything that deviated from the ideal quotation was written down as a problem during the manual annotation, separately from the annotations described above.
This led to many instances being marked as potential problems, but by identifying even the smallest deviations from the envisioned standard we can more easily group problems and adapt the automatic processing to the most commonly found issues.

Croatian is a South Slavic language and therefore has some distinct linguistic features when compared to English. Some problems encountered in automatic quotation processing for English, such as pronoun disambiguation, do not concern Croatian (at least not to the same extent), and vice versa: phenomena which do not exist in English pose a problem in Croatian. In the following subsections, we briefly describe all of the problems encountered during the annotation process, with a more detailed focus on language-specific ones. One problem is left out because it is strictly a problem with the INCEpTION tool.

Altogether, 16 problem clusters were recognized: 15 in the Quote Fine layer and one in the Quote Simple layer. All of the problems are listed in Table 3, along with the percentage of documents² in which they were noticed. It is worth mentioning that this is merely an indicator of how many documents had a certain problem, and not of how prevalent a problem might be inside a certain document (i.e. in one document the problem might arise 10 times, and only once in another document). Still, it was assessed that this could be useful as a rough estimate of problem distribution throughout the annotated documents. All of these phenomena were named by the authors for the purpose of this work (except for cross-branching and passive), and the names are not official terms for such occurrences. The tags in the examples were generated by the ReLDI tagger³ for Croatian.
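The difference between document-level coverage and prevalence inside documents can be made concrete with a short Python sketch. It assumes a hypothetical list that holds, for each annotated document, how many times a given problem occurred. The function name and the example counts are invented for illustration and are not taken from the corpus; only the figure of 120 documents with direct quotes comes from the paper.

```python
from typing import Sequence


def document_percentage(occurrences_per_document: Sequence[int]) -> float:
    """Percentage of documents in which the problem occurs at least once."""
    affected = sum(1 for count in occurrences_per_document if count > 0)
    return round(100 * affected / len(occurrences_per_document), 2)


# Hypothetical counts for 120 documents: the problem appears 10 times in one
# document, once in each of five others, and never in the remaining 114.
counts = [10, 1, 1, 1, 1, 1] + [0] * 114

share = document_percentage(counts)   # document-level indicator
total_instances = sum(counts)         # how often the problem occurs overall
```

With these made-up counts the document-level figure is 5%, although the problem occurs 15 times in total and two thirds of those occurrences sit in a single document. This is the kind of skew that a per-document percentage cannot show.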
Case                                                    % of documents
Quote on multiple pages                                 [tool problem]
Indirectly correlated speaker                           61.67
Indirectly correlated verb-cue                          12.50
Speaker or verb-cue in the middle of the quoted text    41.67
Quotes without an apparent speaker or verb-cue          11.67
Quotes marking something other than speech              33.33
Quotes from parts of documents or a collective          27.50
Seems to be quoted but no quotes                        1.67
Speaker mentioned multiple times differently            63.33
Apposition(s) alongside speaker                         78.33
Mixed quotes                                            54.17
Non-person speaker                                      9.17
Anonymous speaker                                       15.83
Cross-branching                                         5.00
Passive                                                 18.33
Sentence parts differently annotated                    32.50

Table 3. Problem clusters and the percentage of documents in which they were noticed.

² This percentage was calculated by dividing the number of documents containing a certain problem by the number of documents which had direct quotes (120 documents), since only those documents were actually annotated.
³ ReLDI tagger, available at http://www.clarin.si/services/web/query

5.1 Indirectly correlated speaker and Indirectly correlated verb-cue

Indirectly correlated speaker and Indirectly correlated verb-cue were noted as two different problems, as the former deals only with the speaker and the latter only with verb-cue issues. However, they are described together as they essentially represent the same issue, i.e. one of the other two constituents of a quote being separated from the spoken-text. This could create difficulties with automatic extraction because the speaker and verb-cue can in some cases be far away from the spoken-text. This can happen with the insertion of time and date details, appositions, or other sentence parts between the speaker, verb-cue, and spoken-text (the quoted content itself). While a human reader probably would not have any problems connecting the correct speaker or verb-cue to its spoken-text, the presumption was that this could pose a problem for the process of automatic extraction and disambiguation of quotes. While it would be possible to filter out the most common words and expressions separating the two, such as well-known appositions or time/location expressions, some less frequent instances could create a problem. This is why it was decided to note every case where the speaker or verb-cue was separated from the spoken-text, regardless of whether only one word or whole paragraphs separated them. This can offer an insight into how this separation happens (in our dataset) and how it could be tackled.

Example 6. demonstrates an occurrence of Indirectly correlated speaker. ‘Sarah Lum’ is the speaker whose words were quoted in the following sentence of the document, meaning that there are five tokens separating the speaker from her quote and the verb-cue. In this case, there is an apposition with additional details about the institution and its location (‘predstavnica Američkog ureda u Prištini’, en. ‘representing the US Office in Pristina’) between the speaker and the quote.

6. Croatian: Sličnu izjavu dala je Sarah Lum, predstavnica Američkog ureda u Prištini. “QUOTE”, kazala je.
   Gloss: similar-Agpfsay statement-Ncfsa give-Vmp-sf be-Var3s [speaker] representative-Ncfsn American-Agpmsgy office-Ncmsg in-Sl Pristina-Npfsl [quote] say-Vmp-sf be-Var3s
   Roles: ‘Sarah Lum’ = speaker; ‘kazala je’ = verb-cue
   English: Similar comments came from Sarah Lum, representing the US office in Pristina. “QUOTE,” she said.

In Example 7., a case of Indirectly correlated verb-cue in a mixed quote can be seen. Verb-cue ‘izjavio je’ (en.
‘has stated’) is separated from the direct part of the mixed quote by the indirect part of the quote. Mixed quotes were not the focus of this work, but the direct part of a mixed quote was nevertheless annotated along with its speaker and verb-cue when judged possible or needed. It is interesting to note that there were no annotated documents which had an Indirectly correlated verb-cue without also having an Indirectly correlated speaker⁴, even though there were many documents with instances of Indirectly correlated speaker but no Indirectly correlated verb-cue. Since these observations were made only at the document level, and not at the quote level, more detailed research would be needed to establish whether this is merely a coincidence.

7. Croatian: “QUOTE” diplomatski su ciljevi Ankare na Balkanu, izjavio je Gürkan Zengin (...)
   Gloss: [quote] diplomatic-Agpmpny be-Var3p goals-Ncmpn Ankara-Npfsg on-Sl Balkan-Npmsl state-Vmp-sm be-Var3s [speaker]
   Roles: ‘izjavio je’ = verb-cue; ‘Gürkan Zengin’ = speaker
   English: “QUOTE” are Ankara’s Balkan diplomacy goals, according to Gürkan Zengin (...)

⁴ There was one exception, but the Indirectly correlated verb-cue occurred in a passive clause with no agent, so there was no speaker in the first place.

5.2 Speaker and/or verb-cue in the middle of the quoted text

This problem cluster grouped together all cases where one quote’s content was separated by other text in which the speaker and/or the verb-cue occurred. This type of quoting can be seen in Example 5. in Section 2. It means that a quote that a human reader would understand as one individual quote was separated into multiple spoken-texts. This could create problems with automatic extraction of quotes because only a part of the quote’s content might be extracted with the speaker and verb-cue, while the other part could be ignored, extracted without speaker/verb-cue, or wrongly attributed.

Example 8. demonstrates a quote from one Chernomorets resident, whose statement was separated by a short text which contains time (‘krajem rujna’, en. ‘end of September’) and source details (‘za SETimes’, en. ‘for SETimes’), but also mentions the speaker and verb-cue. Tags ‘Quote1.1’ and ‘Quote1.2’ in the example tag the first and the second part of his quote, respectively. After the second part of the quote, the article in question continues with its topic with no further reference to the fisherman or his words. Since there are no clear indications of the speaker and verb-cue for the second part of the quote, it could, in theory, easily be mistakenly extracted as a quote with no speaker/verb-cue (or simply ignored because of the “lack” of these constituents), or some other speaker and verb-cue could falsely be attributed to it. This is why it was considered important to note these cases, so that data would be available in later steps of our work on how best to rejoin separated quote contents.

8. Croatian: “QUOTE1.1”, rekao je krajem rujna za SETimes Angel Kishev 80-godišnji ribar iz grada Chernomoretsa. “QUOTE1.2”
   Gloss: [part 1/2 of the quote] say-Vmp-sm be-Var3s end-Sg September-Ncmsg for-Sa SETimes-Npmsan [speaker] 80-year-old-Agpmsny fisherman-Ncmsn from-Sg town-Ncmsg Chernomorets-Npmsg [part 2/2 of the quote]
   Roles: ‘rekao je’ = verb-cue; ‘Angel Kishev’ = speaker
   English: “QUOTE1.1,” Angel Kishev, an 80-year-old fisherman from the town of Chernomorets, told SETimes in late September. “QUOTE1.2”

5.3 Quotes without an apparent speaker or verb-cue

Problem cluster Quotes without an apparent speaker or verb-cue deals with those quotes whose speaker and/or verb-cue would be apparent to a human reader, but not necessarily to a tool for automatic quotation extraction. This could lead to quotes in this cluster being ignored or wrongly extracted/attributed. This was already partially discussed in Subsection 5.2, where it was described how some parts of a separated quote could fall into this category.
However, there are other cases in which this problem arises. Example 9. shows a sentence before the problematic quote. After the quote, the article continues with no additional information about its speaker or verb-cue. The sentence shown in the example contains a quote from, presumably, a different source - it seems to cite the UN-endorsed standards. The real quote from Holkeri follows after that sentence. The system or tool for automatic processing of quotes would first have to recognize that the inserted quote is not Holkeri’s quote, but rather a part of some document. The next step would be to attribute Holkeri’s actual quote to him, even though it does not have a specific verb-cue, is not in the same sentence as the speaker, and one has to infer that this indeed is Holkeri’s quote.

9. Croatian: Plan, dodao je Holkeri, za cilj ima osigurati da ciljevi standarda koje su odredili UN – “učiniti Kosovo boljom sredinom za sve: sigurnom, stabilnom i prosperitetnom” – postanu realnost. “QUOTE”
   Gloss: plan-Ncmsn add-Vmp-sm be-Var3s [speaker] for-Sa goal-Ncmsan have-Vmr3s ensure-Vmn that-Cs goal-Ncmpn standard-Ncmpg which-Pi-mpa be-Var3p set-Vmp-pm UN-Npmpn make-Vmn Kosovo-Npnsa good-Agcfsiy place-Ncfsi for-Sa everyone-Agpnsay safe-Agpnsly stable-Agpmsly and-Cc prosperous-Agpmsly become-Vmr3p reality-Ncfsa [quote]
   Roles: ‘Holkeri’ = speaker
   English: The plan, Holkeri said, aims to ensure that the goal of the UN-endorsed standards – “to make Kosovo a better place for everyone: safe, stable and prosperous” – will become a reality. “QUOTE”

5.4 Quotes marking something other than speech

As the name suggests, this problem cluster dealt with quotation marks marking something other than direct speech. These could be quotes marking informal style, quotes around some named entity, etc. Quotes marking something other than speech might “confuse” the system or tool for automatic processing of quotes, and it could extract those non-speech instances as speech.
Additionally, in some cases it is very hard to judge whether the quotation marks were used as an indication of some kind of metaphor or exaggeration, or if they actually were some speaker’s literal words. This dilemma would especially be important for those dealing with all types of quotes, including mixed quotes, but it is not that crucial for direct quotation processing.

Example 10. shows quotation marks which are marking a workshop’s name instead of a quote. Example 11. represents a more ambiguous case. It is unclear whether the quotation marks around ‘približnu istinu’ (en. ‘approximate truth’) are someone’s literal words (if they are, whose?), an informal reflection of the author’s or the public’s feelings towards this subject or something else. Furthermore, this is an opening statement of the text (perhaps it was the article’s title), and it does not get mentioned again in the text’s content, making the decision even harder.

10. Croatian: Studenti (...) organizirali su radionicu “Arhitektura, tradicije, sjećanje”.
    Gloss: student-Ncmpn organize-Vmp-pm be-Var3p workshop-Ncfsa architecture-Ncfsn tradition-Ncfsg memory-Ncnsn
    Roles: “Arhitektura, tradicije, sjećanje” = quotes marking something other than speech
    English: Students (...) organised a workshop “Architecture, Traditions, Memory”.

11. Croatian: Novi bi se odbor izravno usredotočio na žrtve dokumentirajući “približnu istinu”.
    Gloss: new-Agpmsny be-Vaa3s self-Px--sa committee-Ncmsn directly-Rgp focus-Vmp-sm on-Sa victim-Ncfpa document-Rr approximate-Agpfsay truth-Ncfsa
    Roles: “približnu istinu” = quotes marking something other than speech
    English: A new panel would focus squarely on the victims in documenting an “approximation of the truth”.

5.5 Quotes from parts of documents or a collective

Quotes from a collective are not considered as “real” quotes by the guidelines of Agence France-Presse because their real source is unclear [8].
Such an outlook might not be relevant for automatic processing of quotes, because a system or tool might strive to extract all quotes, no matter whether they come from a collective or an individual. However, when an article quotes some document or a part of it, this presumably would not be acceptable as a quote even for the most inclusive approaches. Therefore both of these types of quotes were marked as potential problems, so that it would be possible to reflect on them later and decide which ones to consider and treat as quotes, and which ones to ignore. This problem is discussed further in Subsection 5.12, since it often occurred with the problem described in that subsection.

5.6 Seems to be quoted but no quotes

Seems to be quoted but no quotes refers to sentences or paragraphs that seem like they could be someone’s direct statement, but are missing quotation marks. When the annotation process first started, it seemed that many documents had such occurrences. Later on, it was realized that those sentences were actually source information for the article: simply the reporter’s name and the (presumed) title, subtitle, or introductory sentence of the article. After ignoring those introductory details of articles, only two instances of this problem were found. One such instance can be seen in Example 12., where the quote came after another quote, speaker, and verb-cue, but the opening quotation mark on the left side of this quote was forgotten. This could lead to the quote being missed in the automatic extraction process.

12. Croatian: Čelnici moraju voditi, a ne samo pratiti svoje pristaše”.
    Gloss: leader-Ncmpn must-Vmr3p lead-Vmn and-Cc not-Qz only-Rgp follow-Vmn one’s own-Px-mpa follower-Ncmpn
    English: Leaders must lead, and not merely follow their followers”.
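A missing opening mark of the kind shown in Example 12. can at least be flagged mechanically. The Python sketch below is a minimal illustration, not part of the annotation workflow described in this paper. The function name is hypothetical, and the default quote characters are an assumption based on the marks that appear in the examples; other typographic conventions would need different parameters.

```python
from typing import List


def unmatched_closing_quotes(
    text: str, opening: str = "“", closing: str = "”"
) -> List[int]:
    """Return offsets of closing marks that no earlier opening mark accounts for."""
    unmatched = []
    open_count = 0  # opening marks seen so far that are still waiting for a close
    for offset, char in enumerate(text):
        if char == opening:
            open_count += 1
        elif char == closing:
            if open_count > 0:
                open_count -= 1
            else:
                unmatched.append(offset)
    return unmatched


# The sentence from Example 12, which ends with a closing mark only.
sentence = "Čelnici moraju voditi, a ne samo pratiti svoje pristaše”."
orphans = unmatched_closing_quotes(sentence)
```

Such a check only points at a suspicious closing mark. Deciding where the quote actually begins, and who said it, still depends on the surrounding context.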
5.7 Speaker mentioned multiple times differently

In news articles, one individual speaker is often referred to in many ways after first being introduced in the text. In English, this might be done with the surname of the speaker, his or her title (such as “prime minister”), or a pronoun, which then creates a problem because pronoun disambiguation is needed. In Croatian, it is not usual to use pronouns when referring to a person without using their name/title, because the predicate form often indicates gender, so pronoun disambiguation was not something crucial to have in mind at this stage. In the rare articles where a pronoun was used, the pronoun was annotated as having an anaphoric relationship with the name. Surnames and titles are also often used to refer to the speaker. Creating some sort of connection between surnames/titles and the speaker’s full name will be needed in the future to identify the speaker’s full name for all his or her quotes found in the text. In general, different versions of someone’s name were not annotated as anaphoric (so, surnames were not annotated as having an anaphoric relationship with the first name or full name). When the speaker of a quote was referred to by title, the title and the name were tagged as having an anaphoric relationship.

5.8 Apposition(s) alongside speaker

Often when a source is quoted, his or her title or some other description of who this person is (like the ‘80-year-old fisherman’ in Example 8.) is also mentioned. All of these occurrences were grouped together under the problem cluster Apposition(s) alongside speaker and marked whenever an apposition of any kind appeared next to the speaker. This problem cluster is not exactly a problem, because appositions alongside speakers would not in any way interfere with the process of automatic extraction/disambiguation of quotations.
On the contrary, they were marked to gain a rough overview of what kinds of expressions could be used instead of the speaker’s name when referring to him/her, and of how to extract them and connect them to the speaker’s name. The dataset one could gather from a collection of appositions could be used for resolving the issues described in Subsection 5.7.

5.9 Mixed quotes

Mixed quotes are quotes that combine direct and indirect quoting styles. While this work is currently focused only on direct quotations, an effort was made to annotate direct parts of mixed quotes whenever it was judged possible or necessary to do so, most often when the meaning or purpose of the direct part would be clear enough on its own. Example 13. demonstrates a mixed quote: parts of the speaker’s statement were put in quotation marks, and other parts (the speaker expressing condolences) were indirectly quoted. The mixed quote in this example is even more problematic than others because the directly quoted statement is expressed in the 3rd person singular, and not in the 1st person singular as one would expect of someone speaking for himself. Based on that, it can be concluded that even this direct part was changed and possibly filtered by the article’s author.

13. Croatian: Rekavši kako se “sjeća, te kako je svjestan dubine (...)” Peres je izrazio sućut (...)
    Gloss: say-Rr how-Cs self-Px--sa remember-Vmr3s and-Cc how-Cs be-Var3s aware-Agpmsnn depth-Ncfsg [speaker] be-Var3s express-Vmp-sm condolence-Ncfsa
    Roles: ‘Peres’ = speaker
    English: Saying he “remembers and is aware of the depth (...),” Peres extended condolences (...)

Direct parts of mixed quotes sometimes do not have their own verb-cue because the verb-cue with which they were introduced is not a typical quoting verb. Example 14. portrays this nicely, as ‘izrazio žaljenje’ (en. ‘voiced regret’) is far from what one would consider a typical quoting verb.
Often these annotated Title Suppressed Due to Excessive Length 13 direct parts of mixed quotes were annotated only partially - without verb-cue or without the connections in the Quote Co-reference layer. 14. Croatian: (...) Holkeri je takoder izrazio speaker be-Var3s also-Rgp express-Vmp-sm žaljenje jer “nisu sve zajednice” regret-Ncnsa because-Cs be-Var3p all-Agpfpny community-Ncfpn sudjelovale u izradi plana. participate-Vmp-pf in-Sl development-Ncfsl plan-Ncmsg English: (...) Holkeri also voiced regret over the fact that “not every commu- nity” participated in the development of the plan. 5.10 Non-person speaker and Anonymous speaker These two problems were marked down separately, but both concern issues with speaker disambiguation and attribution so they will briefly be described together. Non-person speaker was used to mark those sources which were not persons, but rather collectives or documents. Because it was often found alongside Quotes from parts of documents or a collective, it was used as an addition to that prob- lem cluster. Anonymous speaker problem cluster marked all speakers who were not men- tioned by their name, but rather with an apposition, title etc. Example 15. has a quote whose source is a collective - ‘dvoje čelnika’ (en. ‘two leaders’) - and additionally, they are not named. Future work should decide on how to treat such “speakers” and their quotes, as already mentioned in Subsection 5.5. 15. Croatian: “QUOTE”, kazalo je dvoje čelnika quote say-Vap-sn be-Var3s two-Mls leader-Ncmpg u priopćenju nakon sastanka. in-Sl statement-Ncnsl after-Sg meeting-Ncmsg English: “QUOTE” the two leaders said in a statement after the meeting. 5.11 Cross-branching This phenomenon is also referred to as “discontinuous constituents”5 and occurs when some sentence parts split other sentence parts, such as predicates or noun 5 The term cross-branching is used in [9], while the term discontinuous constituents can be found in [14]. 
In [15] crossing edges are mentioned. Other terms could also be used to describe this phenomenon. 14 Sarajlić et al. groups, by appearing in the middle of the second sentence part’s construction. In Croatian, it is often observed as a splitting of the predicate. As evident in Example 16., the auxiliary part of the predicate is separated from the participle part by four other tokens (in the example, the notation P means predicate). This is a problem for verb cue classifiers because they would have to recognize such instances and filter out the full predicate from the sen- tence. However, this could be easily solved by automatically adding the auxiliary verb/copula when what seems like a lone participle is found. The gender of the auxiliary verb can be easily deduced from the morphological form of the partici- ple, since parts of a predicate must agree in gender. 16. Croatian: Vojnici su svoje soldier-Ncmpn be-Var3p their-own-Px-nsa aux. part of the P angažiranje u Iraku engagement-Ncnsa in-Sl Iraq-Npmsl opisali kao ”QUOTE” describe-Vmp-pm as-Cs quote participle part of the P English: The soldiers described the engagement in Iraq as “QUOTE” 5.12 Passive Passive in Croatian usually has no agent, meaning that the quotes whose verb- cue is in passive would be without a speaker. The Croatian school grammar states that “passive is used when the agent is unknown or one doesn’t wish to specifically emphasize the agent” [5]. As one could expect, in all of the documents in which Passive was noted, Quotes from parts of documents or a collective was also noted. This continues the problem discussed in Subsection 5.5, i.e. whether those quotes should really be thought of as quotes or not. Sentences with a predicate in passive are also sometimes vague in the sense of the quote’s source, like in Example 17. English: “QUOTE”, it is stated in the statement. 17. Croatian: “QUOTE”, navodi se u priopćenju. 
quote state-Vmr3s self-Px–sa in- Sl statement-Ncnsl passive 5.13 Sentence parts differently annotated Some of the annotated quotes presented more than one sentiment due to a change of tone or topic. Such sentences were segmented into parts which then received different sentiment annotations. These types of sentences can usually be easily Title Suppressed Due to Excessive Length 15 spotted in Croatian because they use so-called “contrary conjunctions” which link constituent-sentences of contradictory sentiments. Contrary sentences are a type of complex sentences in which all of the clauses can be independent and convey contrary meanings. An example can be seen in Example 18., where ‘ali’ (en. ‘but’) is the contrary conjunction. The part before ‘ali’ was annotated as positive and the rest of the sentence as negative. For easier understanding, the positive clause has been coloured with green, while the negative one was coloured with orange in the example given. 18. Croatian: “Vjernik sam i ponosim se believer-Ncmsn be-Var1s and-Cc proud- Vmr1s self-Px–sa positive time, ali ubijanje nevinih ljudi it-Pd-nsi but-Cc killing-Ncnsn innocent-Agpmpgy people-Ncmpg contrary conjuction ne može nikako biti (...)” no-Qz can-Vmr3s no way-Rgp be-Van negative English: “I am faithful and proud of it, & but killing innocent people can not have (...)” 6 Conclusions In our paper, a detailed presentation of how a Croatian corpus of 140 documents annotated in terms of direct quotation features was created and what procedures were employed is offered. Potential problems for future work in the Quote Fine and Quote Simple layer were recognized and described, along with their exam- ples in Croatian and English. The work described in this paper provides merely a starting point in creating a system or a tool for automatic extraction and attri- bution of quotations in Croatian. 
In the future, this gold standard data will be used to automatically annotate the remaining untagged documents of the Croatian SETimes collection and to create a silver standard dataset for Croatian. Furthermore, the annotations will be projected to the SETimes English parallel corpus and to Bulgarian, Bosnian, Greek, Macedonian, Romanian, Albanian, Serbian, and Turkish.

7 Acknowledgements

The work presented in this paper has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement no. 812997 and under the name CLEOPATRA (Cross-lingual Event-centric Open Analytics Research Academy).

References

1. Balahur, A., Steinberger, R.: Rethinking sentiment analysis in the news: from theory to practice and back (2009)
2. Balahur, A., Steinberger, R., Kabadjov, M., Zavarella, V., van der Goot, E., Halkia, M., Pouliquen, B., Belyaeva, J.: Sentiment analysis in the news. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10) (2010)
3. Balahur, A., Steinberger, R., Van Der Goot, E., Pouliquen, B., Kabadjov, M.: Opinion mining on newspaper quotations. In: 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology. vol. 3, pp. 523–526. IEEE (2009)
4. Fetzer, A., Weizman, E.: ‘What I would say to John and everyone like John is...’: The construction of ordinariness through quotations in mediated political discourse. Discourse & Society 29(5), 495–513 (2018)
5. Hudeček, L., Mihaljević, M.: Hrvatska školska gramatika (2017)
6. Klie, J.C., Bugert, M., Boullosa, B., de Castilho, R.E., Gurevych, I.: The INCEpTION platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations. pp. 5–9. Association for Computational Linguistics (June 2018), http://tubiblio.ulb.tu-darmstadt.de/106270/
7. Krestel, R., Bergler, S., Witte, R., et al.: Minding the source: Automatic tagging of reported speech in newspaper articles. Reporter 1(5), 4 (2008)
8. de La Clergerie, É., Sagot, B., Stern, R., Denis, P., Recourcé, G., Mignot, V.: Extracting and visualizing quotations from news wires. In: Vetulani, Z. (ed.) Human Language Technology. Challenges for Computer Science and Linguistics. pp. 522–532. Springer Berlin Heidelberg, Berlin, Heidelberg (2011)
9. Lai, P.Y.: The anatomy of translation problems. Chartridge Books Oxford (2013)
10. Muzny, G., Fang, M., Chang, A., Jurafsky, D.: A two-stage sieve approach for quote attribution. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. pp. 460–470 (2017)
11. O’Keefe, T., Pareti, S., Curran, J.R., Koprinska, I., Honnibal, M.: A sequence labelling approach to quote attribution. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. pp. 790–799 (2012)
12. Pouliquen, B., Steinberger, R., Best, C.: Automatic detection of quotations in multilingual news. In: Proceedings of Recent Advances in Natural Language Processing. pp. 487–492 (2007)
13. Tyers, F.M., Alperen, M.S.: South-East European Times: A parallel corpus of Balkan languages. In: Proceedings of the LREC Workshop on Exploitation of Multilingual Resources and Tools for Central and (South-) Eastern European Languages. pp. 49–53 (2010)
14. Van Valin Jr, R.D., et al.: An introduction to syntax. Cambridge University Press (2001)
15. Volk, M., Lundborg, J., Mettler, M.: A search tool for parallel treebanks (2007)
16. Vosoughi, S., Roy, D., Aral, S.: The spread of true and false news online. Science 359(6380), 1146–1151 (2018)