=Paper=
{{Paper
|id=Vol-2836/qurator2021_paper_16
|storemode=property
|title=Quotations, Coreference Resolution, and Sentiment Annotations in Croatian News Articles: An Exploratory Study
|pdfUrl=https://ceur-ws.org/Vol-2836/qurator2021_paper_16.pdf
|volume=Vol-2836
|authors=Jelena Sarajlić,Gaurish Thakkar,Diego Fernando Válio Antunes Alves,Nives Mikelic Preradovic
|dblpUrl=https://dblp.org/rec/conf/qurator/SarajlicTAP21
}}
==Quotations, Coreference Resolution, and Sentiment Annotations in Croatian News Articles: An Exploratory Study==
<pdf width="1500px">https://ceur-ws.org/Vol-2836/qurator2021_paper_16.pdf</pdf>
<pre>
        Quotations, Coreference Resolution, and
        Sentiment Annotations in Croatian News
            Articles: An Exploratory Study*

    Jelena Sarajlić1[0000−0003−0986−3972] , Gaurish Thakkar1[0000−0002−8119−5078] ,
               Diego Alves1[0000−0001−8311−2240] , and Nives Mikelic
                            Preradović1[0000−0001−9087−0074]

     Faculty of Humanities and Social Sciences, University of Zagreb, Zagreb 10000,
                                        Croatia
            jelenasarajlic2@gmail.com, {dfvalio, nmikelic}@ffzg.hr,
                                gthakkar@m.ffzg.hr


        Abstract. This paper presents a corpus annotated for the task of
        direct-speech extraction in Croatian. It focuses on the annotation of
        quotations, co-reference resolution, and sentiment in the Croatian
        SETimes news corpus, and on the analysis of language-specific
        differences compared to English. From this, a list of the phenomena that
        require special attention when performing these annotations is derived.
        The resulting corpus, annotated with quotation features, can be used
        for multiple tasks in the field of Natural Language Processing.

        Keywords: reported-speech · linguistic-phenomenon · resource-creation.


1     Introduction

Quotes are an essential part of news articles and stories made by the media,
individuals, or other organisations. The reproduction of the spoken-text includes
public opinion which expresses personal and subjective information about events
of the world surrounding us. Political analysts and researchers have a major
interest in analysing quotations [4] as it allows a better understanding of the
political dynamics between entities as well as the identification of the correct
source of claims and assertions [16].
    Although recent advances in Machine Learning and Natural Language Pro-
cessing do provide ample support for extracting quotes and identifying the speak-
ers in English texts, very little research has been done for other languages. There-
fore, this paper proposes an original approach for the creation of a corpus of news
texts with quote annotations for the Croatian language, with a detailed analysis
of the corpus genesis process. This news corpus is tagged with quotations, verb-
cues, and speakers’ identification. It also includes co-reference resolution in case
*
    Copyright © 2021 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).
2                                Sarajlić et al.

of pronouns involved in the quotations. Finally, the quotes were tagged in terms
of sentiment (positive, negative, or neutral) of the spoken text at the sentence
level.
    This paper is organized into six sections. Section 2 summarizes the related
work. The dataset and the annotation methodology are described in Section 3.
Section 4 presents statistics of the generated corpus, followed by Section 5,
which illustrates the various phenomena that were encountered. Finally, Section 6
presents the main conclusions of this study.

2     Overview and Related Work
The following sentences are examples of the most common syntactic arrange-
ments of direct quotations that can be found in news articles in English.
 1. Michelle Mcgrath said, “We stand ready to support you in every way”.
 2. “We stand ready to support you in every way” Blair said.
 3. Tony Mcgrath visited Iraq... He said, “We stand ready to support you in
    every way”.
 4. Tony Mcgrath visited Iraq... “We stand ready to support you in every way”
    the Prime Minister said.
 5. “I’m really happy for Fabio” Materazzi told the Apcom news agency Friday.
    “I feel part of this distinction because I think that all the Azzurri helped a
    great champion like Cannavaro win an important prize”.
    The main task of quote extraction is composed of the following sub-tasks [12]:
spoken-text extraction, speaker identification, and verb-cue classification.
Spoken-text extraction deals with extracting the quote’s content from the text.
Speaker identification is the task of identifying the correct speaker of the
extracted quote and attributing the quoted content to him/her. The verb-cue
classifier must recognize the verb introducing the quote. This verb is often
referred to as the ‘verb-cue’ (for example, “say”, “state”, and other quoting
verbs). Additionally, pronouns, such as the one in case 3, must be resolved.
Hence, a co-reference chain connecting the antecedent to the pronoun is needed.
Furthermore, to enrich the corpus with subjective information, the text is also
tagged at the sentence level in terms of sentiment (positive, negative, or
neutral).
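
    The three sub-tasks can be illustrated on arrangements 1 and 2 with a
minimal regular-expression sketch. It is not the method of [12] or of this
paper: the verb-cue list, the capitalised-name pattern, and the function name
are assumptions made only for illustration.

```python
import re

# Hypothetical, deliberately small list of quoting verbs (verb-cues).
VERB_CUES = ("said", "told", "stated", "added")
CUE = "|".join(VERB_CUES)
# Naive assumption: a speaker is a run of capitalised words.
NAME = r"[A-Z][\w.]*(?: [A-Z][\w.]*)*"

# Arrangement 1: Speaker verb-cue, “spoken-text”
SPEAKER_FIRST = re.compile(
    rf"(?P<speaker>{NAME}) (?P<cue>{CUE}),? “(?P<text>[^”]+)”"
)
# Arrangement 2: “spoken-text” Speaker verb-cue
QUOTE_FIRST = re.compile(
    rf"“(?P<text>[^”]+)”,? (?P<speaker>{NAME}) (?P<cue>{CUE})"
)


def extract_quotes(sentence):
    """Return (speaker, verb-cue, spoken-text) triples found in a sentence."""
    triples = []
    for pattern in (SPEAKER_FIRST, QUOTE_FIRST):
        for match in pattern.finditer(sentence):
            triples.append(
                (match.group("speaker"), match.group("cue"), match.group("text"))
            )
    return triples
```

Applied to case 3, such a sketch would return the pronoun “He” as the speaker,
which is exactly where a co-reference chain becomes necessary.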
    Automatic extraction of quote-speaker pairs from news sources has been
well researched for English [12,7]. A sieve-based system for literary text,
along with the dataset “QuoteLi3”, was presented by [10]. Sentiment analysis
based on quotations has been extensively studied for English [2,1,3], but no
previous research has been done for Croatian.

3     Corpus
3.1   SETIMES
The SETimes Croatian corpus was used as the basis for the annotation process.
SETimes is a parallel corpus (CC-BY-SA license) [13] based on the contents pub-
                                      Title Suppressed Due to Excessive Length           3

lished on the SETimes.com news portal, concerning “news and views from South-
east Europe” and covering, in total, ten languages: Bulgarian, Bosnian, Greek,
English, Croatian, Macedonian, Romanian, Albanian, Serbian, and Turkish. The
Croatian corpus is composed of 2.7 million words and 197,559 sentences.1


3.2     Data Pre-processing

We merged all the sentences belonging to a single article into one document by
concatenating them with a space delimiter. From the whole recomposed corpus, a
sub-corpus of 140 randomly selected documents was chosen for the annotation task.
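
A minimal sketch of this pre-processing step is given below. The input layout
(pairs of article identifier and sentence), the function names, and the fixed
random seed are assumptions for illustration and do not reflect the actual
SETimes file format or the script used for this work.

```python
import random
from collections import defaultdict


def merge_articles(sentences):
    """Join the sentences of each article into one document string.

    `sentences` is an iterable of (article_id, sentence) pairs in corpus
    order; this layout is assumed, not taken from the SETimes distribution.
    """
    grouped = defaultdict(list)
    for article_id, sentence in sentences:
        grouped[article_id].append(sentence)
    # A single space is used as the delimiter, as described above.
    return {article_id: " ".join(parts) for article_id, parts in grouped.items()}


def sample_documents(documents, k=140, seed=0):
    """Pick k random document ids; the seed is fixed only for reproducibility."""
    rng = random.Random(seed)
    return rng.sample(sorted(documents), k)
```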


3.3     Annotation

The INCEpTION tool [6] was chosen for performing all annotations. Three custom
layers were employed:

 – Quote Fine - For tagging the various quotation components: speaker/source,
   verb-cue, and spoken-text. This is a span-based annotation.
 – Quote Simple - For tagging the quoted text in terms of its sentiment. This
   is a span-based annotation which takes multiple sentences and tags them
   separately with the corresponding sentiment, namely positive, negative, and
   neutral.
 – Quote Co-reference - For tagging the 3 different possible types of relations
   identified in quotations.
     • Anaphoric - The connection of a pronoun present either in the quoted
       text and/or as the speaker via a co-reference chain.
     • Uses-verb - Connects the speaker and verb-cue as a chain.
     • Verb-spoken-text - Connects the verb-cue to the spoken-text span as
       a chain.
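
    The three layers can be pictured as a small data model. The sketch below
is a hypothetical in-memory representation, not the INCEpTION export format;
all class and field names are assumptions, and the sentiment label in the
usage example is purely illustrative.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Span:
    """A labelled character span, e.g. label='speaker' or label='negative'."""
    start: int
    end: int
    label: str


@dataclass(frozen=True)
class Relation:
    """A directed link between two spans in the Quote Co-reference layer."""
    source: Span
    target: Span
    kind: str  # 'anaphoric', 'uses-verb', or 'verb-spoken-text'


@dataclass
class AnnotatedDocument:
    text: str
    quote_fine: list = field(default_factory=list)         # Span objects
    quote_simple: list = field(default_factory=list)       # sentiment Spans
    quote_coreference: list = field(default_factory=list)  # Relation objects


def span_of(text, fragment, label):
    """Build a Span for the first occurrence of `fragment` in `text`."""
    start = text.index(fragment)
    return Span(start, start + len(fragment), label)


text = "“We stand ready to support you in every way” Blair said."
speaker = span_of(text, "Blair", "speaker")
cue = span_of(text, "said", "verb-cue")
spoken = span_of(text, "We stand ready to support you in every way", "spoken-text")
doc = AnnotatedDocument(
    text=text,
    quote_fine=[speaker, cue, spoken],
    # Illustrative sentiment label only, not a corpus annotation.
    quote_simple=[Span(spoken.start, spoken.end, "positive")],
    quote_coreference=[
        Relation(speaker, cue, "uses-verb"),
        Relation(cue, spoken, "verb-spoken-text"),
    ],
)
```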

    The annotations of the Croatian corpus were performed by a Croatian na-
tive speaker. These annotations serve as a preliminary step to try and assess the
possible problems that could arise during the annotation processes and what fea-
tures of quoting in Croatian should be noted when building a tool for automatic
processing of quotes.


4     Dataset Statistics

This section describes the overall statistics of the created dataset. There is a
total of 2497 annotations concerning speaker and quote identification and
relation types (Table 1). In total, 469 quotes were found in 140 different
documents, around 3 quotes per article. In terms of sentiment tagging, 875
individual sentences were annotated (Table 2).
1
    The corrected version of SETimes corpus where diacritics and encoding system have
    been corrected is available from nlp.ffzg.hr. This version is considered in this paper.

                    Annotation Class          No of Annotations
                    Speakers                                446
                    Spoken-text                             469
                    Verb-cue                                468
                    Anaphoric Relation                       13
                    Uses-verb Relation                      431
                    Verb-spoken-text Relation               515
                    Total                                  2497

    Table 1. Number of annotations in the dataset, except for sentiment tags.


                        Class    Sentences Proportion
                        Positive       284       0.32
                        Neutral        313       0.35
                        Negative       278       0.31
                        Total          875

                    Table 2. Sentiment distribution in the dataset.
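
The third column of Table 2 gives each class as a fraction of the 875
annotated sentences. The short calculation below recomputes these shares, and
the average number of quotes per document, from the counts reported in this
section; the variable names are ours. Rounded to one decimal place, the
recomputed percentages differ slightly from the two-digit values in the table,
which appear to be truncated rather than rounded.

```python
# Counts taken from Table 2 and from the text of this section.
sentiment_counts = {"Positive": 284, "Neutral": 313, "Negative": 278}
total_sentences = sum(sentiment_counts.values())

# Share of each class in percent, rounded to one decimal place.
shares = {
    label: round(100 * count / total_sentences, 1)
    for label, count in sentiment_counts.items()
}

# 469 quotes in 140 documents.
quotes_per_document = round(469 / 140, 2)
```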


5      Specific Croatian Quotation Linguistic Phenomena
Quotes can be categorized into different categories, depending on which feature
one would like to emphasize. In this work, the categorization stated in O’Keefe
et al. [11] was used. This means differentiating between three quote types:
    – Indirect quotes - The quoted content is not inside quotation marks and
      does not follow the speaker’s words precisely, i.e. it is in some way changed
      or paraphrased.
    – Direct quotes - The quoted content in its entirety is inside quotation marks
      and portrays the speaker’s words verbatim.
    – Mixed quotes - The quoted content has both direct and indirect parts,
      meaning that some of the speaker’s words are precisely portrayed, while
      others are paraphrased.
The work described in this paper is focused only on direct quotes. Quote and
quotation are used interchangeably, and both can indicate the whole quote
(speaker+verb-cue+spoken-text) or only the quote content (‘spoken-text’), de-
pending on the context. The ideal quotation would be the one where the speaker,
verb-cue, and spoken-text would be consecutively positioned in a sentence and
not broken apart in any way. Often this is not the case in the corpus (nor in real
life), as appositions, time and date details, and other sentence parts are
positioned in between the speaker, verb-cue, and spoken-text. To try and grasp all
potential issues for automatic detection, attribution, and extraction of quotes,
everything that was not deemed as the ideal quotation was written down as
a problem during the manual annotation, separately from the annotations de-
scribed above. This led to many instances being marked as potential problems,

but by identifying even the smallest deviations from the envisioned standard we
can more easily group problems and try to adapt the automatic processing to
most commonly found issues. Croatian is a South Slavic language and therefore
has some distinct linguistic features when compared to English. Some problems
encountered in English automatic quotation processing, such as pronoun disam-
biguation, do not concern Croatian (at least not to the same extent), and vice-
versa - instances which do not exist in English pose a problem in Croatian. In
the following subsections, we will briefly describe all of the problems encountered
during the annotation process, with a more detailed focus on language-specific
ones, except for one problem because it is strictly a problem with the INCEp-
TION tool.
Altogether, 16 problem clusters were recognized – 15 in the Quote Fine layer and
one in the Quote Simple layer. All of the problems are listed in Table 3, along
with the percentage of documents2 in which they were noticed. It is worth men-
tioning that this is merely an indicator of how many documents had a certain
problem, and not how prevalent a problem might be inside a certain document
(i.e. in one document the problem might arise 10 times, and only once in another
document). Still, it was assessed that this could be useful as a rough estimate of
problem distribution throughout the annotated documents.
All of these phenomena were named by the authors for the purpose of this work
(except for cross-branching and passive) and are not official names for such occur-
rences. The tags in examples were generated by the ReLDI tagger3 for Croatian.
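
Since the percentages in Table 3 are computed over the 120 documents that
contained direct quotes (see footnote 2), they can be converted back into
approximate document counts. The sketch below does this for the reported
values; the function and variable names are ours, and the counts are derived
by rounding, not taken from the annotation files.

```python
ANNOTATED_DOCUMENTS = 120  # documents with direct quotes (footnote 2)

# Percentages as reported in Table 3 (the tool-related row is omitted).
percentages = {
    "Indirectly correlated speaker": 61.67,
    "Indirectly correlated verb-cue": 12.50,
    "Speaker or verb-cue in the middle of the quoted text": 41.67,
    "Quotes without an apparent speaker or verb-cue": 11.67,
    "Quotes marking something other than speech": 33.33,
    "Quotes from parts of documents or a collective": 27.50,
    "Seems to be quoted but no quotes": 1.67,
    "Speaker mentioned multiple times differently": 63.33,
    "Apposition(s) alongside speaker": 78.33,
    "Mixed quotes": 54.17,
    "Non-person speaker": 9.17,
    "Anonymous speaker": 15.83,
    "Cross-branching": 5.0,
    "Passive": 18.33,
    "Sentence parts differently annotated": 32.50,
}


def document_count(percentage, total=ANNOTATED_DOCUMENTS):
    """Convert a percentage of annotated documents into a document count."""
    return round(percentage * total / 100)


document_counts = {case: document_count(p) for case, p in percentages.items()}
```

Under this reading, for example, 61.67% corresponds to 74 documents and 1.67%
to 2 documents.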


5.1   Indirectly correlated speaker and Indirectly correlated verb-cue

Indirectly correlated speaker and Indirectly correlated verb-cue were noted as two
different problems as the former deals only with the speaker, and the latter only
with verb-cue issues. However, they will be described together as they essentially
represent the same issue, i.e. one of the other two constituents of a quote being
separated from the spoken-text. This could create difficulties with automatic
extraction because the speaker and verb-cue can in some cases be far away from
the spoken-text. This can happen with the insertion of time and date details,
appositions, or other sentence parts between the speaker, verb-cue, and spoken-
text (the quoted content itself). While a human reader probably would not have
any problems connecting the correct speaker or verb-cue to its spoken-text, the
presumption was that this could pose a problem for the process of automatic
extraction and disambiguation of quotes. While it would be possible to filter
out the most common words and expressions separating the two, such as well-
known appositions or time/location expressions, some less frequent instances
could create a problem. This is why it was decided to note every case where the
2
  This percentage was calculated by dividing the number of documents containing
  a certain problem by the number of documents which had direct quotes (120
  documents), since only those documents were actually annotated.
3
  ReLDI tagger, available on http://www.clarin.si/services/web/query

      Case                                                 % of documents
      Quote on multiple pages                              [tool problem]
      Indirectly correlated speaker                        61.67
      Indirectly correlated verb-cue                       12.50
      Speaker or verb-cue in the middle of the quoted text 41.67
      Quotes without an apparent speaker or verb-cue       11.67
      Quotes marking something other than speech           33.33
      Quotes from parts of documents or a collective       27.50
      Seems to be quoted but no quotes                     1.67
      Speaker mentioned multiple times differently         63.33
      Apposition(s) alongside speaker                      78.33
      Mixed quotes                                         54.17
      Non-person speaker                                   9.17
      Anonymous speaker                                    15.83
      Cross-branching                                      5.00
      Passive                                              18.33
      Sentence parts differently annotated                 32.50

Table 3. Problem clusters and the percentage of documents in which they were noticed.


speaker or verb-cue were separated from the spoken-text, regardless of whether
there was only one word separating them or whole paragraphs. This can offer
an insight into how this separation happens (in our dataset) and how it could
be tackled.
Example 6. demonstrates an occurrence of Indirectly correlated speaker. ‘Sarah
Lum’ is the speaker whose words were quoted in the following sentence of the
document, meaning that there are five tokens separating the speaker from her
quote and the verb-cue. In this case, there is an apposition with additional details
about the institution and its location (‘predstavnica Američkog ureda u Prištini’,
en. ‘representing the US Office in Pristina’) between the speaker and the quote.


        6. Croatian:

           Sličnu             izjavu             dala       je    Sarah Lum,
      similar-Agpfsay     statement-Ncfsa     give-Vmp-sf be-Var3s speaker

        predstavnica        Američkog        ureda           u        Prištini.
    representative-Ncfsn American-Agpmsgy office-Ncmsg      in-Sl   Pristina-Npfsl

        “QUOTE”,              kazala      je.
          quote             say-Vmp-sf be-Var3s
                                verb-cue

   English: Similar comments came from Sarah Lum, representing the US office
in Pristina. “QUOTE,” she said.
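
    The distance between a speaker and the quote can be measured in word
tokens. The sketch below is a simple illustration using a regular-expression
tokenizer (an assumption; it is not the tokenization of the ReLDI tagger) and
reproduces the five-token gap of Example 6.

```python
import re


def tokens_between(text, speaker, quote_marker="“"):
    """Count word tokens between the end of the speaker and the opening quote."""
    speaker_end = text.index(speaker) + len(speaker)
    quote_start = text.index(quote_marker, speaker_end)
    # \w+ matches Unicode letters in Python 3, so Croatian diacritics are kept.
    return len(re.findall(r"\w+", text[speaker_end:quote_start]))


example_6 = (
    "Sličnu izjavu dala je Sarah Lum, predstavnica Američkog ureda "
    "u Prištini. “QUOTE”, kazala je."
)
gap = tokens_between(example_6, "Sarah Lum")
```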

    In Example 7., a case of Indirectly correlated verb-cue in a mixed quote can be
seen. Verb-cue ‘izjavio je’ (en. ‘has stated’) is separated from the direct part of the
mixed quote by the indirect part of the quote. Mixed quotes were not the focus of
this work, but the direct part of a mixed quote was nevertheless annotated along
with its speaker and verb-cue when judged possible or needed. It is interesting
to note that there were no annotated documents which had Indirectly correlated
verb-cue where there was not also an Indirectly correlated speaker4, even though
there were many documents with instances of Indirectly correlated speaker, but
no Indirectly correlated verb-cue. Since these observations are made only on the
document, and not on the quote-level, more detailed research would be needed
to establish whether this is merely a coincidence or not.


    7. Croatian:

    “QUOTE”      diplomatski               su           ciljevi    Ankare
      quote diplomatic-Agpmpny          be-Var3p     goals-Ncmpn Ankara-Npfsg

        na           Balkanu,             izjavio     je           Gürkan Zengin (...)
       on-Sl       Balkan-Npmsl       state-Vmp-sm be-Var3s           speaker
                                             verb-cue


English: “QUOTE” are Ankara’s Balkan diplomacy goals, according to
Gürkan Zengin (...)


5.2     Speaker and/or verb-cue in the middle of the quoted text

This problem cluster grouped together all cases where one quote’s content was
separated by other text in which the speaker and/or the verb-cue occurred.
This type of quoting can be seen in Example 5 in Section 2. This meant that a
quote that would be understood by a human reader as one individual quote was
separated into multiple spoken-texts. This could create problems with automatic
extraction of quotes because it could happen that only a part of the quote’s
content would be extracted with the speaker and verb-cue, while the other part
could be ignored, extracted without speaker/verb-cue, or with them wrongly
attributed. Example 8. demonstrates a quote from one Chernomorets resident,
whose statement was separated by a short text with time (‘krajem rujna’, en.
‘end of September’) and source details (‘za SETimes’, en. ‘for SETimes’), but also
mentions speaker and verb-cue. Tags ‘Quote1.1’ and ‘Quote1.2’ in the example
4
    There was one exception, but the Indirectly correlated verb-cue occurred in a passive
    clause with no agent, so there was no speaker in the first place.

tag the first and the second part of his quote, respectively. After the second
part of the quote, the article in question continues with its topic with no further
reference to the fisherman or his words. Since there are no clear indications about
speaker and verb-cue for the second part of the quote it could, in theory, easily
be mistakenly extracted as a quote with no speaker/verb-cue (or simply ignored
because of “lack” of these constituents) or some other speaker and verb-cue could
falsely be attributed to it. This is why it was considered important to note these
cases, so that data on how best to join separated quote contents would be
available in later steps of our work.


         8. Croatian:

        “QUOTE1.1”,               rekao           je                   krajem
    part 1/2 of the quote      say-Vmp-sm      be-Var3s                end-Sg
                                      verb-cue

           rujna                     za               SETimes    Angel Kishev
      September-Ncmsg              for-Sa          SETimes-Npmsan speaker

          80-godišnji             ribar                  iz           grada
     80-year-old-Agpmsny     fisherman-Ncmsn           from-Sg      town-Ncmsg

       Chernomoretsa.        “QUOTE1.2”
     Chernomorets-Npmsg part 2/2 of the quote


   English: “QUOTE1.1,” Angel Kishev, an 80-year-old fisherman from the
town of Chernomorets, told SETimes in late September. “QUOTE1.2”
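
One possible way of joining such separated parts is sketched below. It is a
heuristic assumed for illustration only, not a procedure used in this work: a
quote without a speaker is merged into the preceding quote when it stands in
the sentence directly after it. The dictionary keys are hypothetical.

```python
def merge_split_quotes(quotes):
    """Merge speaker-less quotes into the attributed quote right before them.

    `quotes` is a list of dicts in document order with the (hypothetical)
    keys 'text', 'speaker', 'verb_cue' and 'sentence' (sentence index).
    """
    merged = []
    for quote in quotes:
        previous = merged[-1] if merged else None
        if (
            quote["speaker"] is None
            and previous is not None
            and previous["speaker"] is not None
            and quote["sentence"] == previous["sentence"] + 1
        ):
            # Treat the quote as a continuation of the previous one.
            previous["text"] += " " + quote["text"]
            previous["sentence"] = quote["sentence"]
        else:
            merged.append(dict(quote))  # copy, so the input is not modified
    return merged


example_8 = [
    {"text": "QUOTE1.1", "speaker": "Angel Kishev", "verb_cue": "rekao je", "sentence": 0},
    {"text": "QUOTE1.2", "speaker": None, "verb_cue": None, "sentence": 1},
]
joined = merge_split_quotes(example_8)
```

Such a rule would wrongly merge a following quote by a different, unnamed
source, so it could only serve as a first approximation.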


5.3     Quotes without an apparent speaker or verb-cue

Problem cluster Quotes without an apparent speaker or verb-cue deals with
those quotes whose speaker and/or verb-cue would be apparent to a human
reader, but not necessarily to a tool for automatic quotation extraction. This
could lead to quotes in this cluster being ignored or wrongly
extracted/attributed. This was already partially discussed in Subsection 5.2, where
it was described how some parts of the separated quote could fall into this cat-
egory. However, there are other cases in which this problem arises. Example 9.
shows a sentence before the problematic quote. After the quote, the article con-
tinues with no additional information about its speaker or verb-cue. The sentence
shown in the example contains a quote from, presumably, a different source - it
seems to list the UN’s endorsed standards. The real quote from Holkeri follows
after that sentence. The system or the tool for automatic processing of quotes
would first have to recognize that the inserted quote is not Holkeri’s quote, but
rather a part of some document. The next step would be to attribute Holkeri’s

actual quote to him, even though it does not have a specific verb-cue, is not
in the same sentence as the speaker, and one has to infer that this indeed is
Holkeri’s quote.


      9. Croatian:

         Plan,          dodao             je         Holkeri,             za
      plan-Ncmsn     add-Vmp-sm        be-Var3s      speaker            for-Sa

           cilj           ima           osigurati       da              ciljevi
      goal-Ncmsan     have-Vmr3s      ensure-Vmn      that-Cs        goal-Ncmpn

    standarda       koje                  su           odredili        UN –
 standard-Ncmpg which-Pi-mpa           be-Var3p     set-Vmp-pm       UN-Npmpn

       “učiniti        Kosovo       boljom          sredinom             za
      make-Vmn       Kosovo-Npnsa good-Agcfsiy      place-Ncfsi         for-Sa

        sve:       sigurnom,      stabilnom              i         prosperitetnom” –
 everyone-Agpnsay safe-Agpnsly stable-Agpmsly         and-Cc      prosperous-Agpmsly

     postanu           realnost.      “QUOTE”
  become-Vmr3p       reality-Ncfsa      quote


   English: The plan, Holkeri said, aims to ensure that the goal of the UN-
endorsed standards – “to make Kosovo a better place for everyone: safe, stable
and prosperous” – will become a reality. “QUOTE”

5.4     Quotes marking something other than speech
As the name would suggest, this problem cluster dealt with quotation marks
marking something other than direct speech. These could be quotes marking
informal style, quotes around some named entity etc. Quotes marking some-
thing other than speech might “confuse” the system or the tool for automatic
processing of quotes and it could extract those non-speech instances as speech.
Additionally, in some cases it is very hard to judge whether the quotation marks
were used as an indication of some kind of metaphor or exaggeration, or if they
actually were some speaker’s literal words. This dilemma would especially be im-
portant for those dealing with all types of quotes, including mixed quotes, but it
is not that crucial for direct quotation processing. Example 10. shows quotation
marks which are marking a workshop’s name instead of a quote.
    Example 11. represents a more ambiguous case. It is unclear whether the
quotation marks around ‘približnu istinu’ (en. ‘approximate truth’) are some-
one’s literal words (if they are, whose?), an informal reflection of the author’s or
the public’s feelings towards this subject or something else. Furthermore, this is

an opening statement of the text (perhaps it was the article’s title), and it does
not get mentioned again in the text’s content, making the decision even harder.


        10. Croatian:

          Studenti             (...)          organizirali    su       radionicu
       student-Ncmpn                       organize-Vmp-pm be-Var3p workshop-Ncfsa

        “Arhitektura,       tradicije,     sjećanje”.
      architecture-Ncfsn tradition-Ncfsg memory-Ncnsn
      quotes marking something other than speech


   English: Students (...) organised a workshop “Architecture, Traditions, Mem-
ory”.


 11. Croatian:

     Novi                    bi                                se          odbor
 new-Agpmsny              be-Vaa3s                        self-Px--sa  committee-Ncmsn

        izravno           usredotočio                       na               žrtve
     directly-Rgp       focus-Vmp-sm                        on-Sa        victim-Ncfpa

dokumentirajući     “približnu            istinu”.
 document-Rr approximate-Agpfsay          truth-Ncfsa
                 quotes marking something other than speech


   English: A new panel would focus squarely on the victims in documenting
an “approximation of the truth”.
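
A crude filter for such non-speech quotation marks could require a verb-cue in
the same sentence. The sketch below is only an illustration under strong
assumptions: the stem list is hypothetical, built from verb-cues that appear
in the examples of this paper (rekao, kazala, izjavio, dodao), and prefix
matching will also accept unrelated words that share a stem.

```python
import re

# Hypothetical stems of Croatian quoting verbs seen in this paper's examples.
VERB_CUE_STEMS = ("rek", "kaza", "izjavi", "doda")


def quoted_spans(sentence):
    """Return the contents of all “...” pairs in the sentence."""
    return re.findall(r"“([^”]+)”", sentence)


def speech_candidates(sentence):
    """Return quoted spans only if the sentence contains a verb-cue form."""
    words = re.findall(r"\w+", sentence.lower())
    if any(word.startswith(VERB_CUE_STEMS) for word in words):
        return quoted_spans(sentence)
    return []
```

With this rule, the workshop name in Example 10 and the phrase in Example 11
would not be extracted as speech, but direct parts of mixed quotes introduced
by atypical verbs would be missed as well.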


5.5      Quotes from parts of documents or a collective

Quotes from a collective are not considered as “real” quotes by the guidelines of
Agence France-Presse because their real source is unclear [8]. Such an outlook
might not be relevant for automatic processing of quotes because one system or
tool might strive for extracting all quotes, no matter if they are from a collective
or an individual. However, when an article quotes some document or a part of it,
this presumably wouldn’t be acceptable as a quote even for the most inclusive
approaches. Therefore both of these types of quotes were marked as potential
problems so it could be possible to reflect on them later and decide which ones
to consider and treat as quotes, and which ones to ignore. This problem will
further be discussed in Subsection 5.12 since it often occurred with the problem
described in that subsection.

5.6     Seems to be quoted but no quotes
Seems to be quoted, but no quotes refers to sentences or paragraphs that seem
like they could be someone’s direct statement, but they are missing quotation
marks. When the annotation process first started, it seemed like many documents
had such occurrences. Later on, it was realized that those sentences were actually
source information for the article, and it was simply the reporter’s name and the
(presumed) title, subtitle, or introductory sentence of the article. After ignoring
those introductory details of articles, only two instances of this problem were
found. One such instance can be seen in Example 12., where the quote came
after another quote, speaker, and verb-cue, but the quotation marks on the left
side of this quote were forgotten. This could lead to the quote being missed in
the automatic extraction process.


         12. Croatian:

             Čelnici   moraju            voditi,           a            ne
         leader-Ncmpn must-Vmr3p        lead-Vmn          and-Cc       not-Qz

             samo           pratiti        svoje         pristaše”.
           only-Rgp      follow-Vmn one’s own-Px-mpa follower-Ncmpn
      English: Leaders must lead, and not merely follow their followers”.
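
Missing quotation marks of this kind can be flagged by checking whether
opening and closing marks are balanced. The sketch below assumes that the
corpus uses the “ and ” characters, as in the examples of this paper; the
function name is ours.

```python
def unbalanced_quote_positions(text, opening="“", closing="”"):
    """Return (unmatched opening offsets, unmatched closing offsets)."""
    open_stack = []
    stray_closing = []
    for offset, char in enumerate(text):
        if char == opening:
            open_stack.append(offset)
        elif char == closing:
            if open_stack:
                open_stack.pop()  # this closing mark has a partner
            else:
                stray_closing.append(offset)  # closing mark with no opening
    return open_stack, stray_closing


example_12 = "Čelnici moraju voditi, a ne samo pratiti svoje pristaše”."
unmatched_open, unmatched_close = unbalanced_quote_positions(example_12)
```

For Example 12, the single closing mark is reported as unmatched, which could
trigger a search for the start of the quote in the preceding text.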

5.7     Speaker mentioned multiple times differently
In news articles, one individual speaker is often referred to in many ways after
first being introduced in the text. In English, this might be done with the
surname of the speaker, his or her title (such as “prime minister”), or a pronoun,
which then creates a problem because pronoun disambiguation is needed. In Croatian,
it is not usual to use pronouns when referring to a person without using their
name/title because the predicate form often indicates gender, so pronoun
disambiguation was not something crucial to have in mind at this stage. In the rare
articles where this was the case, the pronoun was annotated as having an anaphoric
relationship with the name. Surnames and titles are also often used to refer to
the speaker. Creating some sort of connections between surnames/titles and the
speaker’s full name will be needed in the future to identify the speaker’s full
name for all his or her quotes found in the text. In general, different versions
of someone’s name were not annotated as anaphoric (so surnames were not
annotated as having an anaphoric relationship with the first name or full name).
When the speaker of the quote was the speaker’s title, the title and the name
were tagged as having an anaphoric relationship.
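
A first approximation of the surname-to-full-name connection is sketched
below. It is a hypothetical exact-match rule, not part of the annotation
described here, and it ignores Croatian case inflection of surnames, which a
real system would have to handle (for instance through lemmatization).

```python
def link_surname(full_names, mention):
    """Return the full names whose last token equals the mention."""
    return [name for name in full_names if name.split()[-1] == mention]


# Speakers taken from the examples in this paper.
speakers = ["Sarah Lum", "Gürkan Zengin", "Angel Kishev"]
candidates = link_surname(speakers, "Zengin")
```

A mention is resolved only when exactly one candidate is returned; titles and
appositions would need a separate mapping.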

5.8     Apposition(s) alongside speaker
Many times when a source is quoted, his or her title or some other description
of who this person is (like the ‘80-year-old fisherman’ in Example 8.) are also

mentioned. All of these occurrences were grouped together under the problem
cluster Apposition(s) alongside speaker and marked when an apposition of any
kind would appear next to the speaker. This problem cluster is not exactly a
problem because appositions alongside speakers would not in any way interfere with the
process of automatic extraction/disambiguation of quotations. On the contrary,
they were marked to gain a rough overview of what kinds of expressions could be
used instead of the speaker’s name when referring to him/her and how to extract
and connect them to the speaker’s name. The potential dataset one could gather
from a collection of appositions could be used for resolving issues described in
Subsection 5.7.


5.9     Mixed quotes

Mixed quotes are quotes that combine direct and indirect quoting styles. While
this work is currently focused only on direct quotations, an effort was made to
annotate direct parts of mixed quotes whenever it was judged possible or necessary
to do so, most often when the meaning or purpose of the direct part would be
clear enough on its own. Example 13. demonstrates a mixed quote - parts of the
speaker’s statement were put in quotation marks, and other parts (the speaker
expressing condolences) were indirectly quoted. The mixed quote in this exam-
ple is even more problematic than others could be because the directly quoted
statement is expressed in the 3rd person singular, and not in 1st person singular
as one would expect of someone speaking for himself. Based on that, it can be
concluded that even this direct part was changed and possibly filtered by the
article’s author.


     13. Croatian:
        Rekavši         kako             se        “sjeća,     te    kako
        say-Rr          how-Cs       self-Px--sa remember-Vmr3s and-Cc how-Cs

          je            svjestan     dubine           (...)”      Peres     je
       be-Var3s      aware-Agpmsnn depth-Ncfsg                   speaker be-Var3s

      izrazio        sućut              (...)
 express-Vmp-sm condolence-Ncfsa


   English: Saying he “remembers and is aware of the depth (...),” Peres extended
condolences (...)

Direct parts of mixed quotes sometimes do not have their own verb-cue because
the verb-cue with which they were introduced is not a typical quoting verb.
Example 14. portrays this nicely, as ‘izrazio žaljenje’ (en. ‘voiced regret’) is far
from what one would consider a typical quoting verb. Often these annotated
direct parts of mixed quotes were annotated only partially - without verb-cue or
without the connections in the Quote Co-reference layer.
       14. Croatian:

           (...)           Holkeri            je           takoder         izrazio
                          speaker          be-Var3s       also-Rgp    express-Vmp-sm

          žaljenje          jer             “nisu           sve       zajednice”
       regret-Ncnsa      because-Cs        be-Var3p     all-Agpfpny community-Ncfpn

        sudjelovale           u             izradi         plana.
    participate-Vmp-pf      in-Sl     development-Ncfsl plan-Ncmsg


    English: (...) Holkeri also voiced regret over the fact that “not every commu-
nity” participated in the development of the plan.

5.10    Non-person speaker and Anonymous speaker
These two problems were marked down separately, but both concern issues with
speaker disambiguation and attribution so they will briefly be described together.
Non-person speaker was used to mark those sources which were not persons, but
rather collectives or documents. Because it was often found alongside Quotes
from parts of documents or a collective, it was used as an addition to that prob-
lem cluster.
Anonymous speaker problem cluster marked all speakers who were not men-
tioned by their name, but rather with an apposition, title etc. Example 15. has
a quote whose source is a collective - ‘dvoje čelnika’ (en. ‘two leaders’) - and
additionally, they are not named. Future work should decide on how to treat
such “speakers” and their quotes, as already mentioned in Subsection 5.5.

         15. Croatian:

          “QUOTE”,           kazalo           je        dvoje            čelnika
            quote          say-Vap-sn      be-Var3s    two-Mls       leader-Ncmpg


               u            priopćenju   nakon     sastanka.
             in-Sl       statement-Ncnsl after-Sg meeting-Ncmsg

     English: “QUOTE”, the two leaders said in a statement after the meeting.
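
As a rough illustration of how the Anonymous speaker cluster could be flagged
automatically, the following Python sketch checks whether a speaker span
contains a proper noun. It assumes that tokens carry MSD-style tags like those
in the glosses above, where proper nouns start with Np and common nouns with
Nc. The function name and the tag given to 'Peres' are hypothetical and are not
part of the annotation scheme described in this paper.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, MSD-style tag as in the glosses)


def is_anonymous_speaker(speaker_tokens: List[Token]) -> bool:
    """Return True when no token in the speaker span is a proper noun (Np...)."""
    return not any(tag.startswith("Np") for _, tag in speaker_tokens)


# Speaker span from Example 15: 'dvoje čelnika' (en. 'two leaders').
two_leaders = [("dvoje", "Mls"), ("čelnika", "Ncmpg")]

# Hypothetical tag for a named speaker; the glosses only mark it as "speaker".
named_speaker = [("Peres", "Npmsn")]

print(is_anonymous_speaker(two_leaders))
print(is_anonymous_speaker(named_speaker))
```

Such a check says nothing about whether the source is a person, a collective or
a document, so the Non-person speaker distinction would still require manual
annotation or an additional resource.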

5.11     Cross-branching
This phenomenon is also referred to as “discontinuous constituents”5 and occurs
when some sentence parts split other sentence parts, such as predicates or noun
groups, by appearing in the middle of the second sentence part’s construction.
In Croatian, it is often observed as a splitting of the predicate.

5
    The term cross-branching is used in [9], while the term discontinuous constituents
    can be found in [14]. In [15] crossing edges are mentioned. Other terms could also
    be used to describe this phenomenon.

    As evident in Example 16., the auxiliary part of the predicate is separated
from the participle part by four other tokens (in the example, the notation P
means predicate). This is a problem for verb cue classifiers because they would
have to recognize such instances and filter out the full predicate from the sen-
tence. However, this could be easily solved by automatically adding the auxiliary
verb/copula when what seems like a lone participle is found. The gender of the
auxiliary verb can be easily deduced from the morphological form of the partici-
ple, since parts of a predicate must agree in gender.

                16. Croatian:

                   Vojnici                   su                svoje
               soldier-Ncmpn              be-Var3p      their-own-Px-nsa
                                     aux. part of the P

                angažiranje                    u               Iraku
             engagement-Ncnsa                 in-Sl          Iraq-Npmsl

                    opisali                    kao           “QUOTE”
              describe-Vmp-pm                 as-Cs            quote
          participle part of the P
     English: The soldiers described the engagement in Iraq as “QUOTE”
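
The idea of re-attaching the auxiliary to a lone participle can be sketched in
Python as follows. This is only a minimal illustration, not the procedure used
in this work. It assumes MSD-style tags laid out as in the glosses above
(present auxiliary Var3p, participle Vmp-pm, with the number at the fifth
character of both tags), and it pairs each participle with the nearest
auxiliary that agrees with it in number. All function names are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, MSD-style tag as in the glosses)


def is_participle(tag: str) -> bool:
    # e.g. Vmp-pm: verb, main, participle, -, plural, masculine
    return len(tag) >= 5 and tag[0] == "V" and tag[2] == "p"


def is_present_auxiliary(tag: str) -> bool:
    # e.g. Var3p: verb, auxiliary, present, 3rd person, plural
    return len(tag) >= 5 and tag.startswith("Var")


def pair_split_predicates(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Return (auxiliary index, participle index) pairs for one sentence."""
    pairs = []
    for p_idx, (_, p_tag) in enumerate(tokens):
        if not is_participle(p_tag):
            continue
        # Candidate auxiliaries must agree with the participle in number.
        candidates = [
            a_idx
            for a_idx, (_, a_tag) in enumerate(tokens)
            if is_present_auxiliary(a_tag) and a_tag[4] == p_tag[4]
        ]
        if candidates:
            nearest = min(candidates, key=lambda a_idx: abs(a_idx - p_idx))
            pairs.append((nearest, p_idx))
    return pairs


# Tokens of Example 16 (the quote itself is left out).
example_16 = [
    ("Vojnici", "Ncmpn"), ("su", "Var3p"), ("svoje", "Px-nsa"),
    ("angažiranje", "Ncnsa"), ("u", "Sl"), ("Iraku", "Npmsl"),
    ("opisali", "Vmp-pm"), ("kao", "Cs"),
]

print(pair_split_predicates(example_16))
```

For Example 16 the sketch pairs 'su' with 'opisali' across the four intervening
tokens. Taking the nearest agreeing auxiliary is a simplification and may fail
in sentences with several clauses of the same number.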

5.12    Passive
Passive in Croatian usually has no agent, meaning that the quotes whose verb-
cue is in passive would be without a speaker. The Croatian school grammar
states that “passive is used when the agent is unknown or one doesn’t wish to
specifically emphasize the agent” [5]. As one could expect, in all of the documents
in which Passive was noted, Quotes from parts of documents or a collective was
also noted. This continues the problem discussed in Subsection 5.5, i.e. whether
those quotes should really be thought of as quotes or not. Sentences with a
predicate in passive are also sometimes vague about the quote’s source,
like in Example 17.

             17. Croatian:
              “QUOTE”,        navodi        se      u       priopćenju.
                 quote     state-Vmr3s self-Px–sa in-Sl  statement-Ncnsl
                                passive

     English: “QUOTE”, it is stated in the statement.
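
A very rough heuristic for flagging verb-cues of this kind is sketched below in
Python. It is an assumption-laden illustration rather than a tested method: it
treats a 3rd person verb immediately followed by the reflexive 'se' as a
possible passive, and reports the clause as agentless when no noun in the
nominative case is present. The tags follow the layout of the glosses above
(person at the fourth character of the verb tag, case at the fifth character of
the noun tag, and 'se' written here as Px--sa). The function name and the
negative examples are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, MSD-style tag as in the glosses)


def looks_like_agentless_se_passive(cue_clause: List[Token]) -> bool:
    """Heuristically flag a cue clause such as 'navodi se u priopćenju'."""
    has_se_verb = any(
        v_tag.startswith("V") and len(v_tag) >= 4 and v_tag[3] == "3"
        and se_tag.startswith("Px") and se_form.lower() == "se"
        for (_, v_tag), (se_form, se_tag) in zip(cue_clause, cue_clause[1:])
    )
    has_nominative_noun = any(
        tag.startswith("N") and len(tag) >= 5 and tag[4] == "n"
        for _, tag in cue_clause
    )
    return has_se_verb and not has_nominative_noun


# Cue clause of Example 17 (the quote itself is left out).
example_17 = [
    ("navodi", "Vmr3s"), ("se", "Px--sa"),
    ("u", "Sl"), ("priopćenju", "Ncnsl"),
]

print(looks_like_agentless_se_passive(example_17))
```

Reflexive verbs that are not passive would also match when their subject is
dropped, so any hit would still need manual inspection.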


5.13    Sentence parts differently annotated
Some of the annotated quotes presented more than one sentiment due to a change
of tone or topic. Such sentences were segmented into parts which then received
different sentiment annotations. These types of sentences can usually be easily
spotted in Croatian because they use so-called “contrary conjunctions” which
link constituent-sentences of contradictory sentiments. Contrary sentences are a
type of complex sentences in which all of the clauses can be independent and
convey contrary meanings. An example can be seen in Example 18., where ‘ali’
(en. ‘but’) is the contrary conjunction. The part before ‘ali’ was annotated as
positive and the rest of the sentence as negative. For easier understanding, the
positive clause has been coloured with green, while the negative one was coloured
with orange in the example given.


 18. Croatian:
    “Vjernik             sam             i              ponosim              se
believer-Ncmsn         be-Var1s       and-Cc         proud-Vmr1s        self-Px–sa
                                   positive

      time,               ali           ubijanje        nevinih         ljudi
    it-Pd-nsi           but-Cc       killing-Ncnsn innocent-Agpmpgy people-Ncmpg
                 contrary conjunction

      ne                 može        nikako              biti               (...)”
     no-Qz            can-Vmr3s     no way-Rgp           be-Van
                                  negative

    English: “I am faithful and proud of it, but killing innocent people cannot
    have (...)”
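
The segmentation described in this subsection can be approximated with a short
Python sketch. It assumes that the quote is already tokenised and tagged in the
MSD style of the glosses above, and that a small, purely illustrative set of
contrary conjunctions (ali, nego, no, već) is sufficient; the set and the
function name are assumptions made for this example. Following the annotation
of Example 18, the conjunction opens the second segment.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, MSD-style tag as in the glosses)

# Illustrative, non-exhaustive set of Croatian contrary conjunctions.
CONTRARY_CONJUNCTIONS = {"ali", "nego", "no", "već"}


def split_at_contrary_conjunction(tokens: List[Token]) -> List[List[str]]:
    """Split a quote into segments, starting a new one at each contrary Cc."""
    segments: List[List[str]] = [[]]
    for form, tag in tokens:
        is_contrary = tag == "Cc" and form.lower() in CONTRARY_CONJUNCTIONS
        if is_contrary and segments[-1]:
            segments.append([])
        segments[-1].append(form)
    return segments


# Tokens of the quote in Example 18 (punctuation and the elided part left out).
example_18 = [
    ("Vjernik", "Ncmsn"), ("sam", "Var1s"), ("i", "Cc"),
    ("ponosim", "Vmr1s"), ("se", "Px--sa"), ("time", "Pd-nsi"),
    ("ali", "Cc"), ("ubijanje", "Ncnsn"), ("nevinih", "Agpmpgy"),
    ("ljudi", "Ncmpg"), ("ne", "Qz"), ("može", "Vmr3s"),
    ("nikako", "Rgp"), ("biti", "Van"),
]

for segment in split_at_contrary_conjunction(example_18):
    print(" ".join(segment))
```

Each resulting segment could then receive its own sentiment label. The
coordinating conjunction 'i' (en. 'and') is tagged Cc as well but does not
trigger a split, because it is not in the contrary set.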

6    Conclusions
This paper presented in detail how a Croatian corpus of 140 documents annotated
for direct quotation features was created and which procedures were employed.
Potential problems for future work in the Quote Fine and Quote Simple layers
were recognized and described, along with examples in Croatian and English.
The work described in this paper provides merely a starting point for creating
a system or a tool for automatic extraction and attribution of quotations in
Croatian. In the future, this gold standard data will be used to automatically
annotate the remaining untagged documents of the Croatian SETimes collection
and create a silver standard dataset for Croatian. Furthermore, the annotations
will be projected to the English SETimes parallel corpus and to its Bulgarian,
Bosnian, Greek, Macedonian, Romanian, Albanian, Serbian, and Turkish
counterparts.

7    Acknowledgements
The work presented in this paper has received funding from the European
Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-
Curie grant agreement no. 812997 and under the name CLEOPATRA (Cross-
lingual Event-centric Open Analytics Research Academy).

References
 1. Balahur, A., Steinberger, R.: Rethinking sentiment analysis in the news: from the-
    ory to practice and back (2009)
 2. Balahur, A., Steinberger, R., Kabadjov, M., Zavarella, V., van der Goot, E., Halkia,
    M., Pouliquen, B., Belyaeva, J.: Sentiment analysis in the news. In: Proceedings
    of the Seventh International Conference on Language Resources and Evaluation
    (LREC’10) (2010)
 3. Balahur, A., Steinberger, R., Van Der Goot, E., Pouliquen, B., Kabadjov, M.:
    Opinion mining on newspaper quotations. In: 2009 IEEE/WIC/ACM International
    Joint Conference on Web Intelligence and Intelligent Agent Technology. vol. 3, pp.
    523–526. IEEE (2009)
 4. Fetzer, A., Weizman, E.: ‘What I would say to John and everyone like John is...’: The
    construction of ordinariness through quotations in mediated political discourse.
    Discourse & Society 29(5), 495–513 (2018)
 5. Hudeček, L., Mihaljević, M.: Hrvatska školska gramatika (2017)
 6. Klie, J.C., Bugert, M., Boullosa, B., de Castilho, R.E., Gurevych, I.: The incep-
    tion platform: Machine-assisted and knowledge-oriented interactive annotation. In:
    Proceedings of the 27th International Conference on Computational Linguistics:
    System Demonstrations. pp. 5–9. Association for Computational Linguistics (June
    2018), http://tubiblio.ulb.tu-darmstadt.de/106270/
 7. Krestel, R., Bergler, S., Witte, R., et al.: Minding the source: Automatic tagging
    of reported speech in newspaper articles. Reporter 1(5), 4 (2008)
 8. de La Clergerie, É., Sagot, B., Stern, R., Denis, P., Recourcé, G., Mignot, V.:
    Extracting and visualizing quotations from news wires. In: Vetulani, Z. (ed.) Hu-
    man Language Technology. Challenges for Computer Science and Linguistics. pp.
    522–532. Springer Berlin Heidelberg, Berlin, Heidelberg (2011)
 9. Lai, P.Y.: The anatomy of translation problems. Chartridge Books Oxford (2013)
10. Muzny, G., Fang, M., Chang, A., Jurafsky, D.: A two-stage sieve approach for quote
    attribution. In: Proceedings of the 15th Conference of the European Chapter of the
    Association for Computational Linguistics: Volume 1, Long Papers. pp. 460–470
    (2017)
11. O’Keefe, T., Pareti, S., Curran, J.R., Koprinska, I., Honnibal, M.: A sequence
    labelling approach to quote attribution. In: Proceedings of the 2012 Joint Confer-
    ence on Empirical Methods in Natural Language Processing and Computational
    Natural Language Learning. pp. 790–799 (2012)
12. Pouliquen, B., Steinberger, R., Best, C.: Automatic detection of quotations in
    multilingual news. In: Proceedings of Recent Advances in Natural Language Pro-
    cessing. pp. 487–492 (2007)
13. Tyers, F.M., Alperen, M.S.: South-East European Times: A parallel corpus of Balkan
    languages. In: Proceedings of the LREC Workshop on Exploitation of Multilingual
    Resources and Tools for Central and (South-) Eastern European Languages. pp.
    49–53 (2010)
14. Van Valin Jr, R.D., et al.: An introduction to syntax. Cambridge University Press
    (2001)
15. Volk, M., Lundborg, J., Mettler, M.: A search tool for parallel treebanks (2007)
16. Vosoughi, S., Roy, D., Aral, S.: The spread of true and false news online. Science
    359(6380), 1146–1151 (2018)

</pre>