<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CROATPAS: A Resource of Corpus-derived Typed Predicate Argument Structures for Croatian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Costanza Marini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisabetta Ježek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Pavia, Department of Humanities</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>The goal of this paper is to introduce CROATPAS, the Croatian sister project of the Italian Typed-Predicate Argument Structure resource (TPAS, Ježek et al. 2014). CROATPAS is a corpus-based digital collection of verb valency structures with the addition of semantic type specifications (SemTypes) to each argument slot, which is currently being developed at the University of Pavia. Salient verbal patterns are discovered following a lexicographical methodology called Corpus Pattern Analysis (CPA, Hanks 2004 &amp; 2012; Hanks &amp; Pustejovsky 2005; Hanks et al. 2015), whereas SemTypes - such as [HUMAN], [ENTITY] or [ANIMAL] - are taken from a shallow ontology shared by both TPAS and the Pattern Dictionary of English Verbs (PDEV, Hanks &amp; Pustejovsky 2005; El Maarouf et al. 2014). The theoretical framework the resource relies on is Pustejovsky's Generative Lexicon theory (1995 &amp; 1998; Pustejovsky &amp; Ježek 2008), in light of which verbal polysemy and metonymic argument shifts can be traced back to compositional operations involving the variation of the SemTypes associated with the valency structure of each verb. The corpus used to identify verb patterns in CROATPAS is the Croatian Web as Corpus (hrWac 2.2, RELDI PoS-tagged) (Ljubešić &amp; Erjavec 2011), which contains 1.2 billion types and is available on the Sketch Engine (Kilgarriff et al.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>
        We live in a time when digital tools
and resources for language technology are
constantly mushrooming all around the world.
However, we should remind ourselves that some
languages need our attention more than others if
they are not to face – to put it in Rehm and
Hegele’s words – “a steadily increasing
and rather severe threat of digital extinction”
        (<xref ref-type="bibr" rid="ref17">2018</xref>: 3282).
      </p>
      <p>
        According to the findings of initiatives such as
the META-NET White Paper Series
        (<xref ref-type="bibr" rid="ref4">Tadić et al. 2012</xref>;
        <xref ref-type="bibr" rid="ref5">Rehm et al. 2014</xref>),
Croatian is unfortunately among the 21 out of 24
official languages of the European Union that are
currently considered under-resourced. As a
matter of fact, Croatian “tools and resources for
[…] deep parsing, machine translation, text
semantics, discourse processing, language
generation, dialogue management simply do not
exist”
        (<xref ref-type="bibr" rid="ref4">Tadić et al. 2012</xref>: 77).
This observation is only strengthened by the update study
carried out by Rehm et al.
        (<xref ref-type="bibr" rid="ref5">2014</xref>),
which shows that, in comparison with other European
languages, Croatian has weak to no support as
far as text analytics technologies go and only
fragmentary support when it comes to resources
such as corpora, lexical resources and grammars.
In this framework, a semantic resource such as
CROATPAS could play its part not only in NLP
(e.g. multilingual pattern linking with other
existing compatible resources), but also in
machine translation, computer-assisted
language learning (CALL) and theoretical and
applied cross-linguistic studies.
      </p>
      <p>Copyright © 2019 for this paper by its authors. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).</p>
      <p>The paper is structured as follows: first a detailed
overview of the resource is presented (Section 2),
followed by its theoretical underpinnings
(Section 3) and a summary of the
Croatian-specific challenges we faced while building the
resource editor (Section 4). An overview of the
existing related works is given in Section 5.
Finally, Section 6 hints at the creation of a
multilingual resource linking CROATPAS,
TPAS (Italian) and PDEV (English) patterns and
explores CROATPAS’s potential for
computer-assisted L2 teaching and learning.</p>
    </sec>
    <sec id="sec-3">
      <title>Resource overview</title>
      <p>
        CROATPAS, i.e. the Croatian Typed-Predicate
Argument Structure resource, is the Croatian
equivalent of the Italian TPAS resource
        (<xref ref-type="bibr" rid="ref5">Ježek et al. 2014</xref>)
and is a corpus-derived collection of
Croatian verb argument structures, whose
argument slots have been annotated using
semantic type specifications (SemTypes).
The first version of the resource is currently
being developed at the University of Pavia with
the technical assistance of Lexical Computing
Ltd. in the person of Vít Baisa and will be
released in 2020 through an Open Access
graphical user interface on the website of the
Language Centre of the University of Pavia
(CLA; see footnote 4).
      </p>
      <p>CROATPAS contains a sample of 100
medium-frequency Croatian verbs, whose Italian
translational counterparts are already available in
the TPAS resource: 26 of these verbs are
Croatian translational equivalents of Italian
“coercive verbs”, i.e. verbs that instantiate
metonymic shifts in one of their senses (Ježek &amp;
Quochi 2010), while the remaining 74 are
Croatian translational equivalents of a sample of
Italian fundamental verbs, i.e. verbs belonging to
that group of approximately 2000 lexemes
deemed essential for communicating in Italian
and that can be found in any sort of text (De
Mauro 2016).</p>
      <p>
        Our 74-verb sample was selected as follows: we
first extracted the frequency counts for all the
452 fundamental verbs on De Mauro’s list from a
reduced version of the ItWAC (Baroni &amp;
Kilgarriff, 2006), which contains over 900
million tokens and is available on the Sketch Engine
        (<xref ref-type="bibr" rid="ref5">Kilgarriff et al. 2014</xref>).
We then selected our 74 Italian candidates around the median
frequency value after taking out the first and the
last 20 verbs on the list. Finally, the Croatian
translational equivalents for these verbs were
chosen using the 2017 Zanichelli Italian/Croatian
bilingual dictionary Croato compatto, edited by
Aleksandra Špikić.
Footnote 4: https://cla.unipv.it/?page_id=53723 (last visited on July 12th 2019).
      </p>
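      <p>The selection step just described can be illustrated with a short Python sketch. It is not the authors’ actual script: the function name select_mid_frequency, the toy verb labels and counts, and the reading of “around the median” as a window centred on the middle of the trimmed frequency ranking are all assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
def select_mid_frequency(freqs, trim=20, sample_size=74):
    """Pick `sample_size` verbs centred on the middle of a trimmed frequency ranking.

    `freqs` maps each verb to its corpus frequency (hypothetical input format).
    """
    # Rank verbs from most to least frequent; ties are broken alphabetically.
    ranked = sorted(freqs, key=lambda verb: (-freqs[verb], verb))
    # Drop the `trim` most frequent and the `trim` least frequent verbs.
    trimmed = ranked[trim:len(ranked) - trim]
    if sample_size > len(trimmed):
        raise ValueError("sample_size exceeds the number of verbs left after trimming")
    # Take a window of `sample_size` verbs centred on the middle of the ranking.
    start = (len(trimmed) - sample_size) // 2
    return trimmed[start:start + sample_size]


# Toy input: 452 made-up verbs with distinct made-up counts.
toy_freqs = {f"verb{i:03d}": 10_000 - i for i in range(452)}
sample = select_mid_frequency(toy_freqs)
print(len(sample), sample[0], sample[-1])
```
]]></preformat>
      <p>With the 452-verb toy list, trimming leaves 412 verbs and the function returns the 74 in the middle of that ranking; real counts would have to come from the corpus itself.</p>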
      <p>
        The theoretical framework the resource relies on
is Pustejovsky’s Generative Lexicon theory
        <xref ref-type="bibr" rid="ref16">(1995 &amp; 1998; Pustejovsky &amp; Ježek 2008)</xref>
        , in
light of which verbal polysemy and metonymic
shifts can be traced back to compositional
operations involving the contextual variation of
the SemTypes associated with the valency structure
of each verb.
      </p>
      <p>CROATPAS rests on four key components,
namely:
1) a representative corpus of Croatian;
2) a shallow ontology of SemTypes;
3) a methodology for corpus analysis;
4) adequate corpus tools.</p>
      <p>
        As for the first component, the corpus used to
identify verb patterns is the Croatian Web as
Corpus (hrWac 2.2, RELDI PoS-tagged)
        <xref ref-type="bibr" rid="ref11">(Ljubešić &amp; Erjavec, 2011)</xref>,
containing 1.2 billion types and available on the Sketch Engine
        (<xref ref-type="bibr" rid="ref5">Kilgarriff et al. 2014</xref>).
We chose the Croatian Web as Corpus because the reference
corpus for the Italian TPAS resource is a reduced
version of the Italian Web as Corpus (Baroni &amp;
Kilgarriff, 2006): working with corpora of the same kind
keeps the two resources as comparable as possible.
      </p>
      <p>
        As for the shallow ontology of Semantic Type
labels, CROATPAS is based on the same
hierarchy of 180 SemTypes shared by TPAS and the PDEV project,
which originates from the
Brandeis Shallow Ontology (BSO)
        <xref ref-type="bibr" rid="ref15">(Pustejovsky
et al. 2004)</xref>
        and its initial 65 labels. As pointed
out by Ježek
        (<xref ref-type="bibr" rid="ref5">2014</xref>: 890),
SemTypes “are not
abstract categories but semantic classes
discovered by generalizing over the statistically
relevant list of collocates that fill each position”.
For example, the Croatian lexical set for the
SemType [BEVERAGE] in the context of the verb
pair PITI/POPITI (= TO DRINK, imperfective/perfective)
contains, among others: {vodu = water, kavu =
coffee, koktel = cocktail, vino = wine, čaj = tea,
pivo = beer, limonadu = lemonade}, as shown in
the following pattern string from the resource.
      </p>
      <p>
        The corpus analysis methodology used for both
TPAS and CROATPAS is a lexicographical
methodology called Corpus Pattern Analysis
        <xref ref-type="bibr" rid="ref15">(CPA, Hanks 2004 &amp; 2012; Hanks &amp;
Pustejovsky 2005; Hanks et al. 2015)</xref>,
which is based on the Theory of Norms and Exploitations
        <xref ref-type="bibr" rid="ref15">(TNE, Hanks 2004, 2013)</xref>.
TNE divides word
uses into two main classes: conventional uses
(norms) and deviations from the norms
(exploitations). CPA’s strength is that it
does not try to identify meaning in isolation, but
rather associates it with prototypical contexts,
thus focusing on the norms. The standard CPA
procedure requires:
1) sampling concordances for each verb;
2) identifying its typical patterns – i.e.
senses – while going through the corpus
lines;
3) assigning SemTypes to the argument
slots in each pattern;
4) assigning the sampled concordance lines
to the identified patterns.
This last operation is possible because both the
TPAS and CROATPAS editors are linked to
their respective language-specific corpora
through the Sketch Engine
        (<xref ref-type="bibr" rid="ref5">Kilgarriff et al. 2014</xref>),
which once again proves to be a well-suited
tool for lexicographic work.
      </p>
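      <p>To make steps 3 and 4 of the procedure more concrete, the following Python sketch shows one way a pattern with typed, case-marked argument slots and its assigned concordance lines could be represented. The class names Slot and Pattern and their fields are hypothetical and do not describe the data model of the TPAS or CROATPAS editors; the pattern and the concordance line are taken from the PITI example discussed in this paper.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    """An argument slot annotated with a SemType and a case label."""
    sem_type: str  # e.g. "HUMAN"
    case: str      # e.g. "NOM"

    def render(self):
        # Case markings are written after the bracketed SemType.
        return f"[[{self.sem_type}]{self.case}]"


@dataclass
class Pattern:
    """A verb pattern plus the concordance lines assigned to it (CPA step 4)."""
    verb: str
    subject: Slot
    complements: list
    concordances: list = field(default_factory=list)

    def render(self):
        parts = [self.subject.render(), self.verb]
        parts += [slot.render() for slot in self.complements]
        return " ".join(parts)


drink = Pattern("PIJE", Slot("HUMAN", "NOM"), [Slot("BEVERAGE", "ACC")])
drink.concordances.append("Djeca ne piju kavu.")
print(drink.render())
```
]]></preformat>
      <p>Keeping the concordance lines attached to the pattern object mirrors the idea that each pattern is justified by the corpus evidence assigned to it.</p>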
      <p>
        The resource will be evaluated through inter-annotator agreement (IAA) on
pattern identification for a sub-sample of the
verb inventory, following the methodology
proposed by Cinkova et al.
        (<xref ref-type="bibr" rid="ref4">2012</xref>).
      </p>
    </sec>
    <sec id="sec-4">
      <title>Generative Lexicon Theory</title>
      <p>
        As pointed out by Hanks
        (<xref ref-type="bibr" rid="ref5">2014</xref>: 1), the CPA
methodology relies theoretically on the Theory
of Norms and Exploitations (TNE), which has its
roots in Sinclair’s work, but is also influenced by
Pustejovsky’s Generative Lexicon Theory
        <xref ref-type="bibr" rid="ref16">(1995
&amp; 1998; Pustejovsky &amp; Ježek 2008)</xref>, thus
bridging the gap between corpus linguistics and
semantic theories of the lexicon.
      </p>
      <p>
        In his theory, Pustejovsky tries to account for the
semantic richness of natural language by focusing
on the compositional aspects of lexical
semantics. According to this framework, lexical
meaning is not an intrinsic feature of lexical
items, but is generated by means of their
contextual interaction, following the so-called
principles for strong compositionality. As
outlined in
        <xref ref-type="bibr" rid="ref21">Ježek (2016</xref>
        : 78), these principles
operate at a sub-lexical level targeting specific
aspects of word meaning – such as SemTypes –
and are able to provide different interpretations
for a wide range of lexical phenomena.
      </p>
      <p>The principle of co-composition, for instance,
offers an alternative take on verbal polysemy
with respect to traditional accounts. If we
consider lexical items expressing verb arguments
to be as semantically active and influential as the
verb itself (Pustejovsky 2002: 421), we do not
need to think of verbs as polysemous, but can rather
conceive of their meaning as contextually defined
by the SemTypes of the surrounding arguments.
For instance, if we apply this reasoning to the
Croatian verb pair PITI/POPITI (= TO DRINK,
imperfective/perfective), we can notice how its
meaning changes depending on what is said to be
“drunk”, namely a [BEVERAGE] (1), a [DRUG]
(2) or a {GOAL} (3).</p>
      <p>(1) [[HUMAN]NOM] PIJE [[BEVERAGE]ACC]</p>
      <p>Djeca ne piju kavu.</p>
      <p>Children don’t drink coffee.</p>
      <p>(2) [[HUMAN]NOM] PIJE [[DRUG]ACC]</p>
      <p>Većina ljudi pije antibiotike na svoju ruku.</p>
      <p>Most people take antibiotics on their own initiative.</p>
      <p>(3) [[HUMAN_FOOTBALL PLAYER]NOM] POPIJE {GOL}</p>
      <p>Pavić je popio gol.</p>
      <p>Pavić failed to score a goal.</p>
      <p>
        As for metonymic phenomena, in this framework
they take the name of semantic type coercions
        <xref ref-type="bibr" rid="ref16">(Pustejovsky 2002: 425; Pustejovsky &amp; Ježek
2008, Ježek &amp; Quochi 2010)</xref>
        . Unlike
co-composition instances, coercions do not cause
shifts in verb meaning, but rather operate
semantic type adjustments to the verb’s
selectional requirements within a given pattern.
For instance, when a verb such as POPITI
combines with a Direct Object with the semantic
type [CONTAINER] in a context where it should
select [BEVERAGE], it is instantiating a
metonymic shift which enables us to interpret the
given [CONTAINER] as the [BEVERAGE] itself,
as in example (4).
      </p>
      <p>(4) [[HUMAN]NOM] POPIJE [[CONTAINER]ACC]</p>
      <p>Stipe je popio čašu.</p>
      <p>Stipe drank a glass.</p>
    </sec>
    <sec id="sec-5">
      <title>Croatian-specific challenges</title>
      <p>Being a Slavic language, Croatian displays a
number of language-specific features
which had to be taken into account when setting
up the new editor for CROATPAS: its
case system, the consequent absence of
prepositions where case markings already provide
information on clause roles, and verbal aspect.
The editor we implemented is proving
able to tackle these challenges.</p>
      <p>For instance, example (5), taken
from the verb POSLATI (= TO SEND, perfective),
shows how the addition of case markings as
bottom-right indexes has proven essential to
make the resource user-friendly: had they not
been there, the absence of the preposition “to” in
Croatian would have made Theme and Recipient
morphologically indistinguishable from one
another.</p>
      <p>(5) [[HUMAN]NOM] POŠALJE [[ARTEFACT]ACC] [[HUMAN]DAT]
Marija je poslala pismo gradonačelniku.</p>
      <p>Marija sent a letter TO the mayor.</p>
      <p>
        As far as sentence structure is concerned, as its
name suggests, the Croatian Typed Predicate
Argument Structure resource leans on valency
theory, where no distinction is made between
subject and obligatory complements, since they
are all considered essential verb arguments
        (<xref ref-type="bibr" rid="ref21">Ježek 2016</xref>: 112).
However, the editors of both
TPAS and CROATPAS still rely on traditional
clause-role labels for the underlying syntactic
annotation, thus distinguishing subjects from
objects and other obligatory complements.
      </p>
      <p>
        Traditional Croatian grammar also distinguishes
between clause roles, but the classification is
heavily influenced by the Croatian case system
and the use of prepositions. Croatian makes use
of seven morphological cases – nominative,
genitive, dative, accusative, vocative, locative
and instrumental – which go by the name of padeži
        (<xref ref-type="bibr" rid="ref3">Barić et al. 1997</xref>: 101; see footnote 5).
Subjects are usually expressed by the nominative case (6)
(ibidem, 421), apart from some logical subjects
appearing in the dative case (7).
      </p>
      <p>Footnote 5: Please note that, for the purpose of this paper, we limit the
morphological glosses to case labels. However, the
following examples show a number of typological features
worth paying attention to, such as the fact that Croatian is a
pro-drop language, has no articles and has
SVO word order. Here is a list of the abbreviations that we
used: NOM (nominative), GEN (genitive), DAT (dative), ACC
(accusative), LOC (locative), INS (instrumental), REFL
(reflexive particle), Q (question particle).</p>
      <p>(6) Ivan-Ø je simpatičan-Ø</p>
      <p>Ivan-NOM is nice-NOM</p>
      <p>‘Ivan is nice’</p>
      <p>(7) Vrti mi se</p>
      <p>(It) spins I.DAT REFL</p>
      <p>‘I feel dizzy’</p>
      <p>Direct objects (ibidem, 431) are expressed either
by the accusative (8) or by the genitive case (9), when
the context calls for a partitive genitive
(ibidem, 435).</p>
      <p>(8) Irin-a čita knjig-u</p>
      <p>Irina-NOM reads book-ACC</p>
      <p>‘Irina reads a book’</p>
      <p>(9) Hoćeš li kruh-a?</p>
      <p>(you) want Q bread-GEN</p>
      <p>‘Do you want some bread?’</p>
      <p>Indirect objects are expressed by the
genitive (10), dative (11) or instrumental case
(12) (ibidem, 436).</p>
      <p>(10) Bojim se smrt-i</p>
      <p>(I) fear REFL death-GEN</p>
      <p>‘I am afraid of death’</p>
      <p>(11) Veselim se Božić-u</p>
      <p>(I) rejoice REFL Christmas-DAT</p>
      <p>‘I look forward to Christmas’</p>
      <p>(12) Revolver-om je lako rukovati</p>
      <p>Revolver-INS (it) is easy to handle</p>
      <p>‘It is easy to handle a revolver’</p>
      <p>Another distinction made in traditional Croatian
grammar is the one between non-prepositional
and prepositional objects (ibidem, 443): subjects,
direct objects and the above-mentioned indirect
objects all fall within the first category, whereas
objects in the accusative (13) or locative
case (14) that require a preposition
belong to the second.</p>
      <p>(13) Preselit ću se u Amerik-u</p>
      <p>To move (I) will REFL to America-ACC</p>
      <p>‘I am moving to America’</p>
      <p>(14) Živim u Zagreb-u</p>
      <p>(I) live in Zagreb-LOC</p>
      <p>‘I live in Zagreb’</p>
      <p>
        This being said, in order to facilitate future
multilingual linking between resources, an
attempt was made to keep the template of
clause-role components for CROATPAS as close as
possible to its Italian counterpart. Here is a list of
the final clause-role labels used in CROATPAS:
1) SUBJECT – nominative and dative subjects
2) OBJECT – direct objects in the accusative case
and partitive genitives
3) INDIRECT COMPLEMENT – indirect objects in
the genitive, dative or instrumental case and
prepositional objects
4) ADVERBIAL – to be used for those obligatory
complements expressed by adverbs
5) CLAUSAL – for both clausal objects and
subjects (sub-labels further specify which)
6) PREDICATIVE COMPLEMENT – of both object
and subject (sub-labels further specify which)
      </p>
      <p>
        Since both TPAS and CROATPAS are first and
foremost semantic resources, the same verb
pattern can contain different syntactic
realizations. For instance, the corpus
concordances behind the pattern displayed by
example (6) contain sentences where the
SemType [INFORMATION] is assigned to both
Objects in the accusative case and Clausal
Objects, mostly introduced by Croatian
complementizers such as DA, ŠTO (both
equivalents of THAT) or KAKO (HOW).
      </p>
      <p>
        Last but not least, verbal aspect also had to be
taken into account during the setup of
CROATPAS. Aspect is a grammatical category
which applies to verbs only, offering “different
ways of viewing the internal temporal
constituency of a situation” (Comrie 1976: 3).
Imperfective verbs report on actions while they
are being carried out, whereas perfective
verbs focus on the completion of
such actions. In some languages, aspect can be
expressed through the choice of tense (in Italian,
imperfetto vs. passato remoto or passato
prossimo) or by means of periphrases (in
English, the -ing form). On the other hand,
Slavic languages such as Croatian present a set
of prefixes and suffixes that are able to create
so-called aspectual pairs or vidski parnjaci from one
of the two forms
        (<xref ref-type="bibr" rid="ref3">Barić et al. 1997</xref>: 226):
to read: ČITATI – PROČITATI (imperfective/perfective)
to write: PISATI – NAPISATI (imperfective/perfective)
to announce: OBJAVITI – OBJAVLJIVATI (perfective/imperfective)
      </p>
      <p>
        For each aspectual pair, patterns were extracted
keeping the perfective and imperfective variants
separate in the resource, as if they were two
different verbs. Thus, by comparing the pattern
inventories of the two aspects in each pair, we
are able to evaluate to what extent aspectual
differences influence verb meaning.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Related works</title>
      <p>
        As already mentioned, CROATPAS is
the sister project of the TPAS resource for Italian
        (<xref ref-type="bibr" rid="ref5">Ježek et al. 2014</xref>).
Both resources follow the
CPA methodology (see § 2), which is also
applied in the Pattern Dictionary of English
Verbs (PDEV, Hanks &amp; Pustejovsky 2005;
        <xref ref-type="bibr" rid="ref5">El Maarouf et al. 2014</xref>) and in its Spanish
counterpart (PDSV; see footnote 6).
      </p>
      <p>
        Existing reference dictionaries for Croatian are
the e-Glava online valency dictionary of
Croatian verbs
        (<xref ref-type="bibr" rid="ref19">Birtić et al. 2017</xref>; see footnote 7)
and the Croatian Valence Lexicon of Verbs
(CROVALLEX, Mikelić Preradović et al.
2009; see footnote 8). Unlike CROATPAS, e-Glava focuses
only on 57 psychological verbs, whose meanings
have been selected from pre-existing dictionaries
and linked to valency patterns, which have been
manually extracted from various Croatian
corpora. Each argument in e-Glava is described
on a morphological, syntactic and semantic level.
As for morphology, the resource takes into
account cases, prepositions and sentential
realisations such as the complementizers ŠTO,
DA, KAKO etc. Ten complement classes are
specified at a syntactic level, namely Nominative
Complement, Genitive Complement, Dative
Complement, Accusative Complement,
Instrumental Complement, Prepositional
Complement, Adverbial Complement,
Predicative Complement, Infinitive Complement
and Sentential Complement
        (<xref ref-type="bibr" rid="ref19">Birtić et al. 2017</xref>: 45).
On a semantic level, the resource takes into
account semantic role labelling (Agent, Patient,
etc.), but has not yet introduced any
hierarchically organised tagset of SemTypes as
CROATPAS does.
      </p>
      <p>
        Another important lexicographic reference work
for Croatian is CROVALLEX
(Mikelić Preradović et al. 2009), the first project aiming at
building a lexicon of valence frames for Croatian
verbs. Its syntactic-semantic classes are taken
from VerbNet (Kipper-Schuler 2005), which is
based on Levin’s verb classes
        (<xref ref-type="bibr" rid="ref9">1993</xref>).
Once again, morphological information such as case
markings and prepositions is displayed, as well
as semantic roles, but there is no mention of
SemTypes. Overall, as a semantic resource,
CROATPAS is complementary to existing
resources that focus primarily on the
morphosyntactic layer.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Multilingual pattern linking and computer-assisted language learning</title>
      <p>
        As pointed out by Baisa et al. (2016b),
monolingual CPA-based dictionaries offer a
unique chance to create multilingual resources by
linking corresponding patterns, since they have
been created following the same methodology.
An early attempt at bilingual pattern linking was
carried out by Popescu &amp; Ježek
        (<xref ref-type="bibr" rid="ref12">2013</xref>), who
aligned CPA patterns of English and Italian
using examples from the parallel corpus RTE3.
Translation pairs were automatically extracted
from the corpus and assigned to the
corresponding patterns in the source and target
language. The study was aimed at testing
whether pattern-based translation is more likely
to preserve meaning than Google translations,
which proved to be the case. More recently,
Baisa et al.
        (<xref ref-type="bibr" rid="ref1 ref2">2016a &amp; 2016b</xref>)
carried out further
studies aimed at linking verb patterns from
PDEV and its Spanish counterpart (PDSV) via
their shared semantic types following both
manual procedures and heuristic-based
algorithms. Following Baisa, Vonšovský (2016)
worked on the automatic linking of PDEV and
VerbaLex (Hlavácková 2008), a verb valency
lexicon for Czech.
      </p>
      <p>Footnote 6: PDSV is being compiled at the Pontifical Catholic
University of Valparaíso (Chile) and is available online at:
http://www.verbario.com (last visited on July 12th 2019).
The project is coordinated by Irene Renau.</p>
      <p>Footnote 7: http://valencije.ihjj.hr/page/sto-je-e-glava/1/ (last visited
on July 12th 2019).</p>
      <p>Footnote 8: http://theta.ffzg.hr/crovallex/data/html/generated/alphabet/index.html
(last visited on July 12th 2019).</p>
      <p>Starting in September 2019, an attempt is being
made to cross-linguistically align a sample of 50
verb entries from CROATPAS with their Italian
and English counterparts in TPAS and PDEV.
We are interested in developing a flexible,
semi-automatic, Italian-driven procedure able to
disambiguate and link verb patterns across
languages by matching their overlapping
semantic contexts.</p>
      <p>Perfect matches are already clearly foreseeable
for verb patterns such as the ones in Figure 2,
where Italian, Croatian and English all encode
the meaning of “drinking a certain amount of
alcoholic beverages” using the SemType
[HUMAN] associated with the language-specific
equivalent of TO DRINK.</p>
      <p>[Figure 2: the corresponding verb patterns in CROATPAS, T-PAS and PDEV]</p>
      <p>
        In order to also link verb patterns
which are not a perfect match, we are developing
an algorithm able to recognize pattern similarity
by also taking into account hypernym/hyponym
relations between SemTypes. Figure 3 provides a
fitting example, which shows how different
annotation choices can result in the lumping or
separation of semantically connected patterns
containing hierarchically related SemTypes, such
as [ANIMATE] &gt; [HUMAN] &amp; [ANIMAL] or
[BEVERAGE] &gt; [WATER].
      </p>
      <p>
        On the other hand, CROATPAS also has the
potential to become an interesting tool for
learners and teachers of Croatian as an L2 in
computer-assisted language learning (CALL),
especially if combined with a user-friendly
SKELL-inspired interface
        <xref ref-type="bibr" rid="ref8">(Kilgarriff et al.
2015)</xref>.
      </p>
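      <p>Returning to pattern linking, the idea of recognising non-perfect matches through hypernym/hyponym relations between SemTypes can be illustrated with a small Python sketch. It is not the algorithm under development: the PARENT table is a toy fragment built only from the SemType relations named in this section, and the function names and the three-way outcome (exact, related, none) are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
# Toy fragment of a SemType hierarchy, child -> parent (illustrative only).
PARENT = {
    "HUMAN": "ANIMATE",
    "ANIMAL": "ANIMATE",
    "WATER": "BEVERAGE",
}


def ancestors(sem_type):
    """Return the SemType followed by all of its hypernyms, nearest first."""
    chain = [sem_type]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def slot_match(a, b):
    """'exact' if identical, 'related' if one is a hypernym of the other, else 'none'."""
    if a == b:
        return "exact"
    if a in ancestors(b) or b in ancestors(a):
        return "related"
    return "none"


def pattern_match(slots_a, slots_b):
    """Compare two patterns slot by slot; each pattern is a list of SemType labels."""
    if len(slots_a) != len(slots_b):
        return "none"
    results = [slot_match(a, b) for a, b in zip(slots_a, slots_b)]
    if "none" in results:
        return "none"
    return "exact" if all(r == "exact" for r in results) else "related"


print(pattern_match(["HUMAN", "BEVERAGE"], ["ANIMATE", "WATER"]))
```
]]></preformat>
      <p>In this sketch sibling SemTypes such as [HUMAN] and [ANIMAL] do not match each other; whether they should count as related through their shared hypernym is a design choice the sketch leaves open.</p>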
      <p>
        As its creators put it, SKELL (Sketch Engine for
Language Learners) is “a stripped-down,
non-scary version of Sketch Engine”, which grants
learners access to:
1) a summary of a word’s grammatical and
collocational behaviour (Word Sketch);
2) prototypical example sentences (Good
Dictionary Examples) chosen by the
GDEX algorithm
        <xref ref-type="bibr" rid="ref6">(Kilgarriff et al. 2008)</xref>;
3) word clouds of similar words, i.e. words
that share most collocations with the
headword;
4) corpus concordance lines.
In the case of CROATPAS, displaying Good
Dictionary Examples for each of the identified
patterns could be a good way to provide real-life
context, and optional access to more concordance
lines could be given to advanced learners. Word
clouds displaying the lexical sets populating the
SemTypes might also offer an eye-catching
opportunity for computer-assisted vocabulary
lessons.
      </p>
      <p>
        At the moment, a resource which is probing
these waters is Woordcombinaties: a Dutch tool
aimed at combining access to collocations,
idioms and valency patterns for
computer-assisted second language learning and teaching
        (<xref ref-type="bibr" rid="ref17">Colman &amp; Tiberius 2018</xref>).
This Dutch
Collocation, Idiom and Pattern Dictionary
focuses on a selection of mid-frequency lexical
verbs and aims at offering immediate access to
usage patterns from a toolbar, whose search
options are: verbs in example sentences, Word
Sketches with collocates, pattern-meaning pairs
and pragmatic-oriented conversational routines
(ibidem: 239). As underlined by the authors,
tailor-made examples and Word Sketches can
provide a good first impression of an unknown
verb, while pattern-meaning pairs are intended for
“advanced learners trying to find target
collocates or seeking confirmation of their
intuitions regarding a collocation” (ibidem: 240).
      </p>
    </sec>
    <sec id="sec-8">
      <title>Conclusion</title>
      <p>
        In this paper, we introduced CROATPAS, a
corpus-based digital collection of verb valency
structures with the addition of semantic type
specifications (SemTypes) to each argument slot.
The resource relies on Pustejovsky’s Generative
Lexicon theory
        <xref ref-type="bibr" rid="ref16">(1995, 1998; Pustejovsky &amp;
Ježek 2008)</xref>
        (Section 3) and is made up of four
key components, namely: 1) a representative
corpus of contemporary Croatian (hrWac 2.2,
RELDI PoS-tagged); 2) a shallow ontology of
SemTypes; 3) a methodology for corpus analysis, namely Corpus Pattern
Analysis
        <xref ref-type="bibr" rid="ref15">(CPA, Hanks 2004 &amp; 2013)</xref>;
and 4) adequate corpus tools (Sketch Engine). We
discussed the Croatian-specific challenges we
faced while building the editor in Section 4, and
provided an overview of the existing related
works in Section 5. In Section 6, we anticipated
the future multilingual linking of verb patterns
from CROATPAS, TPAS and PDEV, which
could provide a resource to be exploited in NLP,
automatic translation and both theoretical and
applied cross-linguistic studies. Moreover,
CROATPAS could become an interesting tool
for computer-assisted language learning (CALL).
      </p>
      <p>
        Baroni &amp; A. Kilgarriff (2006). Large
Linguistically-Processed Web Corpora for Multiple
Languages. In: Proceedings of the XI Conference
of the European Chapter of the Association for
Computational Linguistics (EACL). Trento, Italy.
M. Birtić, I. Br
        <xref ref-type="bibr" rid="ref19">ač, S. Runjaić (2017</xref>
        ). The Main
Features of the e-Glava Online Valency Dictionary.
In: Proceedings of the 5th eLex conference
Electronic lexicography in the 21st century.
Leiden, Netherlands.
B. Comrie (1976). Aspect: An introduction to the
study of verbal aspect and related problems.
Cambridge: Cambridge University Press (6th
edition).
      </p>
      <p>T. De Mauro (2016). Il Nuovo Vocabolario di Base
della lingua italiana. Available at the website:
https://www.dropbox.com/s/mkcyo53m15ktbnp/nu
ovovocabolariodibase.pdf?dl=0 (last visited on
July 12th 2019).</p>
      <p>
        I. El M
        <xref ref-type="bibr" rid="ref7">aarouf, J. Bradbury, P. Hanks (2014</xref>
        ).
PDEVlemon: a Linked Data implementation of the
Pattern Dictionary of English Verbs based on the
Lemon model. In: Proceedings of the 3rd
Workshop on Linked Data in Linguistics (LDL):
Multilingual Knowledge Resources and Natural
Language Processing at the Ninth International
Conference on Language Resources and
Evaluation (LREC’14). Reykjavik, Iceland.
      </p>
      <p>P. Hanks (2004). Corpus Pattern Analysis. In:
Proceedings of the XI Euralex International
Congress. Lorient, France.</p>
      <p>P. Hanks (2012). How People use words to make
Meanings. Semantic Types meet Valencies. In: A.
Bulton and J. Thomas (eds.) Input, Process and
Product: Developments in Teaching and Language
Corpora. Brno: Masaryk University Press.</p>
      <p>P. Hanks (2013). Lexical Analysis: Norms and
Exploitations. Cambridge: The MIT Press.</p>
      <p>P. Hanks, E. Ježek, D. Kawahara, O. Popescu (2015).
Corpus Patterns for Semantic Processing. In:
Proceedings of the Tutorials of the 53rd Annual
Meeting of the ACL and the 7th IJCNLP, Beijing,
China.</p>
      <p>P. Hanks &amp; J. Pustejovsky (2005). A Pattern
Dictionary for Natural Language Processing. In:
Revue française de linguistique appliquée, 10 (2),
pp. 63-82.</p>
      <p>D. Hlaváčková (2008). Databáze slovesných
valenčních rámců VerbaLex (Database of Verb
Valency Frames VerbaLex), PhD Thesis, Masaryk
University, Brno, Czech Republic.</p>
      <p>E. Ježek (2016). The Lexicon: An Introduction.
Oxford: Oxford University Press.</p>
      <p>E. Ježek &amp; V. Quochi (2010). Capturing Coercions in
Texts: a First Annotation Exercise. In: Proceedings
of the VII conference on International Language
Resources and Evaluation (LREC). Valletta, Malta.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Baisa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Može</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Renau</surname>
          </string-name>
          (
          <year>2016a</year>
          ).
          <article-title>Linking Verb Pattern Dictionaries of English and Spanish</article-title>
          .
          <source>Presented at the 5th Workshop on Linked Data in Linguistics: Managing, Building and Using Linked Language Resources</source>
          . Portorož, Slovenia.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Baisa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Može</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Renau</surname>
          </string-name>
          (
          <year>2016b</year>
          ).
          <article-title>Multilingual CPA: Linking Verb Patterns Across Languages</article-title>
          .
          <source>In: Proceedings of the XVII Euralex International Congress</source>
          . Tbilisi, Georgia.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>E.</given-names>
            <surname>Barić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lončarić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Malić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pavešić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Peti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zenčević</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Znika</surname>
          </string-name>
          (
          <year>1997</year>
          ).
          <article-title>Hrvatska gramatika</article-title>
          . Zagreb: Školska knjiga.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Cinkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Holub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rambousek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Smejkalova</surname>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>A database of semantic clusters of verb usages</article-title>
          .
          <source>In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC '12)</source>
          . Istanbul, Turkey.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>E.</given-names>
            <surname>Ježek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Feltracco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bianchini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Popescu</surname>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>T-PAS: A resource of Typed Predicate Argument Structures for linguistic analysis and semantic processing</article-title>
          .
          <source>In: Proceedings of the Ninth conference on International Language Resources and Evaluation (LREC)</source>
          . Reykjavik, Iceland.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Kilgarriff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Husák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McAdam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rundell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rychlý</surname>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>GDEX: automatically finding good dictionary examples in a corpus</article-title>
          .
          <source>In: Proceedings of the 13th EURALEX International Congress</source>
          (pp.
          <fpage>425</fpage>
          -
          <lpage>432</lpage>
          ). Barcelona, Spain.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Kilgarriff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Baisa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bušta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakubíček</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovář</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Michelfeit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rychlý</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Suchomel</surname>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>The Sketch Engine: ten years on</article-title>
          .
          <source>In: Lexicography</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>7</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Kilgarriff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Marcowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thomas</surname>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Corpora and Language Learning with the Sketch Engine and SKELL</article-title>
          .
          <source>In: Revue française de linguistique appliquée</source>
          ,
          <volume>20</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>61</fpage>
          -
          <lpage>80</lpage>
          . Kipper-Schuler (
          <year>2005</year>
          ).
          <article-title>VerbNet: A broad</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>B.</given-names>
            <surname>Levin</surname>
          </string-name>
          (
          <year>1993</year>
          ).
          <article-title>English Verb Classes and Alternations: A Preliminary Investigation</article-title>
          . Chicago: The University of Chicago Press.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>N.</given-names>
            <surname>Mikelić Preradović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Boras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kišiček</surname>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>CROVALLEX: Croatian Verb Valence Lexicon</article-title>
          .
          <source>In: Proceedings of the 31st International Conference on Information Technology Interfaces</source>
          . Zagreb, Croatia.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>N.</given-names>
            <surname>Ljubešić</surname>
          </string-name>
          &amp;
          <string-name>
            <given-names>T.</given-names>
            <surname>Erjavec</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>hrWaC and slWac: Compiling Web Corpora for Croatian and Slovene</article-title>
          .
          <source>In: Text, Speech and Dialogue, Lecture Notes in Computer Science</source>
          , Springer.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>O.</given-names>
            <surname>Popescu</surname>
          </string-name>
          &amp;
          <string-name>
            <given-names>E.</given-names>
            <surname>Ježek</surname>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Verbal Phrase Translation</article-title>
          .
          <source>Tralogy, Session 2 - Sense and Machine</source>
          . URL: http://lodel.irevues.inist.fr/tralogy/index.php?id=216&amp;format=print (last visited on July 12th 2019).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Pustejovsky</surname>
          </string-name>
          (
          <year>1995</year>
          ).
          <article-title>The Generative Lexicon</article-title>
          . Cambridge: The MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Pustejovsky</surname>
          </string-name>
          (
          <year>1998</year>
          ).
          <article-title>The semantics of lexical underspecification</article-title>
          .
          <source>In: Folia Linguistica</source>
          <volume>32</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Pustejovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hanks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rumshisky</surname>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Automated Induction of Sense in Context</article-title>
          .
          <source>In: Proceedings of the 20th International Conference on Computational Linguistics (COLING)</source>
          . Geneva, Switzerland.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Pustejovsky</surname>
          </string-name>
          &amp;
          <string-name>
            <given-names>E.</given-names>
            <surname>Ježek</surname>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Semantic Coercion in Language: Beyond Distributional Analysis</article-title>
          .
          <source>In: Italian Journal of Linguistics</source>
          , vol.
          <volume>20</volume>
          , pp.
          <fpage>181</fpage>
          -
          <lpage>214</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Rehm</surname>
          </string-name>
          &amp;
          <string-name>
            <given-names>S.</given-names>
            <surname>Hegele</surname>
          </string-name>
          (
          <year>2018</year>
          ),
          <article-title>Language Technology for Multilingual Europe: An Analysis of a Large-Scale Survey regarding Challenges, Demands, Gaps and Needs</article-title>
          .
          <source>In: Proceedings of the XI Language Resources and Evaluation Conference (LREC 2018)</source>
          . Miyazaki, Japan.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Rehm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Dagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Goetcherian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. U.</given-names>
            <surname>Dogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mermer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Váradi</surname>
          </string-name>
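          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kirchmeier-Andersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Stickel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Prys Jones</surname>
          </string-name>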
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oeter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gramstad</surname>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>An Update and Extension of the META-NET Study “Europe's Languages in the Digital Age”</article-title>
          .
          <source>In: Proceedings of the Workshop on Collaboration and Computing for Under-Resourced Languages in the Linked Open Data Era (CCURL 2014)</source>
          . Reykjavik, Iceland.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Špikić</surname>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Croato compatto: dizionario croato/italiano e italiano/croato</article-title>
          . Bologna: Zanichelli.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Tadić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Brozović-Rončević</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kapetanović</surname>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Hrvatski Jezik u Digitalnom Dobu - The Croatian Language in the Digital Age</article-title>
          . In: META-NET White Paper Series,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rehm</surname>
          </string-name>
          &amp;
          <string-name>
            <given-names>H.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          (eds.), Springer: Heidelberg, New York, Dordrecht, London.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Vonšovsky</surname>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Automatic Linking of the Valency Lexicons PDEV and VerbaLex</article-title>
          (MA Thesis). URL: http://is.muni.cz/th/359500/fi_m/AutomaticLinking.pdf (last visited on July 12th 2019).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>