<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Aggregates and Variants in two Czech morphological approaches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hlaváčová Jaroslava</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charles University, Faculty of Mathematics and Physics, Prague Institute of Formal and Applied Linguistics</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Several morphological dictionaries exist for Czech. They differ only in how they handle complicated morphological features. Various attempts have been made to unify their approaches, but only some of them have been implemented. This paper deals with several such features and compares the solutions adopted in two projects: the preparation of the new edition of PDT (Prague Dependency Treebank [1]) and NovaMorf [8]. The features presented are aggregates (wordforms without a clear part of speech, e.g. užs, oč, naň) and variants, both inflectional (several wordforms for a particular combination of lemma and morphological tag) and global (mainly orthographic variants manifested in all wordforms of a paradigm).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Several “classical” parts of speech cover
almost the whole vocabulary of any language. Some words,
however, are difficult to assign to an appropriate part of
speech (POS). Over time, such words have been moved
from one POS to another, and the number of POS classes
has itself changed.</p>
      <p>Some words are difficult to place among the POS classes
because they could belong to more than one of them.
This results in artificial homonymy: the same word is
included in several POS classes and, due to formal
requirements, its occurrences have to be treated as different words.
There have been many attempts to resolve this situation, but no
generally accepted solution has been found.</p>
      <p>
        Two recent projects dealing with Czech
morphology have tried to solve at least some of the issues connected with POS
classes: the new edition of PDT (Prague
Dependency Treebank [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) and NovaMorf [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Among
other things, they introduced new POS classes, namely
Foreign Word, (Affixal) Segment and Aggregate. The
first two were implemented in the same way in
both projects, while the last one, the Aggregate, is
implemented differently. The second issue addressed by
both projects, again with different implementations,
concerns variants of words.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Aggregates</title>
      <p>
        According to [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], an aggregate is a wordform that is
created by combining two or more wordforms (components
of the aggregate) into one and cannot be simply assigned
any part of speech.
      </p>
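      <p>We present several examples together with their
explanations<sup>1</sup>:</p>
      <list list-type="order">
        <list-item><p>viděls = viděl jsi (you saw)</p></list-item>
        <list-item><p>studentas = studenta jsi (e.g. Toho studentas neviděl? .. You have not seen that student?)</p></list-item>
        <list-item><p>užs = už jsi (you already were)</p></list-item>
        <list-item><p>doň = do něj (into it)</p></list-item>
        <list-item><p>nač = na co (on what)</p></list-item>
        <list-item><p>načs = na co jsi (on what you were)</p></list-item>
      </list>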
      <p>Apart from the first example, where both
components are verbs, every aggregate contains at least two different parts of
speech. This makes it
difficult to assign aggregates to one of the traditional POS classes,
which is why a new part of speech, Aggregate, was introduced into
the system of Czech morphology.</p>
      <p>The aggregates can be divided into three groups
(aggregate types):</p>
      <list list-type="order">
        <list-item><p>Verbal aggregates: aggregates whose second component stands for the wordform jsi (you are). The first component can be of almost any POS (see examples 1 to 3 above). In NovaMorf, the conditional conjunctions aby and kdyby, together with all their forms (abych, kdybyste, ...), are also considered verbal aggregates, which is not the case in the PDT project.</p></list-item>
        <list-item><p>Pronominal aggregates: aggregates whose second component is the pronoun co (what) or něj (lemma on = he). The first component is a preposition (see examples 4 and 5 above).</p></list-item>
        <list-item><p>Combined aggregates: aggregates whose first component is a preposition, whose second component is the pronoun co or něj (as in pronominal aggregates), and whose third component is the auxiliary verb jsi (as in verbal aggregates). This type combines the previous two, verbal and pronominal (see example 6 above).</p></list-item>
      </list>
      <p><sup>1</sup> The English translations might seem meaningless without a context.</p>
      <p>The previous list contains all types of Czech aggregates.
Having divided aggregates into types, let us look at
how to annotate them morphologically.</p>
      <p>Every wordform should be morphologically described
by a unique pair consisting of a lemma and a
morphological tag. For aggregates, however, there are different ways
of doing this. In the following
subsections we introduce two of them, one from each of the two
projects mentioned above.</p>
      <sec id="sec-2-1">
        <title>NovaMorf: Multiple tag and multiple lemma for aggregates</title>
        <p>In the NovaMorf project, the lemma of an aggregate is the
sequence of lemmas of all its segments. We call such a
sequence a multiple lemma of aggregates.<sup>2</sup></p>
        <p>As follows from the introductory text above, there are
at most three different lemmas in the multiple lemma of
a Czech aggregate. This may not be the case for other
languages, especially agglutinative ones.</p>
        <p>
          There is only a limited number (a closed set) of
pronominal and combined aggregates in Czech, which is why it is
possible to describe those types of aggregates by means
of a single morphological tag. The same can be
achieved for verbal aggregates, even though they are
extremely productive. The description of such an annotation
is included in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>However, that solution met with considerable opposition,
so the researchers adopted an alternative
(equivalent) solution that is possibly more intuitive. They
introduced the concept of a multiple tag, parallel to
the multiple lemma. The multiple tag is the sequence of
the tags describing all segments of an aggregate. The
sequence of tags is ordered in the same way as the sequence
of lemmas in the multiple lemma of the aggregate.</p>
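        <p>As a minimal sketch of the parallel structure just described, the multiple lemma and the multiple tag can be held as two equally long, equally ordered sequences. The class name <monospace>AggregateAnalysis</monospace> and the placeholder tag strings are assumptions made for illustration only; they are not NovaMorf's actual data format or tag values.</p>
        <preformat>
```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AggregateAnalysis:
    """Hypothetical container for one analysed aggregate."""
    wordform: str
    lemmas: Tuple[str, ...]  # multiple lemma: one lemma per segment, in order
    tags: Tuple[str, ...]    # multiple tag: one tag per segment, same order

    def __post_init__(self) -> None:
        # The two sequences are parallel, so their lengths have to agree.
        if len(self.lemmas) != len(self.tags):
            raise ValueError("multiple lemma and multiple tag differ in length")

    def segments(self) -> Tuple[Tuple[str, str], ...]:
        """Pair each segment lemma with the tag at the same position."""
        return tuple(zip(self.lemmas, self.tags))


# Placeholder tags only; real tags would come from the tagset in use.
don_analysis = AggregateAnalysis(
    "doň", ("do", "on"), ("TAG_PREPOSITION", "TAG_PRONOUN")
)
```
        </preformat>
        <p>Keeping both sequences in one record makes the ordering constraint explicit: the tag at a given position always describes the segment whose lemma stands at the same position.</p>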
        <p>The morphological descriptions of the examples
presented above are given in Table 1.<sup>3</sup></p>
      </sec>
      <sec id="sec-2-2">
        <title>PDT: Tagging aggregates within the current framework of Prague tagging system</title>
        <p>
          The PDT project adopted a different solution (see also [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]).
The lemma of pronominal and combined aggregates is the
lemma of the pronoun: either on or co,
or possibly copak. The lemma of verbal aggregates is the
lemma of their first component, which can be, as
mentioned above, almost any word of almost any POS. The
morphological tag of aggregates is enriched with a new
category, the type of aggregate, which can have the following
values:
        </p>
        <list list-type="bullet">
          <list-item><p>s for verbal aggregates;</p></list-item>
          <list-item><p>the initial letter of the preposition in lower case for pronominal aggregates;</p></list-item>
          <list-item><p>the initial letter of the preposition in upper case for combined aggregates.</p></list-item>
        </list>
        <p><sup>2</sup> There is another use of the concept of multiple lemma, for the
description of variants. It will be introduced later in this paper. The main
difference between the two is that the multiple lemma of aggregates is a
sequence, while the multiple lemma of variants can be a set.</p>
        <p>
          <sup>3</sup> All the tables in this paper are presented with the morphological
tags used in the Prague dictionary MorfFlex [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>These values are incorporated into the morphological
tag. Within the
Prague morphological system, they are placed at the 14th position of the
positional tag.</p>
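        <p>The rule for the aggregate-type value can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the tag is assumed to be a 15-character positional string, and the sample tag used with it is a row of placeholder characters rather than a real dictionary tag.</p>
        <preformat>
```python
TAG_LENGTH = 15          # assumed length of the Prague positional tag
AGGREGATE_POSITION = 14  # 1-based position named in the text


def aggregate_type_value(aggregate_type: str, preposition: str = "") -> str:
    """Return the value stored for the given aggregate type."""
    if aggregate_type == "verbal":
        return "s"
    if aggregate_type in ("pronominal", "combined"):
        if not preposition:
            raise ValueError("a preposition is required for this type")
        initial = preposition[0]
        # Lower case marks pronominal, upper case marks combined aggregates.
        if aggregate_type == "pronominal":
            return initial.lower()
        return initial.upper()
    raise ValueError("unknown aggregate type: " + aggregate_type)


def mark_aggregate(tag: str, value: str) -> str:
    """Write a one-character value into the 14th position of the tag."""
    if len(tag) != TAG_LENGTH or len(value) != 1:
        raise ValueError("expected a 15-character tag and a 1-character value")
    index = AGGREGATE_POSITION - 1
    return tag[:index] + value + tag[index + 1:]
```
        </preformat>
        <p>For instance, calling <monospace>aggregate_type_value("combined", "na")</monospace> for the aggregate načs yields the upper-case initial of the preposition, which <monospace>mark_aggregate</monospace> then writes into the 14th position while leaving every other position untouched.</p>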
        <p>Such a description of aggregates can make use of the
existing frame of the morphological dictionary, without
any need to change the structure of the dictionary. On the
other hand, the solution is less intuitive than the previous
one with multiple lemmas and tags.</p>
        <p>The morphological descriptions of the examples
presented above are given in Table 2.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Variants of words</title>
      <p>Another problem of morphological tagging concerns
different types of variants (sometimes also called mutations).
There are stylistic variants, orthographic ones, historical ones,
etc. Whatever the origin of a variant, it should
be properly tagged as a variant. It makes sense to join all
the variants into a common dictionary record, but this must
be done in a way that does not violate the Golden
rule of morphology.</p>
      <p>
        The Golden rule of morphology (see [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]) says that every
combination of a lemma and a morphological tag should
be represented by a single wordform (if the combination
of the lemma and tag is meaningful, of course). This
rule ensures that, for instance, the two orthographic
variants of the word for lemon, citron and citrón,
do not have the same morphological description, i.e. the
same lemma and the same tag.
      </p>
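      <p>The Golden rule lends itself to a simple mechanical check. The sketch below assumes a dictionary given as (wordform, lemma, tag) triples; the function name and the placeholder tag are hypothetical and serve only to illustrate the rule.</p>
      <preformat>
```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Entry = Tuple[str, str, str]  # (wordform, lemma, tag)


def golden_rule_violations(
    entries: Iterable[Entry],
) -> Dict[Tuple[str, str], Set[str]]:
    """Return every (lemma, tag) pair realised by more than one wordform."""
    forms = defaultdict(set)
    for wordform, lemma, tag in entries:
        forms[(lemma, tag)].add(wordform)
    return {key: found for key, found in forms.items() if len(found) > 1}


# Both variants share one lemma and one (placeholder) tag: a violation.
shared_lemma = [
    ("citron", "citron", "TAG_PLACEHOLDER"),
    ("citrón", "citron", "TAG_PLACEHOLDER"),
]
# Each variant has its own lemma: the rule holds.
distinct_lemmas = [
    ("citron", "citron", "TAG_PLACEHOLDER"),
    ("citrón", "citrón", "TAG_PLACEHOLDER"),
]
```
      </preformat>
      <p>Running the check on <monospace>shared_lemma</monospace> reports the pair with both wordforms, while <monospace>distinct_lemmas</monospace> passes; the latter, however, loses the link between the variants, which is the tension the following sections address.</p>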
      <p>There is a distinction between global variants
(manifested in all wordforms of a paradigm) and inflectional
ones (applying only to some combinations of
morphological values). Until recently, this distinction was not taken
into account consistently: the 15th position of the Prague tag
was used for marking a wordform as a
variant, and both types of variants were marked in this way.</p>
      <p>Inflectional variants are those variants that relate only
to some wordforms of a paradigm, defined by a
particular combination of morphological values for the identical
lemma. An example is the locative singular of the lemma hrad
(castle), which can be either hradu or
hradě.</p>
      <p>Global variants are those variants that relate to all
wordforms of a paradigm, and always in the same way. The
two forms of the word for lemon presented above are an example of global
variants.</p>
      <p>For the treatment of global variants, lemmatization is
very important. In this respect, the morphological
dictionary has been inconsistent. Some global variants shared
the same lemma, which violated the Golden rule of
morphology, because the same morphological tag combined
with that lemma was connected with two different
wordforms. Others were lemmatized as distinct lemmas, which
made it impossible to link the variants. The lemma
representing both (or sometimes more) variants should link
them, but at the same time, the Golden rule of morphology
must not be violated.</p>
      <sec id="sec-3-1">
        <title>Variants in NovaMorf</title>
        <p>
          In NovaMorf, the information about variants is added to
the morphological tag. Global variant and
inflectional variant became new (morphological) categories
with a set of predefined values. The values are based
preferably on the orthographic features of the individual
variants; they are not meant to carry any evaluative
meaning, such as expressive, vulgar or archaic.
Unlike other morphological categories, both variant categories
can take several values at once, which is why a special notation
was proposed to express them.
No matter how the morphological tag is constructed
(Prague style [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], Brno style [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], or other), the traditional
tag is followed by an additional “subtag” for the variants.
The information about variants is marked with the letter G
for global variants and F for inflectional ones, followed by a set
of codes expressing the type(s) of the variant.
        </p>
        <p>All the global variants are then represented by a
common lemma. Here again, the concept of multiple lemma is
used. In the case of variants, its members always have the
same POS: they are the variants themselves.</p>
        <p>Table 3 shows an example of capturing the three global
variants of the lemma thesis in NovaMorf.</p>
        <p>The inflectional variants are tagged similarly; only
the letter F, instead of the leading G, starts the sequence of the
inflectional variants. Naturally, the inflectional variants
need not be represented by a multiple lemma.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Variants in PDT</title>
        <p>In the PDT project, the inflectional variants are marked as
before, at the 15th position of the tag. The list of possible
values was substantially simplified: there are code
numbers for marking literal and standard variants (1–5), and
code numbers for substandard variants (6–9). No
finer distinction (archaic, colloquial, etc.) is marked.</p>
        <p>For annotating global variants, the concept of links,
originally created for derivational relations<sup>4</sup>, was used.
For every set of variants, one of them has to be chosen
as the basic one. All the other global variants are then
linked with that basic variant. A style marker is assigned
to every link, but the set of its possible values is limited.</p>
        <p>The selection of the basic variant is not crucial, because
the set of links to the other variants joins them all, so that
they can all be reached at once. However, it is reasonable
always to choose the most common variant (in whatever
sense).</p>
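        <p>The linking scheme can be pictured with a small sketch. Everything in it is an assumption made for illustration: the class name, the method names, the style marker value and the choice of citron as the basic variant are hypothetical and do not reproduce the actual PDT dictionary format.</p>
        <preformat>
```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VariantGroup:
    """Hypothetical record: one basic variant plus links from the others."""
    basic: str
    # Maps each linked variant lemma to the style marker on its link.
    links: Dict[str, str] = field(default_factory=dict)

    def link(self, variant: str, style: str) -> None:
        """Link another global variant to the basic one."""
        self.links[variant] = style

    def all_variants(self) -> List[str]:
        """Basic variant first, then linked variants in insertion order."""
        return [self.basic] + list(self.links)

    def group_of(self, lemma: str) -> List[str]:
        """Reach the whole group starting from any member lemma."""
        if lemma == self.basic or lemma in self.links:
            return self.all_variants()
        raise KeyError(lemma)


# The basic variant is picked arbitrarily here; the marker is a placeholder.
lemon = VariantGroup("citron")
lemon.link("citrón", "STYLE_PLACEHOLDER")
```
        </preformat>
        <p>Each variant keeps its own lemma, so the Golden rule is not violated, while <monospace>group_of</monospace> shows how the links still make the whole set reachable from any of its members.</p>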
        <p>Table 4 shows an example of capturing the three
global variants of the lemma thesis in the PDT project.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Summary</title>
      <p>We have presented and compared solutions to two
problematic features occurring in Czech morphology, namely
aggregates and variants. They were adopted by two
different projects, PDT and NovaMorf, that are currently being
finalized. Both projects are based on the same ideas, but
their implementations differ.</p>
      <p><sup>4</sup> The link connects the lemma under consideration to its derivational
origin.</p>
      <p>The NovaMorf project proposes to change some of the
basics of the Prague tagging system. It intends to implement
the concept of multiple lemma and use it for the description
of aggregates (a sequence of segments of the aggregate) as
well as for variants (a set of lemma variants). The second
change is the creation of a subtag for marking variants (global
and inflectional ones).</p>
      <p>The PDT project is more traditional and does not
introduce special formats. This is also why the changes
described in this paper have so far been implemented only
within this project, in the new version of the morphological
dictionary (which has not been publicly released yet).</p>
      <p>The treatment of aggregates is equivalent in both
projects. In NovaMorf, a special category (Type of
Aggregate) was added for the description of aggregates; its value
becomes a new part of the morphological tag. In PDT,
the information about the type of the aggregate is
incorporated into the existing positional tag. All the
information about the aggregate, its type and its segments,
is nevertheless present in both solutions. NovaMorf probably
treats it more transparently: the multiple lemma lists all the
lemmas of the segments, while the mark within the
morphological tag (adopted by PDT) assumes that users
will derive the information about the segments from the
mark, which is less straightforward.</p>
      <p>The treatment of variants is not equivalent in the two
projects. NovaMorf is again more transparent: the
concepts of multiple lemma and multiple tag relieve
corpus users in particular of having to consider multiple
possibilities, because the dictionary itself knows them
all. Every variant is lemmatized by the same set of variant
lemmas, as shown in the example presented in Table 3.
Individual lemmas (and wordforms) are then distinguished
by the variant subtag, which uses a new set of values to mark
the variants.</p>
      <p>PDT, on the other hand, selects one variant as the basic
one. Contrary to previous attempts, there is no predefined
rule for which variant to select. However, the authors insisted
on preserving at least some information about the stylistic
features of the individual variants. The lemma of each global
variant is the variant itself.</p>
      <p>If the two solutions ever needed to be unified in the
future, the lemma variants could easily be taken from
the dictionary and put together to form the set of a
multiple lemma. The only thing that would need to be added
is the type of the variant, according to the values prepared
for NovaMorf. The opposite conversion, from multiple
lemma to lemma variants, would be even easier: each
member of the set representing the multiple lemma would
become an independent lemma. The selection of the
central basic lemma for the links can be, as mentioned before,
arbitrary. As the two approaches are not equivalent, some
manual work would also be needed, namely adding the
information about the style of the variants.</p>
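      <p>The lemma-level part of both conversions can be sketched in a few lines. The function names and the data shapes (a basic lemma with a mapping of linked lemmas to style markers on the PDT side, a set of lemmas on the NovaMorf side) are assumptions for illustration; the sketch deliberately leaves out what has to be added by hand, i.e. the NovaMorf variant types in one direction and the style markers in the other.</p>
      <preformat>
```python
from typing import Dict, FrozenSet, Tuple


def links_to_multiple_lemma(basic: str, links: Dict[str, str]) -> FrozenSet[str]:
    """Collect the basic variant and all linked variants into one set."""
    return frozenset([basic, *links])


def multiple_lemma_to_links(
    multiple_lemma: FrozenSet[str], style: str = "UNKNOWN"
) -> Tuple[str, Dict[str, str]]:
    """Split a multiple lemma into a basic variant and links to it.

    The basic variant is picked arbitrarily (first in sorted order here);
    style markers cannot be derived, so a placeholder is stored instead.
    """
    basic, *others = sorted(multiple_lemma)
    return basic, {lemma: style for lemma in others}
```
      </preformat>
      <p>Converting a linked group to a multiple lemma and back preserves the set of lemmas, but not necessarily the original choice of basic variant or the style markers, which reflects the non-equivalence of the two approaches noted above.</p>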
      <p>The presented solutions could possibly be used for
other languages as well, but no attempt has been made so far to
verify this.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>The research has been supported by the LINDAT/CLARIN
and LINDAT/CLARIAH-CZ projects of the Ministry of
Education, Youth and Sports of the Czech Republic
(LM2015071 and LM2018101).</p>
      <p>[9] Petkevič, V., Hlaváčová, J., Osolsobě, K., Šimandl, J.,
Svášek, M.: Microsyntactic Parts of Speech in NovaMorf, a
New Morphological Annotation of Czech. In: Proceedings of
SLOVKO 2019, to appear.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Bejček</surname>
          </string-name>
          , E.,
          <string-name>
            <surname>Hajičová</surname>
          </string-name>
          , E.,
          <string-name>
            <surname>Hajič</surname>
          </string-name>
          , J. et al.:
          <article-title>Prague Dependency Treebank 3.0, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL)</article-title>
          ,
          <source>Faculty of Mathematics and Physics</source>
          , Charles University, (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Hajič</surname>
          </string-name>
          , J.:
          <article-title>Disambiguation of Rich Inflection (Computational Morphology of Czech)</article-title>
          .
          <source>Nakladatelství Karolinum</source>
          . (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Hajič</surname>
          </string-name>
          , J., Hlaváčová, J.:
          <article-title>MorfFlex CZ</article-title>
          , LINDAT/CLARIN digital library at Institute of Formal and Applied Linguistics, Charles University in Prague. (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Hlaváčová</surname>
          </string-name>
          , J.:
          <article-title>Formalizace systému české morfologie s ohledem na automatické zpracování českých textů</article-title>
          .
          <source>Disertační práce. Univerzita Karlova</source>
          . (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Hlaváčová</surname>
          </string-name>
          , J.:
          <article-title>Golden Rule of Morphology and Variants of Wordforms</article-title>
          .
          <source>In: Jazykovedný časopis / Journal of Linguistics</source>
          , Vol.
          <volume>68</volume>
          , No. 2, Slovak Academic Press, Bratislava, Slovakia, ISSN 0021-5597
          , pp.
          <fpage>136</fpage>
          -
          <lpage>144</lpage>
          . (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Hlaváčová</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikulová</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Štěpánková</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hajič</surname>
          </string-name>
          , J.:
          <article-title>Modifications of the Czech morphological dictionary for consistent corpus annotation</article-title>
          .
          <source>In: Proceedings of SLOVKO</source>
          <year>2019</year>
          , to appear.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Osolsobě</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Algoritmický popis české formální morfologie a strojový slovník češtiny</article-title>
          .
          <source>Disertační práce</source>
          . 161 s. (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Osolsobě</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hlaváčová</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Petkevič</surname>
          </string-name>
          , V.,
          <string-name>
            <surname>Svášek</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Šimandl</surname>
          </string-name>
          , J.:
          <article-title>Nová automatická morfologická analýza češtiny</article-title>
          .
          <source>Naše řeč 100, No 4</source>
          , pp
          <fpage>225</fpage>
          -
          <lpage>234</lpage>
          . (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>