=Paper= {{Paper |id=Vol-3878/37_main_long |storemode=property |title=ItGraSyll: A Computational Analysis of Graphical Syllabification and Stress Assignment in Italian |pdfUrl=https://ceur-ws.org/Vol-3878/37_main_long.pdf |volume=Vol-3878 |authors=Liviu Dinu,Ioan-Bogdan Iordache,Simona Georgescu,Alina Maria Cristea,Bianca Guita |dblpUrl=https://dblp.org/rec/conf/clic-it/DinuIGCG24 }} ==ItGraSyll: A Computational Analysis of Graphical Syllabification and Stress Assignment in Italian== https://ceur-ws.org/Vol-3878/37_main_long.pdf
                                ItGraSyll: A Computational Analysis of Graphical
                                Syllabification and Stress Assignment in Italian
                                Liviu P. Dinu (1,3,*), Bogdan Iordache (1,3), Bianca Guita (3), Simona Georgescu (2,3) and Alina Cristea (3)
                                (1) University of Bucharest, Faculty of Mathematics and Computer Science, Romania
                                (2) University of Bucharest, Faculty of Foreign Languages and Literatures, Romania
                                (3) Human Language Technologies Research Center, Bucharest, Romania


                                                 Abstract
                                                 In this paper we build a dataset of Italian graphical syllables (called ItGraSyll). We perform quantitative and qualitative
                                                 analyses on the syllabification and stress assignment in Italian. We propose a machine learning model, based on deep-learning
                                                 techniques, for automatically inferring syllabification and stress assignment. For stress prediction we report 94.45% word-level
                                                 accuracy, and for syllabification we report 98.41% word-level accuracy and 99.82% hyphen-level accuracy.

                                                 Keywords
                                                 syllabification, stress assignment, Italian,



CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
Email: ldinu@fmi.unibuc.ro (L. P. Dinu); iordache.bogdan1998@gmail.com (B. Iordache); bianca.guita@s.unibuc.ro (B. Guita); simona.georgescu@lls.unibuc.ro (S. Georgescu); alinaciobanu20@gmail.com (A. Cristea)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

Word syllabification and syllable analysis are two related issues of great importance in the study of language (written or spoken). These topics have attracted a wide range of researchers, from pure linguists working in phonetics to psycholinguists, computer scientists, speech therapists, etc. Thus, the syllable plays an important role in language learning and acquisition, speech recognition, speech production [1, 2], language similarity [3], text comprehensibility (Kincaid-Flesch formula [4]), speech therapy, poetry analysis [5, 6], etc. Each language has its own way of grouping sounds into syllables and its own rules for dividing words into syllables. Linguistically, the syllable represents "the smallest phonetic trance likely to receive an accent and only one" [7], and the syllabic cut is seen by De Saussure [8] on the border between the implosion and the explosion of the spoken sound: "If in a chain of sounds one goes from implosion to explosion, one obtains a particular effect which is the indication of the boundary of the syllable".

The analysis of the words' syllabic structure also plays an important part in historical linguistics [9], not only in diachronic phonetics and phonology, but also in lexicology. Romance comparative linguistics, in particular, still needs a detailed overview of this aspect, as syllable segmentation and prosody can give a strong account of phonetic changes that have not yet been explained. The “prosodic revolution” [10] from Latin to the Romance languages, which included syncope (the loss of an intermediate syllable) and apocope (the loss of the final syllable) on a large scale, has led to major changes, but their weight differs from one idiom to another: while the Western Romance languages manifest highly evident differences from the Latin phonological and prosodic system, and the Eastern languages are considered to be the most conservative from this point of view, Italian seems to be in between [10].

On the other hand, in Latin, the relation between stress and quantity grew stronger, so that short stressed vowels progressively gained length. It is noteworthy that this situation is best preserved in Italian, and not in the Eastern Romance idioms: in Italian, stress cannot skip a heavy penultimate syllable, and stress cannot fall further back than the antepenultimate syllable, a twofold characteristic feature of the Latin prosodic system. This is why we are taking Italian as a starting point for a larger-scale study, oriented towards all Romance languages. The main difference between Latin and its modern descendants is that Latin stress was quantity-sensitive, leading to the following rule: in polysyllabic words, stress fell on a heavy penultimate (that is, one containing a long vowel), otherwise on the antepenultimate. Due to the collapse of vowel quantity as a distinctive feature in the vocalic system, no Romance language has retained the Latin stress rule as such [10]. Since, from a statistical point of view, the greatest part of the Romance lexicon is represented by words with penultimate stress, a basic automatic mechanism would assign penultimate stress by default, whereas for both final and antepenultimate stress, the machine (as well as, in quite a few cases, non-native speakers) would need further specification. As a consequence of the loss of Latin vowel quantity, Romance stress has ceased to be completely predictable.

That is, partially, why in the majority of the traditional Romance comparative or historical grammars there is no specific section devoted to syllabification [11], or, if there is, it focuses either on general prosodic features [12], or on the evolution of the vowel depending on its presence in an open or closed syllable [13]. The lack of a section dedicated to syllabification is also common in the historical grammars of Italian [14, 11, 15]. In this research we focus only on the written form of words, so we investigate only graphical syllabification and stress. By focusing on graphical syllabification and stress in Italian, we aim to take a step forward towards the complete evaluation of the prosodic changes that took place in the transition from Latin to the Romance languages, and of their influence on Romance phonetics and phonology. A machine-learning model capable of automatically inferring graphical syllabification and stress assignment, together with a database containing the quantitative and qualitative description of syllabification and stress in the Romance languages, could be the first important step in the greater challenge of tracing the similarities and differences between the Romance languages and, more importantly, between Romance and Latin. From a typological point of view, the study of syllabification and stress can shed new light on the universal features that, by defining our phonoarticulatory and phonoacoustic apparatus, have guided the development and change of languages. Given the promising results of this analysis, the present study can establish the basis for research on the syllable in other languages, either linguistically or typologically related to Italian.

One of the studies that address automatic syllabification in Italian belongs to Bigi and Petrone [16], who proposed a tool that performs rule-based automatic segmentation. Adsett and Marchand [17] and Adsett et al. [18] investigated whether data-driven approaches outperform rule-based approaches for a language with a low syllabic complexity, such as Italian. The authors reached the conclusion that even in this case data-driven systems are the more appropriate approach. In terms of machine learning, the tasks of automatically inferring syllable boundaries and predicting stress assignment can be naturally framed as sequence labeling problems. While automatic syllabification has received more attention recently [19, 20, 21, 22, 23, 24], stress placement has not been investigated as much [25].

Given the complexity of syllable applications and word syllabification, the presence of electronic resources dedicated to them becomes a necessity. While native speakers of a language generally do not have great difficulty in spelling words, the same cannot be said of those who learn a foreign language, who often tend to apply their own rules to foreign words, and problems also arise in automatic syllabification. This is because the rules of syllabification are linguistic rules, and they cannot always be easily modeled by the computer when there are no other linguistic factors that those rules take into account. For example, a rule that is present in many languages distinguishes between a vowel and a semivowel, but the computer is not able to easily recognize when the same sign has the value of a vowel and when it is a semivowel. Because of this, rule-based adaptations of syllabification systems [26] generally have higher error rates, and many languages do not have an automatic syllabification system yet (for example, in the Python library, only a few languages have syllabification). The last few decades have brought the first data-driven syllabification systems.

However, in order to build such a system, training data is needed, and there are many cases in which the available data do not cover the whole language, and thus the systems produce different results when the test corpus is changed.

Starting with these remarks, our main contributions are:

     • We propose ItGraSyll (Italian graphical syllables), a dataset of 114,503 Italian words, in orthographic form, containing annotations for their orthographic syllabification and stress placement.¹
     • We perform quantitative and qualitative analyses of the previously built dataset.
     • We analyze stress placement in the context of the Italian syllables.
     • We propose an automatic system of syllabification for Italian words.

¹ The dataset is available for research purposes upon request at: https://nlp.unibuc.ro/resources.html#itgrasyll

2. Quantitative Analysis

In this section we perform various measurements regarding the syllables and stress placement of Italian written words and analyze the results. We perform, on Italian, an investigation similar to previous investigations conducted on Romanian by Dinu and Dinu [27, 28].

2.1. Data

We build a dataset of Italian words starting from the online version of Dizionario italiano De Mauro,² which provides information regarding graphical syllabification and stress placement for the Italian vocabulary. Stressed syllables are marked with an accent on the dominant vowel. In what follows, this dataset will be referred to as ItGraSyll.

² https://dizionario.internazionale.it/

We performed several pre-processing steps. We cleaned the resulting dataset by removing duplicates, prefixes and suffixes in order to keep only the base word;
abbreviations and unwanted punctuation marks such as dots, commas, apostrophes and dashes were also excluded so that we can correctly process each word and its syllable division. The final dataset consists of 114,503 words in orthographic form, having between one and eleven syllables. The distribution of words per number of syllables is shown in Table 1.

    #syll.    #words    Examples
      1          722    ai
      2        5,960    àc-cia
      3       23,286    àb-ba-co
      4       41,253    a-ba-chì-sta
      5       28,357    a-bi-tà-co-lo
      6       10,829    ac-cu-mu-la-zió-ne
      7        3,294    au-ten-ti-fi-ca-zió-ne
      8          650    a-e-ro-mo-del-lì-sti-co
      9          132    bi-o-me-te-o-ro-lo-gì-a
     10           16    in-tel-let-tu-a-li-sti-ca-mén-te
     11            5    ge-ne-ra-ti-vo-tra-sfor-ma-zio-nà-le

Table 1
Number of words per number of syllables.

2.2. Syllables

We identified $\#Type_{syl}$ = 3,730 distinct syllables (type syllables) in Italian. The total number of syllables (token syllables) is $\#Token_{syl}$ = 483,931. So, the average length of a word measured in syllables is $Word_{av-syl}$ = 483,931/114,503 = 4.226. The 114,503 words are formed of $\#Letters$ = 1,133,515 letters (graphemes). So, the average length of a word measured in letters is $Word_{av-let}$ = 1,133,515/114,503 = 9.899.

In order to characterize the average length of a syllable measured in letters, we investigated two cases: a) the average length of the token syllables measured in letters is $LSyl_{token}$ = 1,133,515/483,931 = 2.342; b) the type syllables are formed of $\#TypeSyl_{let}$ = 13,576 letters, so the average length of a type syllable measured in letters is $LSyl_{type}$ = 13,576/3,730 = 3.639.

These statistics are computed for the words extracted from the dictionary, which were considered to be equally weighted. This excludes any information relating to the frequency of the words in writing or speech. For future research, large corpora of Italian texts can be leveraged in order to recompute these values and include frequency-based weights.

A list of the 20 most frequent syllables is included in Table 2.

    Index    Syllable    Frequency
      1        to          23943
      2        re          18199
      3        ta          12796
      4        te          10987
      5        si          10026
      6        a            9142
      7        co           8874
      8        ri           8868
      9        ca           8478
     10        ra           8388
     11        na           8367
     12        ti           8184
     13        ne           8112
     14        men          7841
     15        la           7175
     16        di           6663
     17        le           6555
     18        li           6176
     19        no           5748
     20        lo           5479

Table 2
Top 20 most frequent syllables.

2.3. Syllable Structure

We identified a total of 67 different consonant-vowel structures. The 7 most frequent structures cover almost 97% of the total. Counting by types and by tokens, the most frequent consonant-vowel structures are the following: a) for the type syllables: cvc (25%), ccvc (20.9%), cvvc (7.79%); b) for the token syllables: cv (58%), cvc (15%), ccv (7%), cvv (4.74%) and v (4.32%). Moreover, we observe that the cv structure corresponds to 40 of the 50 most frequent syllables in the dataset.

2.4. Stress Placement

We identified a total of 2,883 stressed syllables (type syllables). So, 847 syllables are never stressed. The 20 most frequent stressed syllables are listed in Table 3. We observe that the most frequent stressed syllable (men) has a very high stress ratio (90%) when we compare its stressed occurrences with all its occurrences (stressed and unstressed) in our database. While in the top 20 of all syllables men is the only syllable of length 3 (in 14th position), among the stressed syllables there are other syllables with a length greater than 2 (for instance, zio in position 6 with a 76% stress ratio according to Table 3, and gia in position 19 with a 65% stress ratio).

We investigate stress placement with regard to syllable structure, and we report in Table 4 the distribution of words by the position of the stressed syllable (for the first five positions), counting syllables both from the beginning and from the end of the word. We observe that in most cases the stress is placed on the second-to-last syllable.
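
As an illustration of how stress positions of the kind counted in Table 4 could be derived, the following minimal Python sketch locates the stressed syllable in a hyphen-separated, accent-marked entry such as the examples of Table 1. The function name `stress_position` and the set of accented vowels are assumptions made for this example; they are not taken from the ItGraSyll tooling.

```python
import unicodedata
from typing import Optional, Tuple

# Accented vowels assumed to mark the stressed syllable (illustrative set).
ACCENTED = set("àèéìíòóùú")


def stress_position(syllabified: str) -> Optional[Tuple[int, int]]:
    """Return the 1-based position of the stressed syllable,
    counted from the start and from the end of the word.

    Returns None when no syllable carries an accent mark.
    """
    # Normalise to precomposed characters so that "à" is a single code point.
    word = unicodedata.normalize("NFC", syllabified.lower())
    syllables = word.split("-")
    for i, syllable in enumerate(syllables):
        if any(ch in ACCENTED for ch in syllable):
            return i + 1, len(syllables) - i
    return None


for entry in ["a-bi-tà-co-lo", "ac-cu-mu-la-zió-ne", "ai"]:
    print(entry, stress_position(entry))
```

Counting the second element of the returned pair over all entries would give a tally by position from the end of the word; unaccented entries such as "ai" are returned as None and would need a separate convention.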
    Index    Syllable    Frequency    Stress ratio (%)
      1        men          7120            90
      2        ta           5809            45
      3        na           3348            40
      4        to           3254            15
      5        la           2978            41
      6        zio          2916            76
      7        ti           2820            34
      8        ca           2461            29
      9        ra           2297            27
     10        li           2239            36
     11        ri           2100            24
     12        tu           2024            62
     13        za           2022            42
     14        ni           1734            40
     15        tri          1458            60
     16        ma           1209            25
     17        si           1144            11
     18        da           1109            43
     19        gia          1081            65
     20        mi           1052            25

Table 3
Top 20 most frequent stressed syllables. The stress ratio indicates how often, out of all the occurrences of the syllable in the corpus, it appears as stressed.

    (a) counting syllables from the        (b) counting syllables from the
        beginning of the word                  end of the word

    Syllable     #words                    Syllable     #words
    1st           8,611                    1st           3,330
    2nd          25,544                    2nd          94,225
    3rd          40,568                    3rd          16,113
    4th          25,593                    4th              14
    5th           9,243                    5th               1

Table 4
Stress placement for Italian.

2.5. Syllables' Usage

Syllables exhibit a less intuitive behaviour: usually a small number of syllables covers a large part of a language. This holds for a wide range of natural languages, including English, Dutch, Romanian [28], Korean, Chinese, etc. We investigate here whether this empirical law also applies to Italian. We carried out this investigation both on stressed syllables and on syllables in general.

2.5.1. General Syllables

The 30 most frequent Italian syllables (when stress placement is disregarded) cover almost 50% of $\#Token_{syl}$, the 50 most frequent syllables cover 61%, the 100 most frequent cover 74%, and the 150 most frequent syllables (i.e. 4% of $\#Type_{syl}$) cover 80% of $\#Token_{syl}$. Beyond this number, the coverage rises slowly. 2,281 (61%) of the type syllables occur fewer than 10 times, and 1,174 syllables occur only once (hapax legomena).

2.5.2. Stressed Syllables

A similar trend can also be observed for the stressed syllables. Further, we notice that the most frequent syllables cover a large share of the total syllable frequency. For example, the 10 most frequent stressed syllables represent 31% of the total of stressed syllables, the top 50 syllables 60%, and the top 200 syllables 81% of the token syllables. The values are plotted in Figure 1, for all syllables and for stressed syllables.

[Figure 1: line plot of coverage (y-axis, 0.3 to 0.8) against the number of most frequent syllables (x-axis, 25 to 200), with one curve for all syllables and one for stressed syllables.]

Figure 1: The coverage of most frequent syllables.

This result shows that the law holds for Italian too: a very small number of syllables covers a large part of the language (only 150 syllables are needed to cover 80% of it).

3. Minimum Effort Laws

In this section we discuss two minimum effort laws that have been previously investigated for other languages and verify whether they apply to Italian as well.

3.1. Chebanow

Denoting by $F(n)$ the frequency of a word having $n$ syllables and by $i = \sum n F(n) / \sum F(n)$ the average length (measured in syllables) of the words, Chebanow [29] proposed the following law relating the average $i$ and the probability of occurrence $P(n)$ of the words having $n$ syllables:

$$P(n) = \frac{(i - 1)^{n-1}}{(n - 1)!} \, e^{1-i} \qquad (1)$$

For Italian, $i$ = 4.226.
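
To make Equation (1) concrete, the sketch below evaluates the theoretical probabilities next to the empirical ones obtained from the word counts of Table 1. It is only an illustration under stated assumptions: the counts are copied from Table 1, and the names `counts`, `mean_len` and `chebanow` are hypothetical, not taken from the authors' code.

```python
import math

# Word counts per number of syllables, copied from Table 1.
counts = {1: 722, 2: 5960, 3: 23286, 4: 41253, 5: 28357, 6: 10829,
          7: 3294, 8: 650, 9: 132, 10: 16, 11: 5}

total = sum(counts.values())
# Average word length in syllables (the quantity i of Equation (1)).
mean_len = sum(n * f for n, f in counts.items()) / total


def chebanow(n: int, i: float) -> float:
    """Theoretical probability of an n-syllable word under Equation (1)."""
    return (i - 1) ** (n - 1) / math.factorial(n - 1) * math.exp(1 - i)


print(f"average length i = {mean_len:.3f}")
for n in sorted(counts):
    empirical = counts[n] / total
    theoretical = chebanow(n, mean_len)
    print(f"n={n:2d}  empirical={empirical:.4f}  theoretical={theoretical:.4f}")
```

Equation (1) is a Poisson distribution with mean $i - 1$ shifted by one syllable, so the theoretical probabilities sum to 1 over all $n \geq 1$.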
[Figure 2, three panels: (a) The probability distribution of the length of words. (b) Theoretical representation of the probability distribution of the length of words. (c) Menzerath's Law: the more syllables in a word, the smaller its syllables.]

Figure 2: Minimum effort laws.


                 Model                                           Hyphen Acc.        Hyphen F1       Word Acc.
                 GRU for syllabification w/o stress markers        99.74%             99.69%         97.61%
                 GRU for syllabification w/ stress markers         99.82%             99.79%         98.41%
                 GRU for stress prediction                           —                  —            94.45%

Table 5
Performance metrics computed for automatic syllabification and stress prediction on the test set. We computed accuracy
and F1 scores on the sequence labelling predictions for syllabification, in order to assess how well the model predicts the
positions where the syllables split. Word-level metrics were computed for both syllabification and stress prediction; these
metrics are stricter, since any misplaced hyphen in the syllabification makes the entire prediction wrong.
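
To illustrate the difference between the hyphen-level and word-level metrics of Table 5, the sketch below encodes a hyphenated word as per-letter boundary labels (1 when a syllable starts at that letter, 0 otherwise) and scores a set of predictions. The helper names and the counting convention (every letter is scored, including the first one, which is always a boundary) are assumptions made for this example; the paper does not spell out these details.

```python
from typing import List, Sequence, Tuple


def boundary_labels(hyphenated: str) -> List[int]:
    """Label each letter 1 if a syllable starts there, else 0."""
    labels: List[int] = []
    for syllable in hyphenated.split("-"):
        if syllable:  # ignore empty pieces caused by stray hyphens
            labels.extend([1] + [0] * (len(syllable) - 1))
    return labels


def evaluate(gold: Sequence[List[int]],
             pred: Sequence[List[int]]) -> Tuple[float, float, float]:
    """Return (label accuracy, F1 for the boundary label, word accuracy)."""
    tp = fp = fn = correct = total = words_ok = 0
    for g, p in zip(gold, pred):
        # A word counts as correct only if every label matches.
        words_ok += int(g == p)
        for gl, pl in zip(g, p):
            total += 1
            correct += int(gl == pl)
            tp += int(gl == 1 and pl == 1)
            fp += int(gl == 0 and pl == 1)
            fn += int(gl == 1 and pl == 0)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return correct / total, f1, words_ok / len(gold)


gold = [boundary_labels("me-da-glio-ne"), boundary_labels("a-bi-ta-co-lo")]
# Hypothetical predictions: the second word has one misplaced hyphen.
pred = [boundary_labels("me-da-glio-ne"), boundary_labels("a-bit-a-co-lo")]

label_acc, label_f1, word_acc = evaluate(gold, pred)
print(f"label accuracy={label_acc:.3f}  F1={label_f1:.3f}  word accuracy={word_acc:.3f}")
```

A single misplaced hyphen changes only two labels, so the label-level scores stay high while the word-level accuracy drops by a whole word; this is why the word-level figures in Table 5 are lower than the hyphen-level ones.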



   In Figures 2a and 2b we plot the probability distribution of the length of words (in syllables): the empirical and the theoretical representations, respectively.

   We observe that the two curves have comparable shapes, with a more prominent peak for the empirical distribution in Figure 2a; this peak may be due to the fact that the distribution is computed over all the words in the dictionary, which contains many 4-syllable words.

3.2. Menzerath

Menzerath's law, later generalized as the Menzerath-Altmann law [30], states that the greater the number of syllables in a word, the smaller the number of phonemes composing these syllables. In other words, Menzerath's law expresses a negative correlation between the length of a word in syllables and the lengths in phonemes of its constituent syllables. In terms of cognitive economy, this means that the more complex a linguistic construct, the smaller its constituents. The law is expressed as follows:

$$y = \alpha x^{\beta} e^{-\gamma x} \qquad (2)$$

where $y$ is the syllable length (the size of the constituent), $x$ is the number of syllables per word (the size of the linguistic construct), and $\alpha$, $\beta$, $\gamma$ are empirical parameters. Figure 2c shows that the law is satisfied for Italian.

4. Automatic Syllabification and Stress Assignment

We further investigate how a deep-learning model can automatically infer the syllabification and stress assignment of Italian words, given their orthographic representation.

4.1. Methodology

Both tasks can be defined as sequence labelling problems, a strategy that was previously used successfully for Romanian [31, 32]. Let us consider, for example, the word medaglione (Italian for "locket"). For syllabification we can label each letter of the word either with the label 1, denoting that a syllable starts at that letter, or with the label 0, meaning that the respective letter is not the first letter of its syllable. Similarly, for identifying the stressed vowel, we can label its position with a 1, while all other letters are assigned the label 0. We thus obtain for our example the sequence 1010100010 for syllabification and the sequence 0000000100 for stress prediction (i.e. me-da-gliò-ne, where the vowel o is stressed).

   With these definitions, we can now construct machine learning models for labelling the character sequences. The model we propose is a recurrent neural network based on Gated Recurrent Units (GRU) [33]. The model architecture comprises the following components:
     • a character embedding layer, producing 64-dimensional vectors for each unique character
     • a stacked bidirectional GRU, with 3 layers and a 128-dimensional hidden state; a 0.2-rate dropout is applied after each of the first two layers
     • 0.5-rate dropout, after the last GRU layer, along with one-dimensional batch normalization
     • a time-distributed fully-connected layer with 256 output nodes and ReLU activation
     • a linear layer that projects the 256-dimensional vector into a single number, on which sigmoid activation is applied to infer the binary labels.

   For training the models for both tasks, the dataset of words is split into 50% training examples and 50% test examples, unseen during training.
   The loss function computed for the prediction made for a word, regardless of the task on which the model is trained, is the average of two terms: the first one is the average character-wise binary cross-entropy, while the second one is the root mean squared error computed between the vector of predicted labels and the ground-truth vector. The model is optimized using the Adam optimizer [34], with a learning rate of 0.0003, no weight decay, a batch size of 32, and a learning-rate scheduler that halves the learning rate every 5 epochs. The models are trained for 10-15 epochs.
   For the task of automatic syllabification, we wanted to check whether the presence of the stress markers affects the performance of the model. We therefore trained two models: the first one was trained using the spelling of the words with the stress markers removed, while the second one was trained with them included.

4.2. Results Analysis

Table 5 contains the metrics computed on the test set, using the models trained for syllabification (both with and without stress markers) and the model trained for predicting the stressed vowel. We obtained a hyphen accuracy of 99.74% for syllabification without the stress markers, which increased to 99.82% when the stress markers were added. Including the stress markers in the data used for syllabification improved the metrics across the board, most notably with a ∼1% increase in word-level accuracy, which, considering the large amount of data and the high accuracy scores, is a significant improvement (460 fewer syllabification mistakes than the approach that excludes stress markers). Regarding stress prediction, we obtained an accuracy of 94.45%. Table 6 showcases a series of wrong predictions generated by the models on the test sets for stress assignment and syllabification.
   We also look into the accuracy scores computed for the test set when it is bucketed by the true number of syllables of the test words. These results are shown in Figure 3 and Table 7. For stress assignment, accuracy decreases to a global minimum for disyllabic words, then increases again with the number of syllables. For the syllabification task, including the stress markers outperforms excluding them in most scenarios, while both accuracies peak around the 5-syllable mark. This result seems to align with the distribution of syllables in the dataset, i.e. higher scores are obtained for syllable counts with more examples. For stress assignment errors, we also investigate the placement of the predicted stressed syllable in relation to the true one (see Table 8). 95.6% of the errors misplaced the stressed syllable by at most one position to the left or to the right, while almost two thirds of the erroneous predictions placed the stress on the first syllable to the right of the correct one.

                 Stress Assignment Errors
                  True            Predicted
                 bàlano             balanò
                 fèmore             femòre
                 dòlmen             dolmèn
                 tùtolo             tutòlo
                 pudìco             pùdico
                 corsìa             còrsia

                  Syllabification Errors
                  True            Predicted
                 mu-o-ne            muo-ne
                 bion-da            bi-on-da
                 cli-en-te          clien-te
                 co-di-a-to         co-dia-to
                 ma-nu-brio         ma-nu-bri-o
                 spa-tria-to        spa-tri-a-to

Table 6
Examples of erroneous test predictions provided by the deep-learning models.

Figure 3: The test accuracies for each of the three tasks, computed independently on the test words, bucketed by their true number of syllables. (x-axis: Num. Syllables, 1 to 11; y-axis: Accuracy; series: Stress Assignment, Syllabification (w/o stress markers), Syllabification (w/ stress markers).)
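
The training loss described in Section 4.1 (the average of a character-wise binary cross-entropy term and a root-mean-squared-error term) can be illustrated for a single word in plain Python. This is only a minimal sketch, not the authors' implementation: the function name `word_loss`, the clipping constant `eps`, and the example probabilities are assumptions introduced here for illustration.

```python
import math

def word_loss(probs, labels, eps=1e-7):
    """Average of the mean character-wise BCE and the RMSE for one word.

    probs  -- predicted probabilities in [0, 1], one per character
    labels -- ground-truth binary labels, one per character
    eps    -- clipping constant (an assumption, added to avoid log(0))
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equally long")
    n = len(probs)
    bce = 0.0
    squared_error = 0.0
    for p, y in zip(probs, labels):
        squared_error += (p - y) ** 2
        p = min(max(p, eps), 1.0 - eps)  # clip only for the logarithms
        bce += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    bce /= n
    rmse = math.sqrt(squared_error / n)
    return 0.5 * (bce + rmse)

# Stress labels for me-da-gliò-ne as given in the text: 0000000100
labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
confident = [0.01] * 7 + [0.98] + [0.01] * 2   # stress on the correct character
shifted = [0.01] * 8 + [0.98] + [0.01]         # stress one character too late
print(word_loss(confident, labels) < word_loss(shifted, labels))
```

Both terms penalise the shifted prediction, so its loss is larger than that of the confident, correct prediction.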
   Num. Syllables       Num. Words        Stress Assignment       Syllabification (w/o SM)       Syllabification (w/ SM)
                   1               721             99.03%                    83.63%                        84.88%
                   2             5,960             92.94%                    96.56%                        97.80%
                   3            23,286             94.46%                    98.55%                        99.19%
                   4            41,253             97.42%                    99.03%                        99.48%
                   5            28,357             98.92%                    99.33%                        99.49%
                   6            10,829             99.48%                    99.23%                        99.26%
                   7             3,294             99.67%                    99.15%                        99.15%
                   8               650             100.0%                    99.23%                        98.46%
                   9               132             100.0%                    99.24%                        99.24%
                  10                16             100.0%                    93.75%                        93.75%
                  11                 5             100.0%                    100.0%                        100.0%

Table 7
Actual values of the test accuracies plotted in Figure 3 for the three tasks: stress assignment, and syllabification with/without stress markers (SM) included. These scores are computed separately for words with the same
number of syllables.
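
Per-bucket scores of this kind can be obtained by grouping the test words by their true syllable count and computing word-level accuracy inside each group. The sketch below assumes that gold and predicted syllabifications are available as hyphen-separated strings; the function name `accuracy_by_syllable_count` is hypothetical, and the toy inputs are illustrative and do not reproduce the paper's predictions.

```python
from collections import defaultdict

def accuracy_by_syllable_count(gold, predicted):
    """Word-level accuracy per bucket, keyed by the gold number of syllables."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    totals = defaultdict(int)
    correct = defaultdict(int)
    for g, p in zip(gold, predicted):
        n_syllables = g.count("-") + 1  # true syllable count from the gold form
        totals[n_syllables] += 1
        if g == p:
            correct[n_syllables] += 1
    return {n: correct[n] / totals[n] for n in sorted(totals)}

# Toy data (assumed for illustration only)
gold = ["bion-da", "mu-o-ne", "cli-en-te", "me-da-glio-ne"]
pred = ["bi-on-da", "mu-o-ne", "clien-te", "me-da-glio-ne"]
result = accuracy_by_syllable_count(gold, pred)
for n, acc in result.items():
    print(f"{n} syllables: {acc:.2%}")
```

A word counts as correct only when the whole predicted syllabification matches the gold form, which is the word-level notion of accuracy used in the table.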


                                  Stressed Syllable Delta      Num. Errors       Pct. Errors
                                              -2                           21       0.74%
                                              -1                          804      28.38%
                                              0                            95       3.35%
                                              1                         1,809      63.85%
                                              2                           102       3.60%
                                              3                             2       0.07%

Table 8
Starting from the incorrect predictions for stress assignment, we compute how far the assigned stress is from the actual one,
in numbers of syllables (delta). A delta of −2 means that the predicted stressed syllable is the second one to the left of the
correct stressed syllable. A delta of 0 in this situation means that the algorithm predicted the stressed vowel incorrectly, but
the prediction sits inside the correct stressed syllable.
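
The delta can be computed by mapping the true and the predicted stressed characters to syllable indices and subtracting them. The following sketch assumes the gold syllabification is given with hyphens and that stress positions are 0-based character indices in the unhyphenated word; the helper names and the syllabifications `fe-mo-re` and `pu-di-co` are assumptions made for illustration. The last lines recompute, from the counts in Table 8, the share of errors that fall within one syllable of the correct position.

```python
def syllable_index(syllabified, char_index):
    """Return the 0-based syllable that contains the given character position.

    syllabified -- gold form with hyphens, e.g. "fe-mo-re"
    char_index  -- 0-based position in the word with hyphens removed
    """
    position = 0
    for i, syllable in enumerate(syllabified.split("-")):
        position += len(syllable)
        if char_index < position:
            return i
    raise IndexError("char_index is outside the word")

def stress_delta(syllabified, true_char, predicted_char):
    """Signed distance, in syllables, from the true to the predicted stress."""
    return (syllable_index(syllabified, predicted_char)
            - syllable_index(syllabified, true_char))

# fèmore predicted as femòre (Table 6): true index 1, predicted index 3
print(stress_delta("fe-mo-re", 1, 3))
# pudìco predicted as pùdico (Table 6): true index 3, predicted index 1
print(stress_delta("pu-di-co", 3, 1))

# Error counts per delta, copied from Table 8
counts = {-2: 21, -1: 804, 0: 95, 1: 1809, 2: 102, 3: 2}
total = sum(counts.values())
within_one = sum(c for d, c in counts.items() if abs(d) <= 1) / total
print(f"{within_one:.1%} of {total} errors are at most one syllable away")
```

A positive delta means the predicted stress lies to the right of the true one, matching the sign convention of the table.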



5. Conclusions

In this paper we have investigated graphical syllabification and graphical stress assignment for Italian words. We have started by building ItGraSyll, a dataset of Italian graphical syllabified words, with stress annotations as well, on which we have performed several quantitative and qualitative analyses, including the verification of two minimum effort laws for the case of Italian. Finally, we have proposed a recurrent neural network machine learning model for automatic syllabification and stress assignment for Italian written words. For stress prediction we have obtained 94.45% word-level accuracy, and for syllabification we have obtained 98.41% word-level accuracy and 99.82% hyphen-level accuracy. In future work we intend to extend the analysis from dictionary level to corpus level and to investigate other languages as well.

Acknowledgments

We want to thank the reviewers for their useful suggestions. Research supported by the Ministry of Research, Innovation and Digitization, CNCS/CCCDI UEFISCDI, SiRoLa project, number PN-IV-P1-PCE-2023-1701, Romania.

References

 [1] S. Suyanto, Incorporating syllabification points into a model of grapheme-to-phoneme conversion, International Journal of Speech Technology 22 (2019) 459–470.
 [2] V. N. Vitale, F. Cutugno, A. Origlia, G. Coro, Exploring emergent syllables in end-to-end automatic speech recognizers through model explainability technique, Neural Comput. Appl. 36 (2024) 6875–6901. URL: https://doi.org/10.1007/s00521-024-09435-1. doi:10.1007/S00521-024-09435-1.
 [3] A. Dinu, L. P. Dinu, On the syllabic similarities of romance languages, in: A. F. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing, 6th International Conference, CICLing 2005, Mexico City, Mexico, February 13-19, 2005, Proceedings, volume 3406 of Lecture Notes in Computer Science, Springer, 2005, pp. 785–788. URL: https://doi.org/10.1007/978-3-540-30586-6_88. doi:10.1007/978-3-540-30586-6_88.
 [4] J. P. Kincaid, L. R. P. F. Jr., R. L. Rogers, B. S. Chissom, Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease formula) for Navy enlisted personnel, Research Branch Report, Millington, TN: Chief of Naval Training, 1975.
 [5] G. Marco, J. de la Rosa, J. Gonzalo, S. Ros, E. González-Blanco, Automated Metric Analysis of Spanish Poetry: Two Complementary Approaches, IEEE Access 9 (2021) 51734–51746.
 [6] A. M. Ciobanu, L. P. Dinu, On the romanian rhyme detection, in: Proceedings of COLING 2012: Demonstration Papers, 2012, pp. 87–94.
 [7] L. Hjelmslev, The syllable as a structural unit, in: the Proceedings of the 3rd International Congress of Phonetic Sciences (Ghent), 1938, volume 266, 1938.
 [8] F. De Saussure, Course in general linguistics, Columbia University Press, 2011.
 [9] D. Russo, The Notion of Syllable across History, Theories and Analysis, Cambridge Scholars Publishing, 2016.
[10] M. Loporcaro, Syllable, segment and prosody, in: The Cambridge history of the Romance languages, 2011, pp. 50–108.
[11] W. Meyer-Lübke, Grammaire des langues romanes, volume 4, H. Welter, 1906.
[12] M.-D. Glessgen, Linguistique romane: domaines et méthodes en linguistique française et romane, Armand Colin, 2007.
[13] F. S. Miret, Fonética histórica, in: Manual de lingüística románica, Ariel España, 2007, pp. 227–250.
[14] F. d’Ovidio, W. Meyer-Lübke, Grammatica storica della lingua e dei dialetti italiani, volume 368, U. Hoepli, 1906.
[15] G. Rohlfs, T. Franceschi, Grammatica storica della lingua italiana e dei suoi dialetti: Morfologia, (No Title) (1968).
[16] B. Bigi, C. Petrone, A generic tool for the automatic syllabification of italian, A generic tool for the automatic syllabification of Italian (2014) 73–77.
[17] C. R. Adsett, Y. Marchand, Are Rule-based Syllabification Methods Adequate for Languages with Low Syllabic Complexity? The Case of Italian, in: P. Wagner, J. Abresch, S. Breuer, W. Hess (Eds.), Sixth ISCA Workshop on Speech Synthesis, Bonn, Germany, August 22-24, 2007, ISCA, 2007, pp. 58–63.
[18] C. R. Adsett, Y. Marchand, V. Keselj, Syllabification rules versus data-driven methods in a language with low syllabic complexity: The case of italian, Comput. Speech Lang. 23 (2009) 444–463. URL: https://doi.org/10.1016/j.csl.2009.02.004. doi:10.1016/j.csl.2009.02.004.
[19] K. A. Rogova, K. Demuynck, D. V. Compernolle, Automatic syllabification using segmental conditional random fields, in: Computational Linguistics in the Netherlands Journal, volume 3, 2013, pp. 34–48.
[20] L. P. Dinu, V. Niculae, O. Sulea, Romanian syllabication using machine learning, in: I. Habernal, V. Matousek (Eds.), Text, Speech, and Dialogue - 16th International Conference, TSD 2013, Pilsen, Czech Republic, September 1-5, 2013. Proceedings, volume 8082 of Lecture Notes in Computer Science, Springer, 2013, pp. 450–456.
[21] J. Krantz, M. W. Dulin, P. D. Palma, Language-Agnostic Syllabification with Neural Sequence Labeling, 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA) (2019) 804–810.
[22] V. N. Vitale, L. Schettino, F. Cutugno, On incrementing interpretability of machine learning models from the foundations: A study on syllabic speech units, in: F. Boschetti, G. E. Lebani, B. Magnini, N. Novielli (Eds.), Proceedings of the 9th Italian Conference on Computational Linguistics, Venice, Italy, November 30 - December 2, 2023, volume 3596 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3596/paper51.pdf.
[23] O. Sulea, L. P. Dinu, B. Dumitru, Full inflection learning using deep neural networks, in: A. F. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing - 19th International Conference, CICLing 2018, Hanoi, Vietnam, March 18-24, 2018, Revised Selected Papers, Part I, volume 13396 of Lecture Notes in Computer Science, Springer, 2018, pp. 408–415. URL: https://doi.org/10.1007/978-3-031-23793-5_33. doi:10.1007/978-3-031-23793-5_33.
[24] M. Petrillo, F. Cutugno, A syllable segmentation algorithm for english and italian, in: INTERSPEECH 2003, 2003, pp. 2913–2916.
[25] Q. Dou, S. Bergsma, S. Jiampojamarn, G. Kondrak, A Ranking Approach to Stress Prediction for Letter-to-Phoneme Conversion, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1, ACL ’09, Association for Computational Linguistics, 2009, pp. 118–126.
[26] L. P. Dinu, An approach to syllables via some extensions of marcus contextual grammars, Grammars 6 (2003) 1–12. URL: https://doi.org/10.1023/A:1024089129146. doi:10.1023/A:1024089129146.
[27] L. P. Dinu, A. Dinu, On the data base of romanian syllables and some of its quantitative and cryptographic aspects, in: N. Calzolari, K. Choukri, A. Gangemi, B. Maegaard, J. Mariani, J. Odijk, D. Tapias (Eds.), Proceedings of the Fifth International Conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy, May 22-28, 2006, European Language Resources Association (ELRA), 2006, pp. 1795–1798.
[28] L. P. Dinu, A. Dinu, On the behavior of romanian syllables related to minimum effort laws, in: Proceedings Workshop Multilingual Resources, Technologies and Evaluation for Central and Eastern European Languages, co-located with RANLP 2009, Borovets, Bulgaria 2006, 2009, pp. 9–13.
[29] S. Chebanow, On conformity of language structures within the Indoeuropean family to poisson’s law, Comptes rendus de l’Academie de science de l’URSS 55 (1947) 99–102.
[30] G. Altmann, Prolegomena to Menzerath’s Law, Glottometrika 2 (1980) 1–10.
[31] A. M. Ciobanu, A. Dinu, L. P. Dinu, Predicting romanian stress assignment, in: G. Bouma, Y. Parmentier (Eds.), Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2014, April 26-30, 2014, Gothenburg, Sweden, The Association for Computer Linguistics, 2014, pp. 64–68. URL: https://doi.org/10.3115/v1/e14-4013. doi:10.3115/V1/E14-4013.
[32] L. P. Dinu, A. M. Ciobanu, I. Chitoran, V. Niculae, Using a machine learning model to assess the complexity of stress systems, in: N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland, May 26-31, 2014, European Language Resources Association (ELRA), 2014, pp. 331–336. URL: http://www.lrec-conf.org/proceedings/lrec2014/summaries/1200.html.
[33] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using rnn encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078 (2014).
[34] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).