=Paper= {{Paper |id=Vol-1749/paper47 |storemode=property |title=Sardinian on Facebook: Analysing Diatopic Varieties through Translated Lexical Lists |pdfUrl=https://ceur-ws.org/Vol-1749/paper47.pdf |volume=Vol-1749 |authors=Irene Russo,Simone Pisano,Claudia Soria |dblpUrl=https://dblp.org/rec/conf/clic-it/RussoPS16 }} ==Sardinian on Facebook: Analysing Diatopic Varieties through Translated Lexical Lists == https://ceur-ws.org/Vol-1749/paper47.pdf
Sardinian on Facebook: Analysing Diatopic Varieties through Translated Lexical Lists

Irene Russo, ILC CNR, Pisa, irene.russo@ilc.cnr.it
Simone Pisano, Università Guglielmo Marconi, Roma, s.pisano@unimarconi.it
Claudia Soria, ILC CNR, Pisa, claudia.soria@ilc.cnr.it

Abstract

English. The presence of regional and minority languages on digital media is an indicator of their vitality. In this paper we investigate quantitative aspects of the use of the Sardinian language on Facebook. In particular, we focus on the co-existence of diatopic varieties. We extracted linguistic data from public pages and, through the translation of the most frequent words, we identified similarities and differences between varieties.
Italiano. La presenza e l'uso delle lingue regionali e minoritarie sui mezzi digitali è un indicatore della loro vitalità. In questo lavoro vogliamo concentrarci sugli aspetti quantitativi del sardo usato su Facebook. In particolare, vogliamo analizzare le varietà diatopiche estraendo i dati linguistici dalle pagine pubbliche. Mediante la traduzione delle parole più frequenti abbiamo trovato similarità e differenze tra le varietà.

1   Introduction

Everyday life makes increasingly extensive use of digital devices that involve language use; for this reason, the usability of a language on digital devices is a sign that the language is modern, relevant to current lifestyles and capable of facing the needs of the 21st century. A positive correlation between presence in new technologies and better appreciation of a language has been repeatedly observed in the literature, see for instance (Eisenlohr, 2004) and (Crystal, 2010). Regional and minority languages (RMLs henceforth) are usually very poorly represented digitally (Soria, 2016).
Since poor digital representation of regional and minority languages further prevents their usability on digital media and devices, it is extremely important to enhance every bottom-up effort that can boost the quantity of available digital content. In fact, if the perception of the marginal role and limited applicability of RMLs persists, their attractiveness diminishes.
An increase in the quantity of digital content available online represents today an opportunity for regional and minority languages. Online speakers can make visible the existence of a community that uses the language to interact; they can use online communication to converge toward a standard; and they can guide less skilled speakers toward a better mastery of the rules of the language, especially when the language is not formally included in education. From the perspective of computational linguistics, the presence of digital content written in RMLs means that corpora can be built for them and basic tools (lemmatizers, spell checkers, lexicons etc.) can be developed.
The presence of RMLs on digital media and their usability through digital devices is often limited to instances of digital activism and/or to cultural initiatives focused on the preservation of cultural heritage.
In this paper we present the first study we are aware of about the use on social networks (more specifically, Facebook) of Sardinian, an Italian minority language characterised by the coexistence of varieties and by the difficulties the promoted standard faces in emerging as a unifying factor. Our starting hypothesis concerned the vitality on social networks of a language that is mainly spoken. With the help of a Sardinian linguist, we identified a small set of FB public groups where specific varieties of Sardinian are chosen as the main language, plus groups where generic, not further defined Sardinian is used to communicate. We extracted messages from these pages and created a frequency lexicon for each variety. The 150 most frequent words were translated by a Sardophone expert linguist with the aim of finding differences and commonalities between varieties. This preliminary analysis is the first step toward the use of computational linguistics methodologies in the promotion of a standard for Sardinian based on quantitative data.

2   Sardinian today: Main Varieties and Standardization Efforts

Sardinian is an autonomous Romance language spoken on the island of Sardinia. According to (Lupinu, 2007) it is known by approximately 68.4% of the population of the island. Ethnologue [1] lists four varieties for Sardinian: North-western Sardinian or Sassarese (100,000 speakers ca.), Campidanese (500,000 speakers ca.), Central Sardinian or Logudorese (500,000 speakers ca.) and Gallurese (100,000 speakers ca.).
The most important differences from a lexical, phonological and morphological point of view within Sardinian can be found between Central-Southern and Central-Northern dialects.
Scholars usually divide Sardinian into two main varieties: Logudorese and Campidanese, the first one spoken in the North and in the center of the island and the second one spoken in the South.
Logudorese and Campidanese can be related to two different pre-existing written standards: the so-called Logudorese (or Logudorese illustre) was used for the first time in a short poem at the end of the 15th century (Manca, 2002), whereas what is known as Campidanese was the language of some religious plays at the end of the 17th century (De Martini Abdullah Luca, 2006).
Today, Sardinian lacks a generally agreed standard variety, although standardization efforts have characterised the recent history of the Region.
The first attempt to introduce a written system based on an integration of phonetic, lexical and morphological features of modern Sardinian varieties was made in 2001, when the basic rules of LSU (Limba Sarda Unificada, Unified Sardinian Language) were presented (Blasco Ferrer, 2001). This proposal was sharply criticised by some sectors of public opinion, and strong disapproval came even from a part of the native speakers, especially from the South, who considered this standard too different from the language they spoke. It never became a model of official Sardinian.
In 2006, another model of written language was made official by the Regional Committee resolution no. 16/14. This standard, called LSC (Limba Sarda Comuna, Common Sardinian Language) [2], made the effort of also taking into account the dialects of the transition region of the center mentioned earlier. Although the regional administration recommended its use for written public documents, it is still reluctantly accepted by some speakers, who perceive it as too distant from the varieties they speak.
In 2010, the Provincial Council of Cagliari took a different course, choosing with the Provincial Committee resolution no. 17 a linguistic norm [3] based on the literary language of Southern poets and writers, in order to draw up acts, documents and even textbooks for primary school children.
All these standardization efforts, whether politically guided or emerging bottom-up, clearly show that Sardinian speakers are aware of the role of standard orthography and grammar for the vitality and the survival of their language. On the one hand, they want to promote the idea of a unique language as a matter of identity; on the other, they don't want to lose local peculiarities by adopting standard rules that inevitably hide some local differences.
Social media are widely used by Sardinian speakers and they represent an interesting scenario for written but informal use of the language. An in-depth analysis of the type of language used by Sardinian speakers on social media is still missing. Certainly, the use of everyday Sardinian in spoken and written (online) informal communication is a sign of vitality of the language. Interaction is a powerful instrument for standardization, and the interactive modality offered by social media could reveal the emergence of coordination strategies toward a standard in the speakers' community as a natural need (Burghardt, 2016). To check this hypothesis, we started to analyse the use of different varieties of Sardinian on Facebook. According to the preliminary data of a recent survey, Facebook is the social medium most used by Sardinian speakers, and one where Sardinian is actively and extensively used [4].

[1] www.ethnologue.com
[2] Regione Autonoma della Sardegna (2006), Limba Sarda Comuna. Norme linguistiche di riferimento a carattere sperimentale per la lingua scritta dell'Amministrazione regionale, Cagliari, Regione Autonoma della Sardegna.
[3] Arrègulas po s'ortografia, sa fonètica, sa morfologia e su fueddàriu de sa bariedadi Campidanesa de sa lìngua sarda (Rules for the orthography, phonetics, morphology and vocabulary of the Campidanese variety of the Sardinian language)
[4] Preliminary data of the DLDP Survey (www.dldp.eu) "Su Sardu: una limba digitale?". In July 2016, Facebook appears to be used by 98.1% of the respondents. Of those, 44% use Sardinian for writing and reading posts and messages, and 32.5% only for reading.

3   Data Extraction and Analysis

We selected public pages and communities on Facebook that are rich in content and interactions between users. With the help of a Sardinian linguist we identified four mutually exclusive sets:

    • pages where people communicate in LSC;
    • pages where people communicate in Sardinian without further specification of the chosen variety;
    • pages where people communicate in Campidanese;
    • pages where people communicate choosing a local variety (in our case Nugoresu, a local variety of Logudorese).

All the messages were extracted from the JSON of the pages obtained through the Facebook API. Lowercased texts were tokenized by splitting on whitespace. Four frequency lists were created, with emoticons and symbols deleted. The 150 most frequent words were translated into Italian by a Sardinian linguist, who also provided PoS and morphological annotation plus all the available translations in the case of polysemous words. We left Italian words in these lists because every cleaning procedure (lists of Italian words, PoS for Italian etc.) was risky: very frequent words in Sardinian can be found in Italian too (e.g. a, chi, bonus, cosa) with a different meaning.
Table 1 reports basic statistics about public pages and communities in the four sets listed above. Active users are the ones who wrote at least one message on the page. The number of active users and messages varies for each set, but it was not possible to get a balanced sample.
Table 2 reports the number of tokens and types for the four sets of Facebook groups analysed. In Table 3 each possible pair of varieties is compared by checking the overlap of translations into Italian. The second column reports how many Italian types are in common between two varieties. For example, among the 150 most frequent LSC word forms and the 150 most frequent Sardu word forms, 61 words have the same Italian translation. The third column contains the number of words with the same word forms in the two varieties compared, e.g. the Italian adjective grande has the same word form (mannu) in Nugoresu and Campidanese. This is a first attempt to understand whether two varieties are close orthographically, considering the orthographic forms of the analysed words. We also report the number of content words found in each pair because we believe that in the future the overlap at the orthographic level should be analysed taking into account the distinction between content and function words.
The fourth column contains the number of word forms related to the types in common which are different in the two varieties, e.g. for the Italian word è, third person singular of the verb to be in the present tense, LSC has just one word, est, while Campidanese has est and esti. In this case esti is counted as a different form and is included in the table under the fourth column.
Table 4 summarises for each pair how variability patterns are distributed, where pattern 1 to 2 means that there is one word form for variety a that corresponds to two word forms for variety b. We know that the group Sardu contains data from more than one variety and we plan a more detailed analysis as future work. For the moment we note a clear overlap, because speakers of LSC contribute with posts and comments on pages where people communicate in Sardinian. For the same reason, when Sardu is one of the items in the pair we notice more variability patterns (see Table 4).
Concerning the comparisons between LSC and the two main varieties Campidanese and Logudorese, the latter represented in our data by the local variety Nugoresu, we found evidence of the distance between the two main varieties, with an overlap of 41.5% in terms of word forms. LSC and Campidanese have an overlap of 64.2%, while LSC and Nugoresu have an overlap of 83%. LSC emerges as a variety that tried to set a linguistic common ground and achieved this result, even if there is a bias toward the Logudorese variety, one of the complaints of Campidanese speakers (see Section 2).

 page name                                                       type           #users   #active users   #messages   variety
 LSC, Limba Sarda Comuna: Sotziedade pro sa limba sarda comuna   Community      590      27              160         LSC
 Iscritores in limba sarda                                       Public Group   331      49              916         LSC
 Amigosde-sa-Limba-Sarda-Comuna                                  Community      1673     13              40          LSC
 Solu in sardu                                                   Public Group   15890    5701            373430      generic Sardinian
 Solu poesias                                                    Public Group   2018     158             1679        generic Sardinian
 Scrieusu in campidanesu                                         Public Group   1984     576             17960       Campidanese
 Cabuderra lngua e cultura                                       Public Group   116      1               18          Campidanese
 Sos chi li piacheta faveddare e a iscrivere nugoresu            Public Group   984      438             1157        Nugoresu

                                    Table 1: Basic statistics about data extracted.

 FBgroup              tokens        types (freq. > 10)
 LSC                   71018             847
 Sardu                3300408          18248
 Campidanese           257110           2285
 Nugoresu              379802           3412

 Table 2: Basic statistics about tokens and types.

4   Conclusion and Future Work

In this paper we address the following open question: could quantitative analysis of written data help the Sardinian community to find a common core (not specific to one variety) that could reinvigorate the idea of a standard? We plan future work on this issue, with the awareness that digital content on social media is both an opportunity and a challenge for this kind of analysis.
This paper is a first analysis of diatopic varieties of Sardinian through orthographical comparisons of word forms with the same meaning. Thanks to translated lists it was possible to look at commonalities and differences between varieties. Social media are a source of real data about language use and the best observatory for regional and minority languages. For Sardinian, Facebook offers the possibility to test the distance between the proposed orthographic standard and the existing varieties. We will test the interplay between varieties with other methodologies to measure the distance and to find usage patterns (e.g. Levenshtein distance for similar words).
This work is being carried out in the framework of the project DLDP (Digital Language Diversity Project, http://www.dldp.eu). DLDP is a three-year project funded under the Erasmus+ programme. It aims at addressing the problem of low digital representation of EU regional and minority languages by giving their speakers the intellectual and practical skills to create, share, and reuse online digital content. DLDP fully embraces a bottom-up approach to language revitalization by addressing the speakers' cognitive and practical skills as the cornerstone of effective revitalization initiatives.

Acknowledgments

This work is partially funded by the Erasmus+ DLDP Project (Grant Agreement no. 2015-1-IT02-KA204-015090). The opinions expressed reflect only the authors' view, and the Erasmus+ National Agency and the Commission are not responsible for any use that may be made of the information contained.

References

Blasco Ferrer E., Bolognesi, R. et al. 2001. Limba
  Sarda Unificada. Sintesi delle norme di base: ortografia,
  fonetica, morfologia, lessico. Cagliari, Regione
  Autonoma della Sardegna.
Burghardt, M., Granvogl, D. and Wolff, C. 2016. Creating
  a Lexicon of Bavarian Dialect by Means of
  Facebook Language Data and Crowdsourcing.
  Proceedings of LREC-2016. Portoroz, Slovenia.
Crystal, D. 2010. Language Death. Cambridge University
  Press.
De Martini Abdullah Luca (ed.) 2001. Libro de Comedias
  (by Antonio Maria da Esterzili). Cagliari, Cuec.
Eisenlohr, P. 2004. Language revitalization and new
  technologies: Cultures and electronic mediation and
  the refiguring of communities. Annual Review of
  Anthropology. 18(3):339-361.
Lupinu, G., Mongili, A., Oppo, A., Spiga, R., Perra,
  S., Valdes, M. 2007. Le lingue dei sardi: una
  ricerca sociolinguistica. Assessorato alla Pubblica
  istruzione, beni culturali, informazione, spettacolo e
  sport, Regione Autonoma della Sardegna.
De Martini Abdullah Luca (ed.) 2002. Sa Vitta et sa
  Morte, et Passione de sanctu Gavinu, Prothu et Januariu
  (by Antonio Cano). Cagliari, Cuec.
                                common types        types with same word forms   types with different word forms
 LSC-Sardu                          61                 60 (21 content words)          32 (11 content words)
 LSC-Campidanese                    67                 43 (14 content words)          44 (17 content words)
 LSC-Nugoresu                       65                 54 (14 content words)          39 (17 content words)
 Sardu-Campidanese                  65                 47 (15 content words)          64 (26 content words)
 Sardu-Nugoresu                     70                 64 (16 content words)          65 (33 content words)
 Campidanese-Nugoresu               81                 34 (12 content words)          82 (27 content words)

                              Table 3: Comparison between Sardinian varieties.
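
The overlap percentages quoted in the text can be re-derived from the counts in Table 3. The sketch below assumes (this is our reading, the paper does not state the formula) that the overlap of a pair is the number of types with the same word form divided by the number of common types. The names `TABLE3` and `overlap_percent` are illustrative only.

```python
# Counts copied from Table 3:
# pair -> (common types, types with same word forms, types with different word forms)
TABLE3 = {
    "LSC-Sardu": (61, 60, 32),
    "LSC-Campidanese": (67, 43, 44),
    "LSC-Nugoresu": (65, 54, 39),
    "Sardu-Campidanese": (65, 47, 64),
    "Sardu-Nugoresu": (70, 64, 65),
    "Campidanese-Nugoresu": (81, 34, 82),
}


def overlap_percent(same_forms: int, common_types: int) -> float:
    """Share of common Italian types realised by an identical word form.

    Assumed definition, not confirmed by the paper.
    """
    return round(100 * same_forms / common_types, 1)


for pair, (common, same, _different) in TABLE3.items():
    print(f"{pair}: {overlap_percent(same, common)}%")
```

Under this assumption LSC-Campidanese yields 64.2 and LSC-Nugoresu 83.1, in line with the figures reported in the text. Campidanese-Nugoresu yields 42.0 with 81 common types, whereas the reported 41.5% is obtained when dividing by 82, so the denominator used for that pair remains uncertain.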

                                 1 to 2    1 to 3   1 to 4
 LSC - Sardu                       8         4         1
 LSC - Campidanese                 3         0         0
 LSC - Nugoresu                    7         1         1
 Sardu - Campidanese               5         1         5
 Sardu - Nugoresu                  8         3         5
 Campidanese - Nugoresu            2         0         1

Table 4: Comparison between Sardinian varieties.
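
The counts behind Tables 3 and 4 can be illustrated with a small sketch of how translated lexical lists might be compared. It uses only the two examples quoted in the paper (Italian è and grande). The helper `compare` and its counting rules are assumptions made for illustration, not the authors' procedure: a type counts as "same" when the two varieties share at least one word form, forms attested in only one variety count as "different", and a "1 to n" pattern is recorded when one variety has a single form against n forms in the other.

```python
from collections import Counter

# Italian type -> set of word forms, using only the examples quoted in the paper.
lsc = {"è": {"est"}}
campidanese = {"è": {"est", "esti"}, "grande": {"mannu"}}
nugoresu = {"grande": {"mannu"}}


def compare(lex_a, lex_b):
    """Compare two {italian_type: set_of_word_forms} lexicons (hypothetical helper)."""
    common = sorted(lex_a.keys() & lex_b.keys())
    # Types realised by at least one identical word form in both varieties.
    same = [t for t in common if lex_a[t] & lex_b[t]]
    # Word forms attested in only one of the two varieties, per common type.
    different = {t: sorted(lex_a[t] ^ lex_b[t]) for t in common if lex_a[t] ^ lex_b[t]}
    # "1 to n": a single form in one variety against n > 1 forms in the other.
    patterns = Counter()
    for t in common:
        low, high = sorted((len(lex_a[t]), len(lex_b[t])))
        if low == 1 and high > 1:
            patterns[f"1 to {high}"] += 1
    return {"common": common, "same": same, "different": different, "patterns": patterns}


result = compare(lsc, campidanese)
print(result)
```

With these toy lexicons, è is a common type of LSC and Campidanese that shares the form est, contributes esti as a different form, and is recorded as one "1 to 2" pattern. This reading is consistent with Table 3, where the same-form and different-form counts of a pair can add up to more than its number of common types.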


Soria, C., Russo, I., Quochi, V., Hicks, D., Gurrutxaga,
  A., Sarhimaa, A. and Tuomisto, M. 2016. Fostering
  digital representation of EU regional and minority
  languages: the Digital Language Diversity Project.
  Proceedings of LREC-2016. Portoroz, Slovenia.
Virdis, M. 1988. Sardisch: Areallinguistik / Aree
  linguistiche. Holtus G., Metzeltin, M., Schmitt,
  C. (eds.), Lexikon der Romanistischen Linguistik 4,
  Tübingen, Max Niemeyer, pp. 897-913.
Wagner, M. L. 1997. La lingua Sarda. Storia, spirito e
  forma. Nuoro, Ilisso.