<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sardinian on Facebook: Analysing Diatopic Varieties through Translated Lexical Lists</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Irene Russo ILC CNR Pisa</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Simone Pisano Università Guglielmo Marconi Roma</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The presence of regional and minority languages on digital media is an indicator of their vitality. In this paper, we investigate quantitative aspects of the use of the Sardinian language on Facebook, focusing on the co-existence of diatopic varieties. We extracted linguistic data from public pages and, by translating the most frequent words, identified similarities and differences between varieties.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Everyday life involves an increasingly extensive use of digital devices that rely on language; for this reason, the usability of a language on digital devices signals that the language is modern, relevant to current lifestyles and capable of meeting the needs of the 21st century. A positive correlation between presence in new technologies and better appreciation of a language has been repeatedly observed in the literature; see for instance
        <xref ref-type="bibr" rid="ref5">(Eisenlohr, 2004)</xref>
        and
        <xref ref-type="bibr" rid="ref3">(Crystal, 2010)</xref>.
        Regional and minority languages (RMLs henceforth) are usually very poorly represented digitally
        <xref ref-type="bibr" rid="ref8">(Soria et al., 2016)</xref>.
      </p>
      <p>Since the poor digital representation of regional and minority languages further limits their usability on digital media and devices, it is extremely important to support every bottom-up effort that can boost the quantity of available digital content. If RMLs continue to be perceived as marginal and of limited applicability, their attractiveness diminishes.</p>
      <p>An increase in the quantity of digital content available online represents an opportunity for regional and minority languages today. Online speakers can make visible the existence of a community that uses the language to interact; they can use online communication to converge toward a standard; and they can guide less skilled speakers toward a better mastery of the rules of the language, especially when the language is not formally included in education. From the perspective of computational linguistics, the presence of digital content written in RMLs means that corpora can be built for them and basic tools (lemmatizers, spell checkers, lexicons, etc.) can be developed.</p>
      <p>The presence of RMLs on digital media and their usability through digital devices are often limited to instances of digital activism and/or cultural initiatives focused on the preservation of cultural heritage.</p>
      <p>In this paper we present the first study we are aware of on the use of Sardinian on social networks (more specifically, Facebook). Sardinian is a minority language of Italy characterised by the coexistence of varieties and by the difficulty the promoted standard has in emerging as a unifying factor. Our starting hypothesis concerned the vitality on social networks of a language that is mainly spoken. With the help of a Sardinian linguist, we identified a small set of public Facebook groups where specific varieties of Sardinian are chosen as the main language, plus groups where a generic, not further defined Sardinian is used to communicate. We extracted messages from these pages and created a frequency lexicon for each variety. The 150 most frequent words were translated by a Sardophone expert linguist with the aim of finding differences and commonalities between varieties. This preliminary analysis is the first step toward the use of computational linguistics methodologies in the promotion of a standard for Sardinian based on quantitative data.
</p>
    </sec>
    <sec id="sec-3">
      <title>2 Sardinian Today: Main Varieties and Standardization Efforts</title>
      <p>
        Sardinian is an autonomous Romance language spoken on the island of Sardinia. According to
        <xref ref-type="bibr" rid="ref6">Lupinu et al. (2007)</xref>,
        it is known by approximately 68.4% of the population of the island. Ethnologue<sup>1</sup> lists four varieties for Sardinian: Northwestern Sardinian or Sassarese (ca. 100,000 speakers), Campidanese (ca. 500,000 speakers), Central Sardinian or Logudorese (ca. 500,000 speakers) and Gallurese (ca. 100,000 speakers). The most important lexical, phonological and morphological differences within Sardinian are found between Central-Southern and Central-Northern dialects.
      </p>
      <p>Scholars usually divide Sardinian into two main varieties, Logudorese and Campidanese, the former spoken in the north and centre of the island and the latter in the south. Logudorese and Campidanese can be related to two different pre-existing written standards: the so-called Logudorese (or Logudorese illustre) was used for the first time in a short poem at the end of the 15th century (Manca, 2002), whereas what is known as Campidanese was the language of some religious plays at the end of the 17th century (De Martini Abdullah Luca, 2006).</p>
      <p>
        Today, Sardinian lacks a generally agreed standard variety, although standardization efforts have characterised the recent history of the Region. The first attempt to introduce a written system based on an integration of phonetic, lexical and morphological features of modern Sardinian varieties was made in 2001, when the basic rules of LSU (Limba Sarda Unificada, Unified Sardinian Language) were presented
        <xref ref-type="bibr" rid="ref1">(Blasco Ferrer et al., 2001)</xref>.
      </p>
      <sec id="sec-3-1">
        <p><sup>1</sup> www.ethnologue.com</p>
        <p>This proposal was sharply criticised by some sectors of public opinion, and strong disapproval came even from some native speakers, especially from the South, who considered this standard too different from the language they spoke. It never became a model of official Sardinian.</p>
        <p>In 2006, another model of written language was made official by Regional Committee resolution no. 16/14. This standard, called LSC (Limba Sarda Comuna, Common Sardinian Language)<sup>2</sup>, also made the effort of taking into account the dialects of the transition region of the centre mentioned earlier. Although the regional administration recommended its use for written public documents, it is still reluctantly accepted by some speakers, who perceive it as too distant from the varieties they speak.</p>
        <p>In 2010, the Provincial Council of Cagliari took a different course, choosing with Provincial Committee resolution no. 17 a linguistic norm<sup>3</sup> based on the literary language of Southern poets and writers, in order to draw up acts, documents and even textbooks for primary school children.</p>
        <p>All these standardization efforts, whether politically guided or emerging bottom-up, clearly show that Sardinian speakers are aware of the role of a standard orthography and grammar in the vitality and survival of their language. On the one hand, they want to promote the idea of a single language as a matter of identity; on the other, they do not want to lose local peculiarities by adopting standard rules that inevitably hide some local differences.</p>
        <p>
          Social media are widely used by Sardinian speakers and offer an interesting setting for written but informal use of the language. An in-depth analysis of the type of language used by Sardinian speakers on social media is still missing. Certainly, the use of everyday Sardinian in spoken and written (online) informal communication is a sign of the vitality of the language. Interaction is a powerful instrument for standardization, and the interactive modality offered by social media could reveal the emergence of coordination strategies toward a standard in the speakers' community as a natural need
          <xref ref-type="bibr" rid="ref2">(Burghardt et al., 2016)</xref>.
          To check this hypothesis, we started to analyse how different varieties of Sardinian are used on Facebook. According to the preliminary data of a recent survey, Facebook is the social medium most used by Sardinian speakers, and one where Sardinian is actively and extensively used<sup>4</sup>.
        </p>
        <p><sup>2</sup> Regione Autonoma della Sardegna (2006), Limba Sarda Comuna. Norme linguistiche di riferimento a carattere sperimentale per la lingua scritta dell'Amministrazione regionale, Cagliari, Regione Autonoma della Sardegna.</p>
        <p><sup>3</sup> Arrègulas po s'ortografia, sa fonètica, sa morfologia e su fueddàriu de sa bariedadi Campidanesa de sa lìngua sarda (Rules for the orthography, phonetics, morphology and vocabulary of the Campidanese variety of the Sardinian language).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 Data Extraction and Analysis</title>
      <p>We selected public pages and communities on Facebook that are rich in content and interactions between users. With the help of a Sardinian linguist we identified four mutually exclusive sets:
        <list list-type="bullet">
          <list-item><p>pages where people communicate in LSC;</p></list-item>
          <list-item><p>pages where people communicate in Sardinian without further specification of the chosen variety;</p></list-item>
          <list-item><p>pages where people communicate in Campidanese;</p></list-item>
          <list-item><p>pages where people communicate choosing a local variety (in our case Nugoresu, a local variety of Logudorese).</p></list-item>
        </list>
      </p>
      <p>All messages were extracted from the JSON of the pages obtained through the Facebook API. Texts were lowercased and tokenized by splitting on whitespace. Four frequency lists were created, and emoticons and symbols were deleted. The 150 most frequent words were translated into Italian by a Sardinian linguist, who also provided PoS and morphological annotation plus all the available translations in the case of polysemous words. We left Italian words in these lists because every cleaning procedure (lists of Italian words, PoS tagging for Italian, etc.) was risky: very frequent words in Sardinian can also be found in Italian (e.g. a, chi, bonus, cosa) with a different meaning. Table 1 reports basic statistics about the public pages and communities in the four sets listed above. Active users are those who wrote at least one message on the page. The number of active users and messages varies across sets, and it was not possible to obtain a balanced sample.</p>
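      <p>The preprocessing just described (lowercasing, whitespace tokenization, removal of emoticons and symbols, frequency counting) can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names are hypothetical, the rule used to discard emoticons and symbols (drop any token without a letter) is our guess at a reasonable filter rather than the procedure actually applied, and the sample messages are invented.</p>
      <preformat>
```python
from collections import Counter


def tokenize(message):
    """Lowercase a message and split it on whitespace."""
    return message.lower().split()


def is_word(token):
    """Keep a token only if it contains at least one letter.

    Assumption: this stands in for the emoticon and symbol removal
    step, whose exact rules are not specified.
    """
    return any(ch.isalpha() for ch in token)


def frequency_list(messages, top_n=150):
    """Return the top_n most frequent word forms with their counts."""
    counts = Counter()
    for message in messages:
        counts.update(tok for tok in tokenize(message) if is_word(tok))
    return counts.most_common(top_n)


# Hypothetical toy messages, not taken from the collected data.
sample = ["Sa limba sarda est una limba :)", "Su sardu est una limba !!!"]
for word, count in frequency_list(sample, top_n=3):
    print(word, count)
```
      </preformat>
      <p>Because tokens are split on whitespace only, punctuation attached to a word stays part of the word form, which is one reason the resulting lists need manual inspection by a linguist.</p>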
      <p><sup>4</sup> Preliminary data of the DLDP Survey (www.dldp.eu) “Su Sardu: una limba digitale?”. In July 2016, Facebook appears to be used by 98.1% of the respondents. Of those, 44% use Sardinian for writing and reading posts and messages, and 32.5% only for reading.</p>
      <p>Table 2 reports the number of tokens and types for the four sets of Facebook groups analysed. In Table 3 each possible pair of varieties is compared by checking the overlap of their translations into Italian. The second column reports how many Italian types two varieties have in common. For example, among the 150 most frequent LSC word forms and the 150 most frequent Sardu word forms, 61 words have the same Italian translation. The third column contains the number of words with the same word form in the two varieties compared, e.g. the Italian adjective <italic>grande</italic> has the same word form (<italic>mannu</italic>) in Nugoresu and Campidanese. This is a first attempt to understand whether two varieties are orthographically close, considering the orthographic forms of the analysed words. We also report the number of content words found in each pair, because we believe that in the future the overlap at the orthographic level should be analysed taking into account the distinction between content and function words. The fourth column contains the number of word forms related to the common types that differ between the two varieties; e.g. for the Italian word <italic>è</italic>, third person singular of the verb to be in the present tense, LSC has just one word form, <italic>est</italic>, while Campidanese has <italic>est</italic> and <italic>esti</italic>. In this case <italic>esti</italic> is counted as a different form and is included in the table under the fourth column.</p>
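      <p>The pairwise comparison behind Table 3 can be sketched as follows. We assume, for illustration only, that each variety is stored as a mapping from an Italian translation to the set of word forms that render it; the function name, the dictionary layout and the choice to count each differing form once per type are our assumptions. The toy entries use only the two examples mentioned in the text.</p>
      <preformat>
```python
def compare_varieties(forms_a, forms_b):
    """Compare two varieties through their shared Italian translations.

    forms_a and forms_b map an Italian translation (a type) to the set
    of word forms that render it in each variety.
    """
    common_types = sorted(set(forms_a).intersection(forms_b))
    # Types for which the two varieties share at least one word form.
    same_form = [t for t in common_types
                 if forms_a[t].intersection(forms_b[t])]
    # Word forms of common types attested in only one of the two varieties.
    # Assumption: each such form is counted once per type.
    different = sum(len(forms_a[t].symmetric_difference(forms_b[t]))
                    for t in common_types)
    return {
        "common_types": len(common_types),
        "types_with_same_form": len(same_form),
        "different_word_forms": different,
    }


# Toy entries built only from the two examples given in the text.
lsc = {"è": {"est"}}
campidanese = {"è": {"est", "esti"}, "grande": {"mannu"}}
nugoresu = {"grande": {"mannu"}}

print(compare_varieties(lsc, campidanese))
print(compare_varieties(nugoresu, campidanese))
```
      </preformat>
      <p>In the first comparison the type <italic>è</italic> counts both as a type with a shared word form (<italic>est</italic>) and as the source of one differing form (<italic>esti</italic>), which mirrors how a single type can contribute to both the third and the fourth column.</p>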
      <p>Table 4 summarises for each pair how variability patterns are distributed, where pattern 1 to 2 means that one word form in variety a corresponds to two word forms in variety b. We know that the group Sardu contains data from more than one variety, and we plan a more detailed analysis as future work. For the moment we note a clear overlap, because speakers of LSC contribute posts and comments to pages where people communicate in Sardinian. For the same reason, when Sardu is one of the items in the pair we notice more variability patterns (see Table 4). Concerning the comparisons between LSC and the two main varieties, Campidanese and Logudorese (the latter represented in our data by the local variety Nugoresu), we found evidence of the distance between the two main varieties, which overlap by 41.5% in terms of word forms. LSC and Campidanese overlap by 64.2%, while LSC and Nugoresu overlap by 83%. LSC emerges as a variety that tried to set a linguistic common ground and achieved this result, even if there is a bias toward the Logudorese variety, one of</p>
      <p>In this paper we address the following open question: could a quantitative analysis of written data help the Sardinian community to find a common core (not specific to one variety) that could reinvigorate the idea of a standard? We plan future work on this issue, with the awareness that digital content on social media is both an opportunity and a challenge for this kind of analysis.</p>
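      <p>The variability patterns and overlap percentages discussed in this section can be illustrated with a short sketch. Two points are assumptions on our part: a pattern is labelled with the smaller number of word forms first, and the overlap percentage is taken to be the number of common types with a shared word form divided by the number of common types. The second assumption reproduces the figures reported for the LSC pairs from the Table 3 counts (43 of 67 for LSC and Campidanese, 54 of 65 for LSC and Nugoresu). Function names are hypothetical.</p>
      <preformat>
```python
from collections import Counter


def variability_patterns(forms_a, forms_b):
    """Count 'm to n' patterns over the Italian types shared by two varieties.

    Assumption: the smaller number of word forms comes first in the label,
    and types with exactly one form on each side are not counted.
    """
    patterns = Counter()
    for italian in set(forms_a).intersection(forms_b):
        sizes = sorted((len(forms_a[italian]), len(forms_b[italian])))
        if sizes != [1, 1]:
            patterns["{} to {}".format(sizes[0], sizes[1])] += 1
    return patterns


def overlap_percentage(same_form_types, common_types):
    """Share of common types with an identical word form, in percent."""
    return round(100 * same_form_types / common_types, 1)


# The est / esti example from the text yields one "1 to 2" pattern.
print(dict(variability_patterns({"è": {"est"}}, {"è": {"est", "esti"}})))

# Counts taken from Table 3 for the two LSC pairs discussed in the text.
print(overlap_percentage(43, 67))  # LSC and Campidanese
print(overlap_percentage(54, 65))  # LSC and Nugoresu
```
      </preformat>
      <p>With these counts the first ratio rounds to 64.2 and the second to 83 once expressed as a whole number, in line with the percentages given for LSC.</p>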
      <p>This paper is a first analysis of diatopic varieties of Sardinian through orthographic comparisons of word forms with the same meaning. Thanks to translated lists it was possible to look at commonalities and differences between varieties. Social media are a source of real data about language use and the best observatory for regional and minority languages. For Sardinian, Facebook offers the possibility to test the distance between the proposed orthographic standard and the existing varieties. We will test the interplay between varieties with other methodologies to measure the distance and to find usage patterns (e.g. Levenshtein distance for similar words).</p>
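      <p>As an indication of what such a measure could look like, the following sketch computes the standard Levenshtein (edit) distance between two word forms with dynamic programming. It is not part of the analysis reported here, and the function name is hypothetical.</p>
      <preformat>
```python
def levenshtein(a, b):
    """Return the edit distance between two word forms.

    Counts the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b.
    """
    # Distances between the empty prefix of a and every prefix of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# Word forms mentioned in the text.
print(levenshtein("est", "esti"))
print(levenshtein("mannu", "mannu"))
```
      </preformat>
      <p>For the pair <italic>est</italic> and <italic>esti</italic> the distance is one insertion, so a measure of this kind would treat them as close variants rather than as unrelated forms.</p>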
      <p>This work is being carried out in the framework of the project DLDP (Digital Language Diversity Project, http://www.dldp.eu). DLDP is a three-year project funded under the Erasmus+ programme. It aims at addressing the problem of the low digital representation of EU regional and minority languages by giving their speakers the intellectual and practical skills to create, share, and reuse online digital content. DLDP fully embraces a bottom-up approach to language revitalization by addressing the speakers' cognitive and practical skills as the cornerstone of effective revitalization initiatives.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is partially funded by the Erasmus+ DLDP Project (Grant Agreement no. 2015-1-IT02-KA204-015090). The opinions expressed reflect only the authors' view, and the Erasmus+ National Agency and the Commission are not responsible for any use that may be made of the information contained.</p>
      <table-wrap id="tab-3">
        <label>Table 3</label>
        <caption>
          <p>Overlap of Italian translations between pairs of varieties, based on the 150 most frequent word forms of each variety.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Pair</th><th>Common types</th><th>Types with same word forms</th><th>Types with different word forms</th></tr>
          </thead>
          <tbody>
            <tr><td>LSC - Sardu</td><td>61</td><td>60 (21 content words)</td><td>32 (11 content words)</td></tr>
            <tr><td>LSC - Campidanese</td><td>67</td><td>43 (14 content words)</td><td>44 (17 content words)</td></tr>
            <tr><td>LSC - Nugoresu</td><td>65</td><td>54 (14 content words)</td><td>39 (17 content words)</td></tr>
            <tr><td>Sardu - Campidanese</td><td>65</td><td>47 (15 content words)</td><td>64 (26 content words)</td></tr>
            <tr><td>Sardu - Nugoresu</td><td>70</td><td>64 (16 content words)</td><td>65 (33 content words)</td></tr>
            <tr><td>Campidanese - Nugoresu</td><td>81</td><td>34 (12 content words)</td><td>82 (27 content words)</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Table 4: Distribution of variability patterns (1 to 2, 1 to 3, 1 to 4) for each pair of varieties. Recovered cell values, in the row order of Table 3: 8, 3, 7, 5, 8, 2; 1, 0, 1, 5, 5, 1.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation><string-name><surname>Blasco Ferrer</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Bolognesi</surname>, <given-names>R.</given-names></string-name> et al. <year>2001</year>. <source>Limba Sarda Unificada. Sintesi delle norme di base: ortografia, fonetica, morfologia, lessico</source>. Cagliari, Regione Autonoma della Sardegna.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation><string-name><surname>Burghardt</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Granvogl</surname>, <given-names>D.</given-names></string-name> and <string-name><surname>Wolff</surname>, <given-names>C.</given-names></string-name> <year>2016</year>. <article-title>Creating a Lexicon of Bavarian Dialect by Means of Facebook Language Data and Crowdsourcing</article-title>. <source>Proceedings of LREC-2016</source>. Portoroz, Slovenia.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation><string-name><surname>Crystal</surname>, <given-names>D.</given-names></string-name> <year>2010</year>. <source>Language Death</source>. Cambridge University Press.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation><string-name><surname>De Martini Abdullah Luca</surname></string-name> (ed.) <year>2001</year>. <source>Libro de Comedias (by Antonio Maria da Esterzili)</source>. Cagliari, Cuec.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation><string-name><surname>Eisenlohr</surname>, <given-names>P.</given-names></string-name> <year>2004</year>. <article-title>Language revitalization and new technologies: Cultures of electronic mediation and the refiguring of communities</article-title>. <source>Annual Review of Anthropology</source>. <volume>18</volume>(<issue>3</issue>): <fpage>339</fpage>-<lpage>361</lpage>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation><string-name><surname>Lupinu</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Mongili</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Oppo</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Spiga</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Perra</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Valdes</surname>, <given-names>M.</given-names></string-name> <year>2007</year>. <source>Le lingue dei sardi: una ricerca sociolinguistica</source>. Assessorato alla Pubblica istruzione, beni culturali, informazione, spettacolo e sport, Regione Autonoma della Sardegna.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation><string-name><surname>Manca</surname>, <given-names>D.</given-names></string-name> (ed.) <year>2002</year>. <source>Sa Vitta et sa Morte, et Passione de sanctu Gavinu, Prothu et Januariu (by Antonio Cano)</source>. Cagliari, Cuec.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation><string-name><surname>Soria</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Russo</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Quochi</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Hicks</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Gurrutxaga</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Sarhimaa</surname>, <given-names>A.</given-names></string-name> and <string-name><surname>Tuomisto</surname>, <given-names>M.</given-names></string-name> <year>2016</year>. <article-title>Fostering digital representation of EU regional and minority languages: the Digital Language Diversity Project</article-title>. <source>Proceedings of LREC-2016</source>. Portoroz, Slovenia.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation><string-name><surname>Virdis</surname>, <given-names>M.</given-names></string-name> <year>1988</year>. <article-title>Sardisch: Areallinguistik / Aree linguistiche</article-title>. <string-name><surname>Holtus</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Metzeltin</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Schmitt</surname>, <given-names>C.</given-names></string-name> (eds.), <source>Lexikon der Romanistischen Linguistik</source> <volume>4</volume>, Tübingen, Max Niemeyer, pp. <fpage>897</fpage>-<lpage>913</lpage>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation><string-name><surname>Wagner</surname>, <given-names>M. L.</given-names></string-name> <year>1997</year>. <source>La lingua sarda. Storia, spirito e forma</source>. Nuoro, Ilisso.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>