<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Data Cloud: Phenomenal Cosmic Powers... Itty Bitty Quality Space!</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pasquale Esposito</string-name>
          <email>pasesposito@unisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Angela Pellegrino</string-name>
          <email>mapellegrino@unisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vittorio Scarano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriele Tuozzo</string-name>
          <email>gtuozzo@unisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Linguistic Linked Open Data, Quality Assessment</institution>
          ,
          <addr-line>Accessibility, Contextual dimensions</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università degli Studi di Salerno</institution>
          ,
          <addr-line>via Giovanni Paolo II, 132, 84084 Fisciano (SA)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>The Linguistic Linked Open Data movement aims to model linguistic data according to the Semantic licensing, and accessibility, to identify potentialities and limitations that limit its utility and exploitation.</p>
      </abstract>
      <kwd-group>
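        <kwd>Linguistic Linked Open Data</kwd>
        <kwd>Quality Assessment</kwd>
        <kwd>Accessibility</kwd>
        <kwd>Contextual dimensions</kwd>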
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        In the rapidly evolving landscape of linguistic research, the advent of Linguistic Linked Open
Data (LLOD) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has catalyzed a paradigm shift, heralding an era of unprecedented collaboration
and knowledge exchange among linguists and the Semantic Web community. Spearheaded
by pioneering efforts such as the Open Linguistics Working Group [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] and initiatives like
LingHub [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], researchers are harnessing the power of linked data to create a vast interconnected
network of linguistic resources. Furthermore, LLOD has been successfully exploited in Natural
Language Processing tasks, demonstrating its potential to drive innovations at the
intersection of linguistics and web technologies [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
      </p>
      <p>
        However, this endeavor is not without its challenges. The LLOD ecosystem is characterized
by a rich tapestry of resources, ranging from traditional linguistic databases to encyclopedic
knowledge bases like DBpedia and Wikidata, leading to heterogeneity in both content and
structure [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Additionally, the accessibility of some resources remains a hurdle, with certain
datasets being unavailable or inadequately represented. As a result, estimating and monitoring
the quality of LLOD is crucial. This poster paper documents the quality assessment of
the LLOD Cloud, with the aim of identifying its potential and directions for improvement.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Linguistic Linked Open Data Quality Assessment</title>
      <p>This article reports the quality assessment of the LLOD Cloud in terms of accessibility, use of
open licenses, and amount of data. The LLOD Cloud<sup>1</sup> counted more than 200 Knowledge Graphs (KGs) in June 2024.
A preliminary evaluation of linguistic datasets modeled according to the LOD principles
and attached to a scientific contribution indexed by Scopus showed that several resources
are still missing from the Cloud. Consistently, the LLOD Cloud diagram is
explicitly declared to be an ongoing project, inspired by the LOD cloud diagram authored by
Richard Cyganiak and Anja Jentzsch, that includes open, available, and interlinked linguistic
resources. As a consequence, the diagram is expected to incorporate an increasing
number of resources over time.</p>
      <p>At the current stage, the LLOD Cloud is organized into the following categories: corpora,
lexicons&amp;dictionary, terminologies, thesauri &amp; Knowledge Base (KB), linguistic data categories,
linguistic resource metadata, typological database (DB), and other. Categories are not balanced, as can be
observed in Column # in Table 1. Moreover, a considerable portion
of the Cloud is categorized as Other, which raises the question of whether the categorization is
adequate and detailed enough. Further studies are required to verify whether the current categories
are aligned with linguists’ expectations.</p>
      <p>Methodology. The data used for the quality assessment are retrieved from publicly accessible
pages attached to resources published within the LLOD Cloud. We downloaded all the
resources as a single JSON file and, for each KG, retrieved the following fields: title; keywords, which identifies
the LLOD category used to classify the resource; sparql, which reports the SPARQL
endpoint, if any; full_download and other_download, which list any download format attached
to the resource; triples, which models the amount of data; and license. Besides returning
the links to the SPARQL endpoint and the download format(s), the LLOD Cloud also returns the
status of each link. As a result, we can identify all the resources attached to a working link.</p>
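      <p>As a rough illustration of the retrieval step described above, the following minimal sketch reads the listed fields from a JSON document and keeps only the links reported as working. The nested layout (lists of link objects with access_url and status keys), the status value "OK", and the two sample records are assumptions made for illustration; they are not the confirmed schema of the LLOD.json file.</p>
      <preformat><![CDATA[
```python
import json

# Hypothetical two-record sample. The field names come from the text above;
# the nested layout and the "OK"/"FAIL" status values are assumptions.
RAW = """
{
  "kg-a": {
    "title": "Example corpus",
    "keywords": ["corpus"],
    "sparql": [{"access_url": "http://example.org/sparql", "status": "OK"}],
    "full_download": [],
    "other_download": [{"access_url": "http://example.org/dump.nt", "status": "FAIL"}],
    "triples": 1200,
    "license": "https://creativecommons.org/licenses/by/4.0/"
  },
  "kg-b": {
    "title": "Example lexicon",
    "keywords": ["lexicon"],
    "sparql": [],
    "full_download": [{"access_url": "http://example.org/lexicon.ttl", "status": "OK"}],
    "other_download": [],
    "triples": 300,
    "license": ""
  }
}
"""

LINK_FIELDS = ("sparql", "full_download", "other_download")


def working_links(record):
    """Map each link field to the URLs whose status is marked as working."""
    return {
        field: [
            link["access_url"]
            for link in record.get(field, [])
            if link.get("status") == "OK"
        ]
        for field in LINK_FIELDS
    }


resources = json.loads(RAW)
for key, record in resources.items():
    # Print the identifier, the title, and the working links of each resource.
    print(key, record["title"], working_links(record))
```
]]></preformat>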
      <p>
        Starting from the quality dimensions defined by Zaveri et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the analysis
assesses the LLOD Cloud in terms of availability and licensing, which belong
to the accessibility dimensions category, and amount of data, which belongs to the contextual
dimensions category. We consider a resource accessible if it is attached to at least one download
format or a working SPARQL endpoint. Scrutinizing all the licenses attached to LLOD resources, we
manually identified those recognized as open, such as the Apache License<sup>2</sup>, the MIT
license<sup>3</sup>, or the Creative Commons licenses<sup>4</sup>. The amount of data dimension corresponds to the
number of triples directly returned by the LLOD Cloud.
      </p>
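      <p>The accessibility and licensing criteria above could be expressed as simple predicates. This is a minimal sketch under stated assumptions: the record layout, the "OK" status value, and the OPEN_LICENSE_MARKERS allow-list are hypothetical stand-ins for the manual license review described in the text, and download links are required to be working, which is one possible reading of the accessibility criterion.</p>
      <preformat><![CDATA[
```python
# Hypothetical allow-list of URL fragments for licenses treated as open.
# The paper identifies open licenses manually; this tuple is an assumption.
OPEN_LICENSE_MARKERS = (
    "apache.org/licenses",
    "opensource.org/license/mit",
    "creativecommons.org/licenses",
)


def has_working_link(record, fields):
    """True if any link under the given fields is reported as working."""
    return any(
        link.get("status") == "OK"
        for field in fields
        for link in record.get(field, [])
    )


def is_accessible(record):
    """Accessible: a working download link or a working SPARQL endpoint."""
    return has_working_link(record, ("full_download", "other_download", "sparql"))


def has_sparql(record):
    """True if the resource exposes a working SPARQL endpoint."""
    return has_working_link(record, ("sparql",))


def is_open(record):
    """True if the license URL matches one of the allow-listed fragments."""
    license_url = (record.get("license") or "").lower()
    return any(marker in license_url for marker in OPEN_LICENSE_MARKERS)


# Hypothetical record: working endpoint, MIT license, no downloads.
example = {
    "sparql": [{"access_url": "http://example.org/sparql", "status": "OK"}],
    "full_download": [],
    "other_download": [],
    "license": "https://opensource.org/license/MIT",
}
print(is_accessible(example), has_sparql(example), is_open(example))
```
]]></preformat>
      <p>Keeping the license check as a separate predicate makes it easy to swap the allow-list for a manually curated list without touching the accessibility logic.</p>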
      <p>Results. The Python script that computes the quality dimensions and the LLOD.json file
are publicly available online<sup>5</sup>. Quality assessment results are reported in Table 1,
clustered by LLOD category. For each category, we report:</p>
      <list list-type="bullet">
        <list-item><p>column # - the number of LLOD resources in that category;</p></list-item>
        <list-item><p>column OL - the number of LLOD resources attached to an Open License (OL);</p></list-item>
        <list-item><p>column A - the number of LLOD resources attached to a working download mechanism,
including a working SPARQL endpoint;</p></list-item>
        <list-item><p>column SE - the number of LLOD resources provided with a working SPARQL endpoint,
which is naturally a subset of the accessible resources;</p></list-item>
        <list-item><p>column OL&amp;A - the number of openly accessible LLOD resources, meaning that they have
a working download mechanism and are attached to an OL;</p></list-item>
        <list-item><p>column OL&amp;SE - the number of LLOD resources openly accessible via a working SPARQL
endpoint;</p></list-item>
        <list-item><p>data columns reporting the total amount of data (column Tot.), the amount accessible via a
working download mechanism (column A), via a working SPARQL endpoint (column SE),
via a working download mechanism and provided with an open license (column OL&amp;A),
and via a working SPARQL endpoint and provided with an open license (column OL&amp;SE).</p></list-item>
      </list>
      <p>Discussion &amp; Conclusive Thoughts. LLOD resources are not uniformly distributed over the
categories: corpora, lexicons&amp;dictionary, and other cover the majority of the LLOD Cloud. Fewer than
half of the LLOD resources are attached to an OL and, in some cases, the license version is outdated, as
happens when the CC BY-NC 2.0 license<sup>6</sup> is used instead of the updated 4.0 version. This overall picture
largely holds within each category, with a few exceptions: in the corpora category,
60% of resources are attached to an OL, while resources belonging to the linguistic resource
metadata and typological DB categories have no OL at all. 40% of the resources are accessible
via a working download mechanism, but only 5% of them are attached to a working SPARQL
endpoint. The situation is even worse when we focus on openly accessible LLOD resources:
they drop to 14% when any download mechanism is considered, and to only four resources when
a working SPARQL endpoint is required. As a result, an
extraordinary amount of data is left untapped. In the worst case, the
lexicons&amp;dictionary category sums up to 1G triples, of which 322M can be downloaded, only
4M can be openly and freely reused, and none can be accessed via SPARQL endpoints.
In summary, the LLOD Cloud has huge potential in terms of variety of resources and
amount of data, but limited accessibility and the rare use of open licenses hinder its exploitation.</p>
      <p><sup>1</sup>LLOD Cloud: https://linguistic-lod.org
<sup>2</sup>Apache License: https://www.apache.org/licenses/LICENSE-2.0
<sup>3</sup>MIT License: https://opensource.org/license/MIT
<sup>4</sup>Creative Commons Licenses: https://creativecommons.org/licenses
<sup>5</sup>GitHub repository: https://github.com/isislab-unisa/LLODCloudQuality. Persistent DOI on Zenodo: https://doi.org/10.5281/zenodo.13449868
<sup>6</sup>CC BY-NC 2.0: https://creativecommons.org/licenses/by-nc/2.0</p>
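      <p>The per-category columns described in the Results (#, OL, A, SE, OL&amp;A, OL&amp;SE, and the matching data columns) could be computed along the following lines. This is only a sketch, not the authors' published script: the RESOURCES records are hypothetical and assumed to be already classified as open, accessible, or exposed via SPARQL, and the triple total stored under "#" plays the role of column Tot.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

# Hypothetical, already-classified records (not taken from the LLOD Cloud).
RESOURCES = [
    {"category": "corpora", "open": True, "accessible": True, "sparql": True, "triples": 1000},
    {"category": "corpora", "open": True, "accessible": False, "sparql": False, "triples": 500},
    {"category": "other", "open": False, "accessible": True, "sparql": False, "triples": 200},
]

# One predicate per reported column; "#" matches every resource.
COLUMNS = {
    "#": lambda r: True,
    "OL": lambda r: r["open"],
    "A": lambda r: r["accessible"],
    "SE": lambda r: r["sparql"],
    "OL&A": lambda r: r["open"] and r["accessible"],
    "OL&SE": lambda r: r["open"] and r["sparql"],
}


def summarize(resources):
    """Return per-category resource counts and triple totals for each column."""
    counts = defaultdict(lambda: dict.fromkeys(COLUMNS, 0))
    triples = defaultdict(lambda: dict.fromkeys(COLUMNS, 0))
    for resource in resources:
        category = resource["category"]
        for name, predicate in COLUMNS.items():
            if predicate(resource):
                counts[category][name] += 1
                triples[category][name] += resource["triples"]
    return dict(counts), dict(triples)


counts, triples = summarize(RESOURCES)
for category in counts:
    # One row per category: resource counts, then triple totals.
    print(category, counts[category], triples[category])
```
]]></preformat>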
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chiarcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nordhoff</surname>
          </string-name>
          ,
          <article-title>Linking linguistic resources: Examples from the open linguistics working group</article-title>
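          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Chiarcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nordhoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          (Eds.),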
          <source>Linked Data in Linguistics: Representing Language Data and Metadata</source>
          , Springer, Heidelberg,
          <year>2012</year>
          , pp.
          <fpage>201</fpage>
          -
          <lpage>216</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I.</given-names>
            <surname>Aldabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chiarcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gracia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Roeder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Villegas</surname>
          </string-name>
          ,
          <article-title>Towards a linguistic linked open data cloud: The open linguistics working group</article-title>
          ,
          <source>Linked Open Data-Creating Knowledge Out of Interlinked Data</source>
          <volume>948</volume>
          (
          <year>2012</year>
          )
          <fpage>19</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Aldabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chiarcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gracia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Roeder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Villegas</surname>
          </string-name>
          ,
          <article-title>The open linguistics working group</article-title>
          ,
          <source>Linked Open Data-Creating Knowledge Out of Interlinked Data</source>
          <volume>948</volume>
          (
          <year>2014</year>
          )
          <fpage>19</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chiarcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nordhoff</surname>
          </string-name>
          ,
          <article-title>Linghub: A linked data based portal supporting the discovery of language resources</article-title>
          ,
          <source>in: Proceedings of the ACL 2012 System Demonstrations</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>81</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chiarcos</surname>
          </string-name>
          ,
          <article-title>Linguistic linked open data for speech processing</article-title>
          ,
          <source>Journal of the International Phonetic Association</source>
          <volume>44</volume>
          (
          <year>2014</year>
          )
          <fpage>103</fpage>
          -
          <lpage>120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chiarcos</surname>
          </string-name>
          ,
          <article-title>Ll(o)d and nlp perspectives on semantic change for humanities research</article-title>
          ,
          <source>Journal of Language Technology and Computational Linguistics</source>
          <volume>27</volume>
          (
          <year>2012</year>
          )
          <fpage>21</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chiarcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nordhoff</surname>
          </string-name>
          ,
          <article-title>Observing lod: Its knowledge domains and the varying behavior of ontologies across them</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>32</volume>
          (
          <year>2015</year>
          )
          <fpage>18</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maurino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pietrobon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <article-title>Quality assessment for linked data: A survey</article-title>
          ,
          <source>Semantic Web</source>
          <volume>7</volume>
          (
          <year>2016</year>
          )
          <fpage>63</fpage>
          -
          <lpage>93</lpage>
          . doi:10.3233/SW-150175.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>