Linghub: a Linked Data Based Portal Supporting the Discovery of Language Resources

John P. McCrae (Insight Centre for Data Analytics, National University of Ireland, Galway, Galway, Ireland; john@mccr.ae)
Philipp Cimiano (Cognitive Interaction Technology, Cluster of Excellence, Bielefeld University, Bielefeld, Germany; cimiano@cit-ec.uni-bielefeld.de)

ABSTRACT
Language resources are an essential component of any natural language processing system, and such systems can only be applied to new languages and domains if appropriate resources can be found. Currently, the task of finding language resources for a particular task or application is complicated by the fact that records about such resources are stored in different repositories with different models, different quality and different search mechanisms. To remedy this situation, we present Linghub, a new portal that aggregates and indexes data from a range of sources and repositories and applies the Linked Data principles to expose all the metadata under a common interface. Furthermore, we use faceted browsing and SPARQL queries to show how this can help to answer real user problems extracted from a mailing list for linguists.

Keywords
Linked Data, Language Resources, SPARQL, Faceted Browsing

1. INTRODUCTION
Language resources are essential for nearly all tasks in natural language processing (NLP), and in particular for the adaptation of resources and methods to new domains and languages. In order to use language resources for new purposes they must first be discovered, and this can only be done if there is a comprehensive list of all resources that may be available. To this end, there have been a number of projects that have attempted to collect such a catalogue using various methods and with differing degrees of data quality. We present a new portal, Linghub, that aims to integrate all these data from different sources by means of linked data and thus to create a website where all information about language resources can be included and queried using a common methodology. The goal of Linghub is thus to enable wider discovery of language resources for researchers in NLP, computational linguistics and linguistics.

Currently, two approaches to metadata collection for language resources can be distinguished. The first is a curatorial approach, in which a repository of language resource metadata is maintained by a cross-institution organization such as META-SHARE [7] or the CLARIN project's Virtual Language Observatory [17, VLO]. This approach is characterized by high-quality metadata entered by experts, at the expense of coverage. A collaborative approach, on the other hand, allows anyone to publish language resource metadata; examples of this are the LREMap [4] and Datahub (http://datahub.io). A process for controlling the quality of the entered metadata is typically lacking in such collaborative repositories, leading to lower-quality and inhomogeneous metadata resulting from free-text fields, user-provided tags and the lack of controlled vocabularies.

Given this difference, we wish to make data available from multiple sources in a homogeneous manner, and we saw the development of a new linked data portal as the primary method to achieve this. To this end we adopted a model based on the DCAT data model [10] along with properties from Dublin Core [9]. In addition, we used the RDF version [12] of the META-SHARE model [8] to provide metadata properties that are specific to language data and linguistic research. As such, in this paper we describe the creation of the largest collection of information about language resources and briefly describe its publication on the Web by means of linked data principles.

The rest of the paper is structured as follows: firstly, we describe related work in Section 2, then the collection and processing of the data in Section 3. Next, we describe the portal and how we envision users can access the data in Section 4, and examine how real user queries can be answered with Linghub in Section 5. Finally, we conclude in Section 6.
2. RELATED WORK
There have been several attempts to collect metadata about language resources, mostly associated with large infrastructure projects. CLARIN has been collecting resources under a project called the Virtual Language Observatory [17], using the Component Metadata Infrastructure [3, CMDI] to collect common metadata values from multiple sources. A similar project is META-SHARE [14] from the META-NET project, where language resources are collected and high-quality, manual entries are created for each record. Similarly, the Open Language Archives Community [2, OLAC] collects data from a number of sources, although the metadata collected is not itself open. Another related project called SHACHI has also collected some metadata [16]. There has also been an attempt to track language resources by assigning them an International Standard Language Resource Number (ISLRN), similar to the ISBN used to track books [5].

In contrast, some projects have instead collected data directly from the creators of the resources: for example, the LRE-Map [4] collects data from authors of papers submitted to conferences such as LREC. Similarly, Datahub collects resources directly from those submitted to the website, but focuses primarily on linked data resources.

Source       Records    Triples
Datahub          185     10,739
LRE-Map          682     10,650
META-SHARE     2,442    464,572
CLARIN VLO   144,138  3,605,196
All          147,447  4,091,157

Table 1: Size of Linghub datasets by source

3. DATASET
In order to ensure that all the data from the many sources can be queried in a homogeneous manner, we made sure that the metadata from all the repositories mentioned in Table 1 was available as RDF. In doing this, we aligned the proprietary schemas used in these repositories to well-known Semantic Web vocabularies and fixed existing modeling errors, such as using percent-encoded URIs for titles of resources or introducing URL links that would never resolve.

Two of our sources, LRE-Map and Datahub, were already available in RDF, so the conversion mainly involved developing an appropriate URL schema so that datasets were uniquely identified, thus avoiding collisions when uploading data into the Linghub portal. A number of quality issues were also fixed in this transformation, such as deciding whether property values should be literals or URIs, reducing the number of blank nodes and reusing existing metadata vocabularies such as VoID [1].

The other sources (CLARIN VLO and META-SHARE) were available in XML. We developed a custom converter for each of these resources, building on a transformation language similar to XSLT which we developed. For META-SHARE, this was a challenging task, as there were nearly a thousand unique tags defined and each one was examined to see if it was similar to an existing Semantic Web vocabulary; in fact, we ended up mapping to FOAF (http://xmlns.com/foaf/spec/), SWRC (http://ontoware.org/swrc/) and the Media Ontology (http://www.w3.org/TR/mediaont-10/). In the case of CLARIN, there was a significant difference between the XML schemas used by each contributing instance, with only a small common section giving the resource title and download link. We thus developed distinct mappings for the five largest institutes.

Two key issues emerge when collecting data from a heterogeneous set of sources such as ours. Firstly, the data is likely to be noisy and inconsistent in the properties it uses and, more importantly, in the values that these properties have. For example, languages may be represented by their English names or alternatively by codes such as the ISO 639 codes (http://www.iso.org/iso/home/standards/language_codes.htm).

Secondly, it happens relatively frequently that dataset descriptions are duplicated because they are contained in multiple source repositories (currently this affects 5.0% of resources). Furthermore, intra-repository duplicates also exist, resulting from the fact that in some repositories one metadata record is created for each language a resource is available in (this is the case for CLARIN, for instance, and affects 35.0% of all resources). In order to remove these duplications we used state-of-the-art word sense disambiguation techniques, including Babelfy [13], to identify common controlled vocabularies and duplicate entries. For property values we mapped to several existing resources, including LexVo [6] for languages and BabelNet for resource types. Duplicate entries were not removed from the dataset but instead were marked with the Dublin Core property "is replaced by". In the case that these entries were subsets of resources, the target of this link is a new combined record for the entire resource; in the case of duplicate records collected from distinct sources, we refer to the most complete record, that is, the record with the most triples. The harmonization process is described in more detail in McCrae et al. [11].

Currently, there is no direct method for users to provide metadata to the repository; however, it is foreseen that users could submit valid DCAT files to Linghub. We do note that Datahub allows any user to submit a dataset, and such datasets will quickly be picked up by Linghub and added to the repository in this manner.
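The duplicate-marking strategy above can be sketched in a few lines: among records judged to describe the same resource, the one with the most triples is treated as canonical and the others point to it via the Dublin Core "is replaced by" property. This is an illustrative reconstruction under stated assumptions, not Linghub's actual code; the record identifiers and the (subject, predicate, object) triple structure are assumptions.

```python
# Sketch of the duplicate-marking strategy: the record with the most
# triples is canonical; every other duplicate gains a dct:isReplacedBy
# link to it. Record ids and triple tuples here are illustrative.

def mark_duplicates(records):
    """records: dict mapping record id -> list of triples.
    Returns extra triples linking each non-canonical duplicate
    to the most complete record in the group."""
    # The canonical record is simply the one with the most triples.
    canonical = max(records, key=lambda rid: len(records[rid]))
    links = []
    for rid in records:
        if rid != canonical:
            links.append((rid, "dct:isReplacedBy", canonical))
    return links

dupes = {
    "lh:rec1": [("lh:rec1", "dct:title", "Igbo Corpus")],
    "lh:rec2": [("lh:rec2", "dct:title", "Igbo Corpus"),
                ("lh:rec2", "dct:language", "iso639:ibo")],
}
print(mark_duplicates(dupes))
# [('lh:rec1', 'dct:isReplacedBy', 'lh:rec2')]
```

Keeping the duplicates and linking them, rather than deleting them, preserves provenance: the original source record remains dereferenceable while queries can follow the link to the most complete description.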
4. THE LINGHUB PORTAL
In order to enable users to quickly and easily discover datasets, we set up a portal for browsing the data. Naturally, we set this up as a site that publishes the individual records as either RDF or HTML, with the actual content delivered to the client decided by means of content negotiation. We developed templates that render the RDF in a readable manner while still staying close to the data, so that users get a consistent view of a dataset record even if it came from a different original source and hence has very different properties. In addition, we provide a number of mechanisms by which users and automated agents can discover a dataset. For users, we allow resources to be discovered by means of faceted browsing, enabling users to select properties and their values.
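The content negotiation described above can be sketched as a simple dispatch on the HTTP Accept header: RDF media types get the machine-readable record, HTML media types (and unknown agents) get the rendered template. The media types are standard, but the dispatch logic itself is an illustrative assumption, not Linghub's actual server code; real negotiation would also honour q-values, which this sketch ignores.

```python
# Sketch of content negotiation for a record URL: the same address serves
# RDF to machine clients and HTML to browsers, based on the Accept header.
# Illustrative only; q-value weighting is deliberately omitted.

def choose_representation(accept_header):
    """Return 'rdf' or 'html' for a record request (first match wins)."""
    accepted = [part.split(";")[0].strip() for part in accept_header.split(",")]
    for media_type in accepted:
        if media_type in ("text/turtle", "application/rdf+xml", "application/ld+json"):
            return "rdf"
        if media_type in ("text/html", "application/xhtml+xml"):
            return "html"
    return "html"  # browsers and unknown agents get the readable view

print(choose_representation("text/turtle"))                      # rdf
print(choose_representation("text/html,application/xhtml+xml"))  # html
```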
We fixed the list of facet properties in advance to those that have been harmonized, so as not to overload the user with choices for properties that occur in only a few datasets, and also to enable the compilation of indexes that speed up page load times. In addition, the front page of Linghub contains a free-text search engine allowing users to query fields by property. This free-text search engine is powered by a separate index which includes not only the text of data properties but also the labels of URIs which appear as the values of object properties.
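The free-text index just described can be sketched as a small inverted index: literal values are tokenized directly, while URI-valued objects contribute the label of the URI rather than the URI string itself. This is a minimal sketch under stated assumptions; the label table, record structure and tokenization are all illustrative, not Linghub's implementation.

```python
# Sketch of the free-text index: literals are indexed as text, URI objects
# are indexed via their labels. LABELS is a hypothetical label lookup.

LABELS = {"iso639:ibo": "Igbo"}  # assumed label table for URI values

def build_index(records):
    """records: dict record id -> list of (kind, value) pairs, where kind
    is 'literal' or 'uri'. Returns token -> set of record ids."""
    index = {}
    for rid, pairs in records.items():
        for kind, value in pairs:
            text = value if kind == "literal" else LABELS.get(value, "")
            for token in text.lower().split():
                index.setdefault(token, set()).add(rid)
    return index

recs = {"lh:rec1": [("literal", "Igbo Corpus"), ("uri", "iso639:ibo")]}
print(sorted(build_index(recs)))  # ['corpus', 'igbo']
```

Indexing labels as well as literals is what lets a search for "Igbo" match a record whose language is recorded only as the URI iso639:ibo.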
Machine-based agents may access the endpoint by means of SPARQL querying, although the endpoint limits agents to a subset of the SPARQL query language. The goal of this is to ensure constant query time without overloading our server. The nature of SPARQL makes it very easy for users to write queries of a complexity that would not be easy to answer, and other sites have attempted to handle this by enforcing timeouts on SPARQL queries. In general we find this solution to be sub-optimal, as it means that queries may fail unpredictably if the server has many concurrent connections. Instead, we limit the complexity of the queries themselves by requiring that they have certain properties that make them easy to answer. These include:

1. A required limit on the number of results;
2. The predicate may not be a variable, thus limiting the number of results;
3. The query must be a 'tree', in that every triple must be connected from a single root node.

Furthermore, the SPARQL endpoint by default returns SPARQL-JSON results [15], so that the results may be easily consumed. This is based on the fact that many clients, notably client-side JavaScript in browsers, will not accept XML due to security concerns. Other clients may still obtain SPARQL-XML by supplying the appropriate header or parameter in the query.

Figure 1: A screenshot of the Linghub interface
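The three restrictions above can be checked on a pre-parsed query before execution. The sketch below assumes a query is represented as a list of (subject, predicate, object) patterns plus a LIMIT value, with variables written with a leading '?'; this representation and the checking code are illustrative, not the actual endpoint implementation.

```python
# Sketch of the three query restrictions: a required LIMIT, no variable
# predicates, and a tree shape (all triples reachable from a single root
# by following subject -> object links). Query representation is assumed.

def is_acceptable(triples, limit):
    # Rule 1: a result limit is required.
    if limit is None:
        return False
    # Rule 2: the predicate may not be a variable.
    if any(p.startswith("?") for _s, p, _o in triples):
        return False
    # Rule 3: the query must be a tree rooted at a single node.
    subjects = {s for s, _p, _o in triples}
    objects = {o for _s, _p, o in triples}
    roots = subjects - objects
    if len(roots) != 1:
        return False
    reachable = set(roots)
    changed = True
    while changed:
        changed = False
        for s, _p, o in triples:
            if s in reachable and o not in reachable:
                reachable.add(o)
                changed = True
    return all(s in reachable for s, _p, _o in triples)

ok = is_acceptable([("?r", "dct:language", "iso639:ibo"),
                    ("?r", "dct:type", "metashare:corpus")], limit=100)
print(ok)  # True
```

Because every pattern hangs off one root, such a query can be answered by a bounded walk outward from the root bindings, which is what makes near-constant query time feasible.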
5. USE CASES
As a proof of concept for Linghub, we discuss a number of realistic use cases that demonstrate the type of queries that can be answered using Linghub. In order to obtain realistic use cases, we collected queries for language resources from the Corpora List, a mailing list used by researchers in corpus linguistics to discuss corpora. From the questions posed in February 2015, three queries are considered below, as they are clear and well-stated questions that would have feasible answers. We chose these queries to provide an illustrative example of queries that can be directly answered using the Linghub portal, while discarding many other questions that were vague, unclearly stated or misused linguistic terminology. We discuss these queries, show how they can be formalized as SPARQL queries against Linghub and discuss whether reasonable answers were retrieved.

1. "[...] desparately needs an Igbo corpus." (Thapelo J. Otlogetswe, Feb. 5th 2015, http://mailman.uib.no/public/corpora/2015-February/021993.html)

   Igbo is a language of Nigeria and Equatorial Guinea and is identified by the language code ibo. Simply typing "Igbo" into the search interface of Linghub finds a number of resources that could be used. For many of these resources, Igbo is the value of the Dublin Core property subject; although there is a language property, some sources decided not to use this Dublin Core category. In addition, these resources are marked with a type that is mapped to the META-SHARE corpus individual even though the resources do not originate from META-SHARE, due to our harmonization. We can search for both language and subject with the following query (note that we have implemented some syntactic extensions to SPARQL; the | operator is a UNION with the same subject):

       SELECT ?resource WHERE {
         ?resource dct:language iso639:ibo |
                   dc:subject "Igbo" ;
                   dct:type metashare:corpus .
       }

2. "I am looking for a Lithuanian gigaword corpus for a research project." (Márton Makrai, Feb. 24th 2015, http://mailman.uib.no/public/corpora/2015-February/022103.html)

   Finding a corpus for a European language such as Lithuanian is generally not a challenge; however, this user also requires that the resource contain over one billion words. We can easily use the META-SHARE properties to return a list of corpora with their associated sizes, as follows:

       SELECT ?resource ?size WHERE {
         ?resource ms:corpusInfo [
           ms:languageInfo [
             dct:language iso639:lit ;
             ms:sizePerLanguage [
               ms:size ?size ;
               ms:sizeUnit ms:words ] ] ] .
       }

   Unfortunately, the results of this query show that no Lithuanian resource in Linghub is over one billion words in size.

3. "I am looking for freely available geotagged tweets collection for research purpose." (Md. Hasanuzzaman, Feb. 16th 2015, http://mailman.uib.no/public/corpora/2015-February/022044.html)

   Several of the search terms here are unfortunately not found anywhere in our data, namely 'geotagged' and 'tweets'. It would still be possible for this query to be answered by looking at related keywords such as 'Twitter', and other aspects of the query (e.g., 'for research purpose') can be handled by means of the META-SHARE vocabulary.

In summary, we saw that in two of the three cases the user's need could be clearly expressed as a SPARQL query; in one of those cases the query returned an answer as required, while in the second case no suitable dataset is recorded. In the final case, the user's query does not match the structured data found in Linghub, but related resources can be found using free-text search. As such, we see that Linghub enables users to find resources better than with previous approaches, although it is still not satisfactory for all user queries. In particular, the crucial defect in the final query is that there is no specific metadata indicating whether a resource comes from a social media site, and handling this better would require a deeper understanding of the textual components of resource descriptions.
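The related-keyword fallback suggested for the third use case can be sketched as a small query expansion over the free-text descriptions: when a term such as 'tweets' occurs nowhere in the data, related keywords such as 'Twitter' are tried as well. The synonym table and record descriptions below are illustrative assumptions, not part of Linghub.

```python
# Sketch of the free-text fallback: expand each query term with related
# keywords before matching against record descriptions. RELATED is a
# hypothetical expansion table.

RELATED = {"tweets": ["twitter"], "geotagged": ["geolocation", "gps"]}

def fallback_search(terms, descriptions):
    """descriptions: dict record id -> free-text description."""
    hits = set()
    for term in terms:
        for keyword in [term] + RELATED.get(term, []):
            for rid, text in descriptions.items():
                if keyword in text.lower():
                    hits.add(rid)
    return hits

descs = {"lh:rec9": "A corpus of Twitter messages in several languages."}
print(fallback_search(["tweets"], descs))  # {'lh:rec9'}
```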
6. CONCLUSION
Linghub is a new site that collects data from a large number of sources and makes it queryable through a common mechanism. Furthermore, the data has not only been converted to RDF but has also been homogenized and linked to other datasets in the Linguistic Linked Open Data cloud. As such, this resource is likely to play a pivotal role in enabling not only humans but also software agents to find new resources and use them for applications in natural language processing and artificial intelligence.

Acknowledgments
This work has been funded by the LIDER project funded under the European Commission Seventh Framework Program (FP7-610782), the Cluster of Excellence Cognitive Interaction Technology 'CITEC' (EXC 277) at Bielefeld University, which is funded by the German Research Foundation (DFG), and the Insight Centre for Data Analytics, which is funded by Science Foundation Ireland under Grant Number SFI/12/RC/2289.

7. REFERENCES
[1] K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. Describing linked datasets with the VoID vocabulary. W3C Interest Group Note, The World Wide Web Consortium, 2011.
[2] S. Bird and G. Simons. Extending Dublin Core metadata to support the description and discovery of language resources. Computers and the Humanities, 37(4):375–388, 2003.
[3] D. Broeder, M. Windhouwer, D. Van Uytvanck, T. Goosen, and T. Trippel. CMDI: a component metadata infrastructure. In Describing LRs with Metadata: Towards Flexibility and Interoperability in the Documentation of LR Workshop Programme, page 1, 2012.
[4] N. Calzolari, R. Del Gratta, G. Francopoulo, J. Mariani, F. Rubino, I. Russo, and C. Soria. The LRE Map. Harmonising community descriptions of resources. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 1084–1089, 2012.
[5] K. Choukri, V. Arranz, O. Hamon, and J. Park. Using the International Standard Language Resource Number: Practical and technical aspects. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 50–54, 2012.
[6] G. de Melo. Lexvo.org: Language-related information for the linguistic linked data cloud. Semantic Web, 2013.
[7] C. Federmann, I. Giannopoulou, C. Girardi, O. Hamon, D. Mavroeidis, S. Minutoli, and M. Schröder. META-SHARE v2: An open network of repositories for language resources including data and tools. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 3300–3303, 2012.
[8] M. Gavrilidou, P. Labropoulou, E. Desipri, S. Piperidis, H. Papageorgiou, M. Monachini, F. Frontini, T. Declerck, G. Francopoulo, V. Arranz, et al. The META-SHARE metadata schema for the description of language resources. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 1090–1097, 2012.
[9] J. Kunze and T. Baker. The Dublin Core metadata element set. RFC 5013, Internet Engineering Task Force, 2007.
[10] F. Maali, J. Erickson, and P. Archer. Data Catalog Vocabulary (DCAT). W3C Recommendation, The World Wide Web Consortium, 2014.
[11] J. P. McCrae, P. Cimiano, V. Rodríguez-Doncel, D. Vila-Suero, J. Gracia, L. Matteis, R. Navigli, A. Abele, G. Vulcu, and P. Buitelaar. Reconciling heterogeneous descriptions of language resources. In Proceedings of the 4th Workshop on Linked Data in Linguistics, 2015.
[12] J. P. McCrae, P. Labropoulou, J. Gracia, M. Villegas, V. Rodríguez-Doncel, and P. Cimiano. One ontology to bind them all: The META-SHARE OWL ontology for the interoperability of linguistic datasets on the Web. In Proceedings of the 4th Workshop on the Multilingual Semantic Web, 2015.
[13] A. Moro, A. Raganato, and R. Navigli. Entity linking meets word sense disambiguation: A unified approach. Transactions of the Association for Computational Linguistics (TACL), 2:231–244, 2014.
[14] S. Piperidis. The META-SHARE language resources sharing infrastructure: Principles, challenges, solutions. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 36–42, 2012.
[15] A. Seaborne, K. G. Clark, L. Feigenbaum, and E. Torres. SPARQL 1.1 Query Results JSON Format. W3C Recommendation, The World Wide Web Consortium, 2013.
[16] H. Tohyama, S. Kozawa, K. Uchimoto, S. Matsubara, and H. Isahara. SHACHI: A large scale metadata database of language resources. In Proceedings of the 1st International Conference on Global Interoperability for Language Resources, pages 205–212, 2008.
[17] D. Van Uytvanck, H. Stehouwer, and L. Lampen. Semantic metadata mapping in practice: The Virtual Language Observatory. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 1029–1034, 2012.