Linghub: a Linked Data Based Portal Supporting the Discovery of Language Resources

John P. McCrae (Insight Centre for Data Analytics, National University of Ireland, Galway, Galway, Ireland; john@mccr.ae)
Philipp Cimiano (Cognitive Interaction Technology, Cluster of Excellence, Bielefeld University, Bielefeld, Germany; cimiano@cit-ec.uni-bielefeld.de)

ABSTRACT
Language resources are an essential component of any natural language processing system, and such systems can only be applied to new languages and domains if appropriate resources can be found. Currently, the task of finding language resources for a particular task or application is complicated by the fact that records about such resources are stored in different repositories with different models, different quality and different search mechanisms. To remedy this situation, we present Linghub, a new portal that aggregates and indexes data from a range of sources and repositories and applies the Linked Data principles to expose all the metadata under a common interface. Furthermore, we use faceted browsing and SPARQL queries to show how this can help to answer real user problems extracted from a mailing list for linguists.

Keywords
Linked Data, Language Resources, SPARQL, Faceted Browsing

1. INTRODUCTION
Language resources are essential for nearly all tasks in natural language processing (NLP), and in particular for the adaptation of resources and methods to new domains and languages. In order to use language resources for new purposes they must first be discovered, and this can only be done if there is a comprehensive list of all resources that may be available. To this end, there have been a number of projects that have attempted to collect such a catalogue using various methods and with differing degrees of data quality. We present a new portal, Linghub, that aims to integrate all these data from different sources by means of linked data and thus to create a website where all information about language resources can be included and queried using a common methodology. The goal of Linghub is thus to enable wider discovery of language resources for researchers in NLP, computational linguistics and linguistics.

Currently, two approaches to metadata collection for language resources can be distinguished. The first is a curatorial approach, in which a repository of language resource metadata is maintained by a cross-institution organization such as META-SHARE [7] or the CLARIN project's Virtual Language Observatory [17, VLO]. This approach is characterized by high-quality metadata entered by experts, at the expense of coverage. A collaborative approach, on the other hand, allows anyone to publish language resource metadata; examples of this are the LREMap [4] and Datahub (http://datahub.io). A process for controlling the quality of the entered metadata is typically lacking in such collaborative repositories, leading to lower-quality and inhomogeneous metadata resulting from free-text fields, user-provided tags and the lack of controlled vocabularies.

Given this difference, we wish to make data available from multiple sources in a homogeneous manner, and we saw the development of a new linked data portal as the primary method to achieve this. To this end we adopted a model based on the DCAT data model [10] along with properties from Dublin Core [9]. In addition, we used the RDF version [12] of the META-SHARE model [8] to provide metadata properties that are specific to language data and linguistic research. As such, in this paper we describe the creation of the largest collection of information about language resources and briefly describe its publication on the Web by means of linked data principles.

The rest of the paper is structured as follows: firstly, we describe related work in Section 2, then the collection and processing of the data in Section 3. Next, we describe the portal and how we envision users can access the data in Section 4, and examine how real user queries can be answered with Linghub in Section 5. Finally, we conclude in Section 6.
2. RELATED WORK
There have been several attempts to collect metadata about language resources, mostly associated with large infrastructure projects. CLARIN has been collecting resources under a project called the Virtual Language Observatory [17], using the Component Metadata Infrastructure [3, CMDI] to collect common metadata values from multiple sources. A similar project is META-SHARE [14] from the META-NET project, where language resources are collected and high-quality, manual entries are created for each record. Similarly, the Open Language Archives Community [2, OLAC] collects data from a number of sources, although the metadata collected is not itself open. Another related project called SHACHI has also collected some metadata [16]. There has also been an attempt to track language resources by assigning them an International Standard Language Resource Number (ISLRN), similar to the ISBN used to track books [5].

In contrast, some projects have instead collected data directly from the creators of the resources: for example, the LRE-Map [4] collects data from authors of papers submitted to conferences such as LREC. Similarly, Datahub collects resources directly from those submitted to the website, but focuses primarily on linked data resources.

Source       Records    Triples
Datahub          185     10,739
LRE-Map          682     10,650
META-SHARE     2,442    464,572
CLARIN VLO   144,138  3,605,196
All          147,447  4,091,157

Table 1: Size of Linghub datasets by source

3. DATASET
In order to ensure that all the data from the many sources can be queried in a homogeneous manner, we made sure that the metadata from all the repositories mentioned in Table 1 was available as RDF. In doing this, we aligned the proprietary schemas used in these repositories to well-known Semantic Web vocabularies and fixed existing modeling errors, such as using percent-encoded URIs for titles of resources or introducing URL links that would never resolve.

Two of our sources, LRE-Map and Datahub, were already available in RDF, so the conversion mainly involved developing an appropriate URL schema so that datasets were uniquely identified, thus avoiding collisions when uploading data into the Linghub portal. A number of quality issues were also fixed in this transformation, such as deciding whether property values should be literals or URIs, reducing the number of blank nodes and reusing existing metadata vocabularies such as VoID [1].

The other sources (CLARIN VLO and META-SHARE) were available in XML. We developed a custom converter for each of these resources, building on a transformation language similar to XSLT which we developed. For META-SHARE, this was a challenging task, as there were nearly a thousand unique tags defined and each one was examined to see if it was similar to an existing Semantic Web vocabulary; in fact, we ended up mapping to FOAF (http://xmlns.com/foaf/spec/), SWRC (http://ontoware.org/swrc/) and the Media Ontology (http://www.w3.org/TR/mediaont-10/). In the case of CLARIN, there was a significant difference between the XML schemas used by each contributing instance, with only a small common section giving the resource title and download link. We thus developed distinct mappings for the five largest institutes.

Two key issues emerge when collecting data from a heterogeneous set of sources such as ours. Firstly, the data is likely to be noisy and inconsistent in the properties it uses and, more importantly, in the values that these properties have. For example, languages may be represented by their English names or alternatively by codes such as the ISO 639 codes (http://www.iso.org/iso/home/standards/language_codes.htm).

Secondly, it happens relatively frequently that dataset descriptions are duplicated because they are contained in multiple source repositories (currently this affects 5.0% of resources). Furthermore, intra-repository duplicates also exist, resulting from the fact that in some repositories one metadata record is created for each language a resource is available in (this is the case for CLARIN, for instance, and affects 35.0% of all resources). In order to remove these duplications we used state-of-the-art word sense disambiguation techniques, including Babelfy [13], to identify common controlled vocabularies and duplicate entries. For property values we mapped to several existing resources, including LexVo [6] for languages and BabelNet for resource types. Duplicate entries were not removed from the dataset but instead were marked with the Dublin Core property "is replaced by". In the case that these entries were subsets of resources, the target of this link is a new combined record for the entire resource; in the case of duplicate records collected from distinct sources, we refer to the most complete record, that is, the record with the most triples. The harmonization process is described in more detail in McCrae et al. [11].

Currently, there is no direct method for users to provide metadata to the repository; however, it is foreseen that users could submit valid DCAT files to Linghub. We do note that Datahub allows any user to submit a dataset, and such datasets will quickly be picked up by Linghub and added to the repository in this manner.
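The duplicate-marking strategy above can be sketched in a few lines: among records judged to describe the same resource, the one with the most triples is treated as canonical and the others point to it via the Dublin Core "is replaced by" property. This is an illustrative reconstruction under stated assumptions, not Linghub's actual code; the record identifiers and the (subject, predicate, object) triple structure are assumptions.

```python
# Sketch of the duplicate-marking strategy: the record with the most
# triples is canonical; every other duplicate gains a dct:isReplacedBy
# link to it. Record ids and triple tuples here are illustrative.

def mark_duplicates(records):
    """records: dict mapping record id -> list of triples.
    Returns extra triples linking each non-canonical duplicate
    to the most complete record in the group."""
    # The canonical record is simply the one with the most triples.
    canonical = max(records, key=lambda rid: len(records[rid]))
    links = []
    for rid in records:
        if rid != canonical:
            links.append((rid, "dct:isReplacedBy", canonical))
    return links

dupes = {
    "lh:rec1": [("lh:rec1", "dct:title", "Igbo Corpus")],
    "lh:rec2": [("lh:rec2", "dct:title", "Igbo Corpus"),
                ("lh:rec2", "dct:language", "iso639:ibo")],
}
print(mark_duplicates(dupes))
# [('lh:rec1', 'dct:isReplacedBy', 'lh:rec2')]
```

Keeping the duplicates and linking them, rather than deleting them, preserves provenance: the original source record remains dereferenceable while queries can follow the link to the most complete description.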
4. THE LINGHUB PORTAL
In order to enable users to quickly and easily discover datasets, we set up a portal for browsing the data. Naturally, we set this up as a site that publishes the individual records as either RDF or HTML, with the actual content delivered to the client decided by means of content negotiation. We developed templates that render the RDF in a readable manner while still staying close to the data, so that users get a consistent view of a dataset record even if it came from a different original source and hence has very different properties. In addition, we provide a number of mechanisms by which users and automated agents can discover a dataset. For users, we allow resources to be discovered by means of faceted browsing, enabling users to select properties and their values.
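The content negotiation described above can be sketched as a simple dispatch on the HTTP Accept header: RDF media types get the machine-readable record, HTML media types (and unknown agents) get the rendered template. The media types are standard, but the dispatch logic itself is an illustrative assumption, not Linghub's actual server code; real negotiation would also honour q-values, which this sketch ignores.

```python
# Sketch of content negotiation for a record URL: the same address serves
# RDF to machine clients and HTML to browsers, based on the Accept header.
# Illustrative only; q-value weighting is deliberately omitted.

def choose_representation(accept_header):
    """Return 'rdf' or 'html' for a record request (first match wins)."""
    accepted = [part.split(";")[0].strip() for part in accept_header.split(",")]
    for media_type in accepted:
        if media_type in ("text/turtle", "application/rdf+xml", "application/ld+json"):
            return "rdf"
        if media_type in ("text/html", "application/xhtml+xml"):
            return "html"
    return "html"  # browsers and unknown agents get the readable view

print(choose_representation("text/turtle"))                      # rdf
print(choose_representation("text/html,application/xhtml+xml"))  # html
```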
We fixed the list of facet properties in advance to those that have been harmonized, so as not to overload the user with choices for properties that occur in only a few datasets, and also to enable the compilation of indexes that speed up page load times. In addition, the front page of Linghub contains a free-text search engine allowing users to query fields by property. This free-text search engine is powered by a separate index which includes not only the text of data properties but also the labels of URIs which appear as the values of object properties.
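The free-text index just described can be sketched as a small inverted index: literal values are tokenized directly, while URI-valued objects contribute the label of the URI rather than the URI string itself. This is a minimal sketch under stated assumptions; the label table, record structure and tokenization are all illustrative, not Linghub's implementation.

```python
# Sketch of the free-text index: literals are indexed as text, URI objects
# are indexed via their labels. LABELS is a hypothetical label lookup.

LABELS = {"iso639:ibo": "Igbo"}  # assumed label table for URI values

def build_index(records):
    """records: dict record id -> list of (kind, value) pairs, where kind
    is 'literal' or 'uri'. Returns token -> set of record ids."""
    index = {}
    for rid, pairs in records.items():
        for kind, value in pairs:
            text = value if kind == "literal" else LABELS.get(value, "")
            for token in text.lower().split():
                index.setdefault(token, set()).add(rid)
    return index

recs = {"lh:rec1": [("literal", "Igbo Corpus"), ("uri", "iso639:ibo")]}
print(sorted(build_index(recs)))  # ['corpus', 'igbo']
```

Indexing labels as well as literals is what lets a search for "Igbo" match a record whose language is recorded only as the URI iso639:ibo.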
Machine-based agents may access the endpoint by means of SPARQL querying, although the endpoint limits agents to a subset of the SPARQL query language. The goal of this is to ensure constant query time without overloading our server. The nature of SPARQL makes it very easy for users to write queries of a complexity that would not be easy to answer, and other sites have attempted to handle this by enforcing timeouts on SPARQL queries. In general we find this solution to be sub-optimal, as it means that queries may fail unpredictably if the server has many concurrent connections. Instead, we limit the complexity of the queries themselves by requiring that they have certain properties that make them easy to answer. These include:

1. A required limit on the number of results;
2. The predicate may not be a variable, thus limiting the number of results;
3. The query must be a 'tree', in that every triple must be connected from a single root node.

Furthermore, the SPARQL endpoint by default returns SPARQL-JSON results [15], so that the results may be easily consumed. This is based on the fact that many clients, notably client-side JavaScript in browsers, will not accept XML due to security concerns. Other clients may still obtain SPARQL-XML by supplying the appropriate header or parameter in the query.

Figure 1: A screenshot of the Linghub interface
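The three restrictions above can be checked on a pre-parsed query before execution. The sketch below assumes a query is represented as a list of (subject, predicate, object) patterns plus a LIMIT value, with variables written with a leading '?'; this representation and the checking code are illustrative, not the actual endpoint implementation.

```python
# Sketch of the three query restrictions: a required LIMIT, no variable
# predicates, and a tree shape (all triples reachable from a single root
# by following subject -> object links). Query representation is assumed.

def is_acceptable(triples, limit):
    # Rule 1: a result limit is required.
    if limit is None:
        return False
    # Rule 2: the predicate may not be a variable.
    if any(p.startswith("?") for _s, p, _o in triples):
        return False
    # Rule 3: the query must be a tree rooted at a single node.
    subjects = {s for s, _p, _o in triples}
    objects = {o for _s, _p, o in triples}
    roots = subjects - objects
    if len(roots) != 1:
        return False
    reachable = set(roots)
    changed = True
    while changed:
        changed = False
        for s, _p, o in triples:
            if s in reachable and o not in reachable:
                reachable.add(o)
                changed = True
    return all(s in reachable for s, _p, _o in triples)

ok = is_acceptable([("?r", "dct:language", "iso639:ibo"),
                    ("?r", "dct:type", "metashare:corpus")], limit=100)
print(ok)  # True
```

Because every pattern hangs off one root, such a query can be answered by a bounded walk outward from the root bindings, which is what makes near-constant query time feasible.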
5. USE CASES
As a proof of concept for Linghub, we discuss a number of realistic use cases that demonstrate the type of queries that can be answered using Linghub. In order to obtain realistic use cases, we collected queries for language resources from the Corpora List, a mailing list used by researchers in corpus linguistics to discuss corpora. From the questions posed in February 2015, three queries are considered below, as they are clear and well-stated questions that would have feasible answers. We chose these queries to provide an illustrative example of queries that can be directly answered using the Linghub portal, while discarding many other questions that were vague, unclearly stated or misused linguistic terminology. We discuss these queries, show how they can be formalized as SPARQL queries against Linghub and discuss whether reasonable answers were retrieved.

1. "[...] desparately needs an Igbo corpus." (Thapelo J. Otlogetswe, Feb. 5th 2015, http://mailman.uib.no/public/corpora/2015-February/021993.html)

   Igbo is a language of Nigeria and Equatorial Guinea and is identified by the language code ibo. Simply typing "Igbo" into the search interface of Linghub finds a number of resources that could be used. For many of these resources, Igbo is the value of the Dublin Core property subject; although there is a language property, some sources decided not to use this Dublin Core category. In addition, these resources are marked with a type that is mapped to the META-SHARE corpus individual even though the resources do not originate from META-SHARE, due to our harmonization. We can search for both language and subject with the following query (note that we have implemented some syntactic extensions to SPARQL; the | operator is a UNION with the same subject):

       SELECT ?resource WHERE {
         ?resource dct:language iso639:ibo |
                   dc:subject "Igbo" ;
                   dct:type metashare:corpus .
       }

2. "I am looking for a Lithuanian gigaword corpus for a research project." (Márton Makrai, Feb. 24th 2015, http://mailman.uib.no/public/corpora/2015-February/022103.html)

   Finding a corpus for a European language such as Lithuanian is generally not a challenge; however, this user also requires that the resource contain over one billion words. We can easily use the META-SHARE properties to return a list of corpora with their associated sizes, as follows:

       SELECT ?resource ?size WHERE {
         ?resource ms:corpusInfo [
           ms:languageInfo [
             dct:language iso639:lit ;
             ms:sizePerLanguage [
               ms:size ?size ;
               ms:sizeUnit ms:words ] ] ] .
       }

   Unfortunately, the results of this query show that no Lithuanian resource in Linghub is over one billion words in size.

3. "I am looking for freely available geotagged tweets collection for research purpose." (Md. Hasanuzzaman, Feb. 16th 2015, http://mailman.uib.no/public/corpora/2015-February/022044.html)

   Several of the search terms here are unfortunately not found anywhere in our data, namely 'geotagged' and 'tweets'. It would still be possible for this query to be answered by looking at related keywords such as 'Twitter', and other aspects of the query (e.g., 'for research purpose') can be handled by means of the META-SHARE vocabulary.

In summary, we saw that in two of the three cases the user's need could be clearly expressed as a SPARQL query; in one of those cases the query returned an answer as required, while in the second case no suitable dataset is recorded. In the final case, the user's query does not match the structured data found in Linghub, but related resources can be found using free-text search. As such, we see that Linghub enables users to find resources better than with previous approaches, although it is still not satisfactory for all user queries. In particular, the crucial defect in the final query is that there is no specific metadata indicating whether a resource comes from a social media site, and handling this better would require a deeper understanding of the textual components of resource descriptions.
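The related-keyword fallback suggested for the third use case can be sketched as a small query expansion over the free-text descriptions: when a term such as 'tweets' occurs nowhere in the data, related keywords such as 'Twitter' are tried as well. The synonym table and record descriptions below are illustrative assumptions, not part of Linghub.

```python
# Sketch of the free-text fallback: expand each query term with related
# keywords before matching against record descriptions. RELATED is a
# hypothetical expansion table.

RELATED = {"tweets": ["twitter"], "geotagged": ["geolocation", "gps"]}

def fallback_search(terms, descriptions):
    """descriptions: dict record id -> free-text description."""
    hits = set()
    for term in terms:
        for keyword in [term] + RELATED.get(term, []):
            for rid, text in descriptions.items():
                if keyword in text.lower():
                    hits.add(rid)
    return hits

descs = {"lh:rec9": "A corpus of Twitter messages in several languages."}
print(fallback_search(["tweets"], descs))  # {'lh:rec9'}
```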
6. CONCLUSION
Linghub is a new site that collects data from a large number of sources and makes it queryable through a common mechanism. Furthermore, the data has not only been converted to RDF but has also been homogenized and linked to other datasets in the Linguistic Linked Open Data cloud. As such, this resource is likely to play a pivotal role in enabling not only humans but also software agents to find new resources and use them for applications in natural language processing and artificial intelligence.

Acknowledgments
This work has been funded by the LIDER project funded under the European Commission Seventh Framework Program (FP7-610782), the Cluster of Excellence Cognitive Interaction Technology 'CITEC' (EXC 277) at Bielefeld University, which is funded by the German Research Foundation (DFG), and the Insight Centre for Data Analytics, which is funded by Science Foundation Ireland under Grant Number SFI/12/RC/2289.

7. REFERENCES
[1] K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. Describing linked datasets with the VoID vocabulary. W3C Interest Group Note, The World Wide Web Consortium, 2011.
[2] S. Bird and G. Simons. Extending Dublin Core metadata to support the description and discovery of language resources. Computers and the Humanities, 37(4):375–388, 2003.
[3] D. Broeder, M. Windhouwer, D. Van Uytvanck, T. Goosen, and T. Trippel. CMDI: a component metadata infrastructure. In Describing LRs with Metadata: Towards Flexibility and Interoperability in the Documentation of LR Workshop Programme, page 1, 2012.
[4] N. Calzolari, R. Del Gratta, G. Francopoulo, J. Mariani, F. Rubino, I. Russo, and C. Soria. The LRE Map. Harmonising community descriptions of resources. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 1084–1089, 2012.
[5] K. Choukri, V. Arranz, O. Hamon, and J. Park. Using the International Standard Language Resource Number: Practical and technical aspects. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 50–54, 2012.
[6] G. de Melo. Lexvo.org: Language-related information for the linguistic linked data cloud. Semantic Web, 2013.
[7] C. Federmann, I. Giannopoulou, C. Girardi, O. Hamon, D. Mavroeidis, S. Minutoli, and M. Schröder. META-SHARE v2: An open network of repositories for language resources including data and tools. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 3300–3303, 2012.
[8] M. Gavrilidou, P. Labropoulou, E. Desipri, S. Piperidis, H. Papageorgiou, M. Monachini, F. Frontini, T. Declerck, G. Francopoulo, V. Arranz, et al. The META-SHARE metadata schema for the description of language resources. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 1090–1097, 2012.
[9] J. Kunze and T. Baker. The Dublin Core metadata element set. RFC 5013, Internet Engineering Task Force, 2007.
[10] F. Maali, J. Erickson, and P. Archer. Data Catalog Vocabulary (DCAT). W3C Recommendation, The World Wide Web Consortium, 2014.
[11] J. P. McCrae, P. Cimiano, V. Rodríguez-Doncel, D. Vila-Suero, J. Gracia, L. Matteis, R. Navigli, A. Abele, G. Vulcu, and P. Buitelaar. Reconciling heterogeneous descriptions of language resources. In Proceedings of the 4th Workshop on Linked Data in Linguistics, 2015.
[12] J. P. McCrae, P. Labropoulou, J. Gracia, M. Villegas, V. Rodríguez-Doncel, and P. Cimiano. One ontology to bind them all: The META-SHARE OWL ontology for the interoperability of linguistic datasets on the Web. In Proceedings of the 4th Workshop on the Multilingual Semantic Web, 2015.
[13] A. Moro, A. Raganato, and R. Navigli. Entity linking meets word sense disambiguation: A unified approach. Transactions of the Association for Computational Linguistics (TACL), 2:231–244, 2014.
[14] S. Piperidis. The META-SHARE language resources sharing infrastructure: Principles, challenges, solutions. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 36–42, 2012.
[15] A. Seaborne, K. G. Clark, L. Feigenbaum, and E. Torres. SPARQL 1.1 Query Results JSON Format. W3C Recommendation, The World Wide Web Consortium, 2013.
[16] H. Tohyama, S. Kozawa, K. Uchimoto, S. Matsubara, and H. Isahara. SHACHI: A large scale metadata database of language resources. In Proceedings of the 1st International Conference on Global Interoperability for Language Resources, pages 205–212, 2008.
[17] D. Van Uytvanck, H. Stehouwer, and L. Lampen. Semantic metadata mapping in practice: The Virtual Language Observatory. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 1029–1034, 2012.