=Paper=
{{Paper
|id=Vol-1481/paper27
|storemode=property
|title=Linghub: a Linked Data Based Portal Supporting the Discovery of Language Resources
|pdfUrl=https://ceur-ws.org/Vol-1481/paper27.pdf
|volume=Vol-1481
|dblpUrl=https://dblp.org/rec/conf/i-semantics/McCraeC15
}}
==Linghub: a Linked Data Based Portal Supporting the Discovery of Language Resources==
John P. McCrae (1,2) and Philipp Cimiano (2)

1: Insight Centre for Data Analytics, National University of Ireland, Galway, Galway, Ireland (john@mccr.ae)
2: Cognitive Interaction Technology, Cluster of Excellence, Bielefeld University, Bielefeld, Germany (cimiano@cit-ec.uni-bielefeld.de)
ABSTRACT
Language resources are an essential component of any natural language processing system, and such systems can only be applied to new languages and domains if appropriate resources can be found. Currently, the task of finding language resources for a particular task or application is complicated by the fact that records about such resources are stored in different repositories with different models, differing quality and different search mechanisms. To remedy this situation, we present Linghub, a new portal that aggregates and indexes data from a range of sources and repositories and applies the Linked Data principles to expose all the metadata under a common interface. Furthermore, we use faceted browsing and SPARQL queries to show how this can help to answer real user problems extracted from a mailing list for linguists.

Keywords
Linked Data, Language Resources, SPARQL, Faceted Browsing

1. INTRODUCTION
Language resources are essential for nearly all tasks in natural language processing (NLP), and in particular for the adaptation of resources and methods to new domains and languages. In order to use language resources for new purposes they must first be discovered, and this can only be done if there is a comprehensive list of all resources that may be available. To this end there have been a number of projects that have attempted to collect such a catalogue, using various methods and with differing degrees of data quality. We present a new portal, Linghub, that aims to integrate all these data from different sources by means of linked data and thus to create a website where all information about language resources can be included and queried using a common methodology. The goal of Linghub is thus to enable wider discovery of language resources for researchers in NLP, computational linguistics and linguistics.

Currently, two approaches to metadata collection for language resources can be distinguished. The first is a curatorial approach, in which a repository of language resource metadata is maintained by a cross-institution organization such as META-SHARE [7] or the CLARIN project's Virtual Language Observatory [17, VLO]. This approach is characterized by high-quality metadata entered by experts, at the expense of coverage. A collaborative approach, on the other hand, allows anyone to publish language resource metadata; examples of this are the LRE-Map [4] and Datahub (http://datahub.io). A process for controlling the quality of the entered metadata is typically lacking for such collaborative repositories, leading to lower-quality and inhomogeneous metadata resulting from free-text fields, user-provided tags and the lack of controlled vocabularies.

Given this difference, we wish to make data available from multiple sources in a homogeneous manner, and we saw the development of a new linked data portal as the primary method to achieve this. To this end we adopted a model based on the DCAT data model [10] along with properties from Dublin Core [9]. In addition, we used the RDF version [12] of the META-SHARE model [8] to provide metadata properties that are specific to language data and linguistic research. As such, in this paper we describe the creation of the largest collection of information about language resources and briefly describe its publication on the Web by means of linked data principles.

The rest of the paper is structured as follows: firstly, we describe related work in Section 2, then the collection and processing of data in Section 3. Next, we describe the portal and how we envision users can access the data in Section 4, and examine how real user queries could be answered with Linghub in Section 5. Finally, we conclude in Section 6.

2. RELATED WORK
There have been several attempts to collect metadata about language resources, mostly associated with large infrastructure projects. CLARIN has been collecting resources under a project called the Virtual Language Observatory [17], using the Component Metadata Infrastructure [3, CMDI] to collect common metadata values from multiple sources. A similar project is META-SHARE [14] from the META-NET project, where language resources are collected and high-quality, manual entries are created for each record. Similarly, the Open Language Archives Community [2, OLAC] collects data from a number of sources, although the metadata collected is not itself open. Another related project, SHACHI, has also collected some metadata [16]. There has also been an attempt to track language resources by assigning them an International Standard Language Resource Number (ISLRN), similar to the ISBN used to track books [5].
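The record model adopted in the introduction (DCAT plus Dublin Core, extended with META-SHARE properties) can be made concrete with a small sketch. Everything below is invented for illustration: the subject IRI, the dataset title and the download URL are hypothetical; only the Dublin Core Terms and DCAT property IRIs and the Lexvo language IRI scheme are real vocabularies cited in the text.

```python
# Illustrative sketch of a Linghub-style dataset record combining DCAT and
# Dublin Core properties, serialized as N-Triples with only the standard
# library. All record values below are invented for illustration.

DCT = "http://purl.org/dc/terms/"
DCAT = "http://www.w3.org/ns/dcat#"

def to_ntriples(subject: str, triples: list) -> str:
    """Serialize (predicate IRI, object, is_iri) tuples as N-Triples lines."""
    lines = []
    for pred, obj, is_iri in triples:
        obj_term = f"<{obj}>" if is_iri else '"' + obj.replace('"', '\\"') + '"'
        lines.append(f"<{subject}> <{pred}> {obj_term} .")
    return "\n".join(lines)

# A hypothetical harmonized record (subject and values are made up).
record = [
    (DCT + "title", "Example Igbo Corpus", False),
    (DCT + "language", "http://lexvo.org/id/iso639-3/ibo", True),
    (DCAT + "downloadURL", "http://example.org/igbo-corpus.zip", True),
]

print(to_ntriples("http://linghub.example/dataset/1", record))
```

Representing every source record in one such shared vocabulary is what makes the cross-repository querying described below possible.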
Source        Records    Triples
Datahub       185        10,739
LRE-Map       682        10,650
META-SHARE    2,442      464,572
CLARIN VLO    144,138    3,605,196
All           147,447    4,091,157

Table 1: Size of Linghub datasets by source

In contrast, some projects have instead collected data directly from the creators of the resources; for example, the LRE-Map [4] collects data from authors of papers submitted to conferences such as LREC. Similarly, Datahub collects resources directly from those submitted to the website, but focuses primarily on linked data resources.

3. DATASET
In order to ensure that all the data from many sources can be queried in a homogeneous manner, we made sure that the metadata from all the repositories mentioned in Table 1 was available as RDF. In doing this, we aligned the proprietary schemas used in these repositories to well-known Semantic Web vocabularies and fixed existing modeling errors, such as percent-encoded URIs used for titles of resources or URL links that would never resolve. Two of our resources, LRE-Map and Datahub, were already available in RDF, so the conversion mainly involved developing an appropriate URL schema so that datasets were uniquely identified, thus avoiding collisions when uploading data into the Linghub portal. A number of quality issues were also fixed in this transformation, such as deciding whether property values should be literals or URIs, reducing the number of blank nodes and reusing existing metadata vocabularies such as VoID [1].

The other resources (CLARIN VLO and META-SHARE) were available in XML. We developed a custom converter for each of these resources, building on a transformation language similar to XSLT which we developed ourselves. For META-SHARE, this was a challenging task, as there were nearly a thousand unique tags defined and each one was examined to see whether it was similar to an existing Semantic Web vocabulary; in fact, we ended up mapping to FOAF (http://xmlns.com/foaf/spec/), SWRC (http://ontoware.org/swrc/) and the Media Ontology (http://www.w3.org/TR/mediaont-10/). In the case of CLARIN, there was a significant difference between the XML schemas used by each contributing instance, with only a small common section giving the resource title and download link. We thus developed distinct mappings for the five largest institutes.

Two key issues emerge when collecting data from such a heterogeneous set of sources. Firstly, the data is likely to be noisy and inconsistent in the properties it uses and, more importantly, in the values that these properties have. For example, languages may be represented by their English names or alternatively by codes such as the ISO 639 codes (http://www.iso.org/iso/home/standards/language_codes.htm).

Secondly, it happens relatively frequently that dataset descriptions are duplicated because they are contained in multiple source repositories (currently this affects 5.0% of resources). Furthermore, intra-repository duplicates also exist, resulting from the fact that in some repositories one metadata record is created for each language a resource is available in (this is the case for CLARIN, for instance, and represents 35.0% of all resources). In order to remove these duplications we used state-of-the-art word sense disambiguation techniques, including Babelfy [13], to identify common controlled vocabularies and duplicate entries. For properties we mapped to several existing resources, including LexVo [6] for languages and BabelNet for resource types. Duplicate entries were not removed from the dataset but were instead marked with the Dublin Core property "is replaced by". In the case that these entries were subsets of resources, the target of this link was a new combined record for the entire resource, and in the case of duplicate records collected from distinct sources we referred to the most complete record, that is, the record with the most triples. The harmonization is described in more detail in McCrae et al. [11].

Currently, there is no direct method for users to provide metadata to the repository; however, it is foreseen that users could submit valid DCAT files to Linghub. We note that Datahub allows any user to submit a dataset, and such datasets will quickly be picked up by Linghub and added to the repository in this manner.

4. THE LINGHUB PORTAL
In order to enable users to quickly and easily discover datasets, we set up a portal for browsing the data. Naturally, we set this up as a site that publishes the individual records as either RDF or HTML, with the actual content delivered to the client decided by means of content negotiation. We developed templates that render the RDF in a readable manner while still staying close to the data, so that users get a consistent view of a dataset record even if it came from a different original source and hence has very different properties. In addition, we provide a number of mechanisms by which users and automated agents can discover a dataset. For users, we allow resources to be discovered by means of faceted browsing, enabling users to select properties and their values. We fixed the list of properties in advance to those that have been harmonized, so as not to overload the user with choices for properties that occur only for a few datasets, and also to enable the compilation of indexes that speed up page load times. In addition, the front page of Linghub contains a free-text search engine allowing users to query fields by property. This search engine is powered by a separate index which includes not only the text of data properties but also the labels of URIs that appear as values of object properties. Machine-based agents may access the endpoint by means of SPARQL querying, although the endpoint limits agents to a subset of the SPARQL query language. The goal of this is to ensure constant query time without overloading our server. The nature of SPARQL makes it very easy for users to write queries of a complexity that would not be easy to answer, and other sites have attempted to handle this by enforcing timeouts on SPARQL queries. In general we find this solution to be sub-optimal, as it means that queries may fail unpredictably if the server has many concurrent connections. Instead, we limit the complexity of the queries themselves by requiring that the triples have certain properties that can be easily answered. These include:

1. A required limit on the number of results;

2. The property may not be a variable, thus limiting the number of results;

3. The query must be a 'tree', in that every triple must be connected from a single root node.

Figure 1: A screenshot of the Linghub interface
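The restrictions above can also be checked client-side before a query is submitted. The following sketch is illustrative only: Linghub's actual endpoint enforces these rules server-side, a real implementation would use a proper SPARQL parser rather than string heuristics, and the tree-shape check (restriction 3) is omitted because it requires full parsing.

```python
# Heuristic, string-level check of the endpoint restrictions described
# above. Illustrative sketch only; not Linghub's actual implementation.
import re

def check_restrictions(query: str) -> list:
    """Return a list of violated restrictions (empty if the query passes)."""
    problems = []
    # Restriction 1: a limit on the number of results is required.
    if not re.search(r"\bLIMIT\s+\d+\b", query, re.IGNORECASE):
        problems.append("no LIMIT clause")
    # Restriction 2: the predicate may not be a variable. As a rough
    # approximation, two consecutive variables inside the WHERE block
    # (e.g. "?s ?p") indicate a variable in predicate position.
    body = query[query.find("{") + 1 : query.rfind("}")]
    if re.search(r"\?\w+\s+\?\w+", body):
        problems.append("variable in predicate position")
    return problems

# A constrained query passes the checks:
assert check_restrictions(
    "SELECT ?r WHERE { ?r dct:type metashare:corpus } LIMIT 100") == []

# A fully unconstrained query violates both:
assert check_restrictions("SELECT * WHERE { ?s ?p ?o }") == [
    "no LIMIT clause", "variable in predicate position"]
```

Rejecting such queries up front keeps response times predictable, which is the stated motivation for restricting the endpoint rather than imposing timeouts.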
Furthermore, the SPARQL endpoint by default returns SPARQL-JSON results [15], so that the results may be easily consumed. This is motivated by the fact that many clients, notably client-side JavaScript in browsers, will not accept XML due to security concerns. Other clients may still obtain SPARQL-XML by supplying the appropriate header or parameter in the query.

5. USE CASES
As a proof of concept for Linghub, we discuss a number of realistic use cases that demonstrate the type of queries that can be answered using Linghub. In order to obtain realistic use cases, we collected queries for language resources from the Corpora List, a mailing list used by researchers in corpus linguistics to discuss corpora. From the questions posed in February 2015, three queries are considered below, as they are clear and well-stated questions that would have feasible answers. We chose these queries to provide an illustrative example of queries that can be directly answered using the Linghub portal, while discarding many other questions that were vague, unclearly stated or misused linguistic terminology. We discuss these queries, show how they can be formalized as SPARQL queries against Linghub, and discuss whether reasonable answers were retrieved.

1. "[...] desparately needs an Igbo corpus." (Thapelo J. Otlogetswe, Feb. 5th 2015, http://mailman.uib.no/public/corpora/2015-February/021993.html)

Igbo is a language of Nigeria and Equatorial Guinea and is identified with the language code ibo. Simply typing "Igbo" into the search interface of Linghub finds a number of resources that could be used. For many of these resources Igbo is the value of the Dublin Core property subject; although there is a language property, some sources decided not to use this Dublin Core category. In addition, due to our harmonization these resources are marked with a type that is mapped to the META-SHARE corpus individual even though the resources do not originate from META-SHARE. We can search for both language and subject with the following query (note that we have implemented some syntactic extensions to SPARQL: the | operator is a UNION with the same subject):

SELECT ?resource WHERE {
  ?resource
    dct:language iso639:ibo |
    dc:subject "Igbo" ;
    dct:type metashare:corpus .
}

2. "I am looking for a Lithuanian gigaword corpus for a research project." (Márton Makrai, Feb. 24th 2015, http://mailman.uib.no/public/corpora/2015-February/022103.html)

Finding a corpus for a European language such as Lithuanian is generally not a challenge; however, this user also has the requirement that the resource contains over one billion words. We can easily use the META-SHARE properties to return to the user a list of corpora with their associated sizes, as follows:

SELECT ?resource ?size WHERE {
  ?resource
    ms:corpusInfo [
      ms:languageInfo [
        dct:language iso639:lit ;
        ms:sizePerLanguage [
          ms:size ?size ;
          ms:sizeUnit ms:words
        ]
      ]
    ] .
}

Unfortunately, the results of this query show that no resource in Linghub is over one billion words in size for Lithuanian.

3. "I am looking for freely available geotagged tweets collection for research purpose." (Md. Hasanuzzaman, Feb. 16th 2015, http://mailman.uib.no/public/corpora/2015-February/022044.html)
Several of the search terms here are unfortunately not found anywhere in our data, namely 'geotagged' and 'tweets'. It would still be possible for this query to be answered by looking at related keywords such as 'Twitter', and other aspects of the query (e.g., 'for research purpose') can be handled by means of the META-SHARE vocabulary.

In summary, we saw that in two of the three cases the user's need could be clearly expressed as a SPARQL query; in one of those cases the query returned an answer as required, while in the second case no suitable dataset is recorded. In the final case, the user's query does not match the structured data found in Linghub, but related resources can be found by using free-text search. As such, we see that Linghub enables users to find their resources better than with previous approaches, although it is still not satisfactory for all user queries. In particular, the crucial defect in the final query is that there is no specific metadata that would indicate whether a resource is from a social media site, and handling this would require a deeper understanding of the textual components of resource descriptions.

6. CONCLUSION
Linghub is a new site that collects data from a large number of sources and makes it queryable through a common mechanism. Furthermore, the data has not only been converted to RDF but has also been homogenized and linked to other bubbles in the Linguistic Linked Open Data Cloud. As such, this resource is likely to play a pivotal role in enabling not only humans but also software agents to find new resources and use them for applications in natural language processing and artificial intelligence.

Acknowledgments
This work has been funded by the LIDER project, funded under the European Commission Seventh Framework Programme (FP7-610782), the Cluster of Excellence Cognitive Interaction Technology 'CITEC' (EXC 277) at Bielefeld University, which is funded by the German Research Foundation (DFG), and the Insight Centre for Data Analytics, which is funded by Science Foundation Ireland under Grant Number SFI/12/RC/2289.

7. REFERENCES
[1] K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. Describing linked datasets with the VoID vocabulary. Technical report, The World Wide Web Consortium, 2011. Interest Group Note.
[2] S. Bird and G. Simons. Extending Dublin Core metadata to support the description and discovery of language resources. Computers and the Humanities, 37(4):375–388, 2003.
[3] D. Broeder, M. Windhouwer, D. Van Uytvanck, T. Goosen, and T. Trippel. CMDI: a component metadata infrastructure. In Describing LRs with metadata: towards flexibility and interoperability in the documentation of LR workshop programme, page 1, 2012.
[4] N. Calzolari, R. Del Gratta, G. Francopoulo, J. Mariani, F. Rubino, I. Russo, and C. Soria. The LRE Map. Harmonising community descriptions of resources. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 1084–1089, 2012.
[5] K. Choukri, V. Arranz, O. Hamon, and J. Park. Using the international standard language resource number: Practical and technical aspects. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 50–54, 2012.
[6] G. de Melo. Lexvo.org: Language-related information for the linguistic linked data cloud. Semantic Web, page 7, 2013.
[7] C. Federmann, I. Giannopoulou, C. Girardi, O. Hamon, D. Mavroeidis, S. Minutoli, and M. Schröder. META-SHARE v2: An open network of repositories for language resources including data and tools. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 3300–3303, 2012.
[8] M. Gavrilidou, P. Labropoulou, E. Desipri, S. Piperidis, H. Papageorgiou, M. Monachini, F. Frontini, T. Declerck, G. Francopoulo, V. Arranz, et al. The META-SHARE metadata schema for the description of language resources. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 1090–1097, 2012.
[9] J. Kunze and T. Baker. The Dublin Core metadata element set. RFC 5013, Internet Engineering Task Force, 2007.
[10] F. Maali, J. Erickson, and P. Archer. Data catalog vocabulary (DCAT). W3C recommendation, The World Wide Web Consortium, 2014.
[11] J. P. McCrae, P. Cimiano, V. Rodríguez-Doncel, D. Vila-Suero, J. Gracia, L. Matteis, R. Navigli, A. Abele, G. Vulcu, and P. Buitelaar. Reconciling heterogeneous descriptions of language resources. In Proceedings of the 4th Workshop on Linked Data in Linguistics, 2015.
[12] J. P. McCrae, P. Labropoulou, J. Gracia, M. Villegas, V. Rodríguez-Doncel, and P. Cimiano. One ontology to bind them all: The META-SHARE OWL ontology for the interoperability of linguistic datasets on the Web. In Proceedings of the 4th Workshop on the Multilingual Semantic Web, 2015.
[13] A. Moro, A. Raganato, and R. Navigli. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics (TACL), 2:231–244, 2014.
[14] S. Piperidis. The META-SHARE language resources sharing infrastructure: Principles, challenges, solutions. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 36–42, 2012.
[15] A. Seaborne, K. G. Clark, L. Feigenbaum, and E. Torres. SPARQL 1.1 query results JSON format. W3C recommendation, The World Wide Web Consortium, 2013.
[16] H. Tohyama, S. Kozawa, K. Uchimoto, S. Matsubara, and H. Isahara. SHACHI: A large scale metadata database of language resources. In Proceedings of the 1st International Conference on Global Interoperability for Language Resources, pages 205–212, 2008.
[17] D. Van Uytvanck, H. Stehouwer, and L. Lampen. Semantic metadata mapping in practice: the Virtual Language Observatory. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 1029–1034, 2012.