ShExML: An heterogeneous data mapping language based on ShEx

Herminio Garcia-Gonzalez¹,², Daniel Fernandez-Alvarez¹, and Jose Emilio Labra-Gayo¹

¹ Department of Computer Science, University of Oviedo, Oviedo, Asturias, Spain
herminiogg@gmail.com, danifdezalvarez@gmail.com, labra@uniovi.es
² Inria Lille Nord Europe, Villeneuve-d’Ascq, France
herminio.garcia-gonzalez@inria.fr

Abstract. Data interoperability is a problem that we now face more intensely due to the emergence of fields like Big Data and IoT. Much data is persisted in information silos with neither interconnection nor format homogenisation. Our proposal to alleviate this problem is ShExML, a language based on ShEx that can map and merge heterogeneous data formats into a single RDF representation. We advocate the creation of this type of tool, which can facilitate the migration of non-semantic data to the Semantic Web.

Keywords: data · interoperability · RDF · ShEx · ShExML

1 Introduction

Mapping and merging heterogeneous data sources is a task that has gained importance in recent years. With the improvement of hardware support, the development of new technological areas, such as Big Data or the Internet of Things (IoT), and the deeper interconnection between heterogeneous devices, a huge amount of data is generated every second. However, this data is created in various formats and persisted using different technologies. Therefore, understanding and exploiting this data is hard work because of the information silos model.

One of the goals of the Semantic Web was the interconnection of data sources and the avoidance of the aforementioned information silos. Many technologies were therefore proposed in pursuit of that objective. However, migrating non-semantic data to the new semantic technologies is a hard task that many individuals and companies cannot take on because of the time and resources it consumes.
Migrating all the databases of a company to their Semantic Web counterparts involves migrating not only the platforms but also the data, with ad hoc solutions developed for every dataset. Therefore, solutions that ease this translation can contribute to the adoption of semantic technologies or, at least, facilitate it.

We propose a language to map and merge heterogeneous data into its Resource Description Framework (RDF) counterpart, while also taking usability and ease of use into account.

2 Related work

Many mapping languages and tools have been proposed to map a non-semantic format to its RDF counterpart. This is the case of XSPARQL [1], which converts from XML to RDF based on XQuery and SPARQL queries; R2RML [2], which allows mappings to be defined from relational databases to RDF graphs; and CSV2RDF [4], which converts from CSV to RDF.

However, none of these works tackles both the mapping and the merging of heterogeneous datasets in the same solution. This is addressed by RML [3], which extends the R2RML language to support formats like JSON, CSV or XML in addition to relational databases. Another alternative is YARRRML [5], a text-based language intended to be easily readable by humans. YARRRML is based on YAML and can be used to represent RML and R2RML rules.

ShExML shares the same goal as RML and YARRRML. However, because it is based on ShEx, the gap between ShExML and ShEx is small, so validation of the generated data can be done faster. Moreover, it is designed to keep the same simplicity and ease of use that ShEx has.

3 ShExML at a glance

ShExML³ is based on ShEx [6], which means that the language constructs of ShExML are similar to those of ShEx. It therefore uses the shape as the main foundation for every transformation.

³ ShExML on GitHub: https://github.com/herminiogg/ShExML

Listing 1.1. ShExML example for films

    PREFIX : <http://example.com/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SOURCE films_xml <https://example.com/films.xml>
    SOURCE films_json <https://example.com/films.json>
    QUERY film_ids_xml
    QUERY film_names_xml
    QUERY film_years_xml
    QUERY film_directors_xml
    QUERY film_ids_json <$.films[*].id>
    QUERY film_names_json <$.films[*].name>
    QUERY film_years_json <$.films[*].year>
    QUERY film_directors_json <$.films[*].director>
    EXPRESSION film_ids <$films_xml.film_ids_xml UNION $films_json.film_ids_json>
    EXPRESSION film_names <$films_xml.film_names_xml UNION $films_json.film_names_json>
    EXPRESSION film_years <$films_xml.film_years_xml UNION $films_json.film_years_json>
    EXPRESSION film_directors <$films_xml.film_directors_xml UNION $films_json.film_directors_json>

    :Films :[film_ids] {
        foaf:name [film_names] ;
        dbo:year dbr:[film_years] ;
        dbo:director [film_directors] ;
    }

A ShExML document can be seen as a set of declarations followed by a set of shapes: the declarations are a collection of variable definitions, and the shapes are the core mechanism for defining and executing the mappings.

The set of declarations contains prefixes, sources, queries and expressions. Prefixes work like Turtle prefixes; sources define the URL at which a file is hosted; queries define reusable queries for the previously defined sources (normally written in a query language, e.g., JSONPath or XPath); and expressions are used to run the queries over a source, make unions among queries and transform them.

Listing 1.2. JSON films file

    {
      "films": [
        {
          "id": 3,
          "name": "Inception",
          "year": "2010",
          "director": "Christopher Nolan"
        },
        {
          "id": 4,
          "name": "The Prestige",
          "year": "2006",
          "director": "Christopher Nolan"
        }
      ]
    }

Listing 1.3. XML films file

    <films>
      <film id="1">
        <name>Dunkirk</name>
        <year>2017</year>
        <director>Christopher Nolan</director>
      </film>
      <film id="2">
        <name>Interstellar</name>
        <year>2014</year>
        <director>Christopher Nolan</director>
      </film>
    </films>

Suppose that we want to transform two lists of films, one in JSON and the other in XML (see Listings 1.2 and 1.3). We define a ShExML document that converts both files to RDF and merges them into a single RDF file (see Listing 1.1). This document has a single shape, :Films, which holds the main conversion for the films. To construct each triple, the :[film_ids] directive defines the name that becomes the subject of every triple generated by this shape. Predicates and objects are then generated for those ids using the expressions enclosed between braces. For example, foaf:name [film_names] generates triples of the form subject foaf:name object. Notice that every expression enclosed between square brackets accepts a prefix, which tells the compiler whether the expression will produce a node or a literal: in Listing 1.1, dbr:[film_years] produces nodes such as dbr:2010, whereas [film_names] produces literals such as "Inception". Moreover, if a query produces a list of results instead of a single one, the ShExML engine performs the mapping taking into account how each result relates to its entity, which makes it possible to merge files containing several entities. Finally, the result of this example is shown in Listing 1.4.

Listing 1.4. Result of mapping with ShExML in Turtle format

    @prefix dbo: <http://dbpedia.org/ontology/> .
    @prefix : <http://example.com/> .
    @prefix dbr: <http://dbpedia.org/resource/> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    :4 dbo:director "Christopher Nolan" ;
       dbo:year dbr:2006 ;
       foaf:name "The Prestige" .

    :3 dbo:director "Christopher Nolan" ;
       dbo:year dbr:2010 ;
       foaf:name "Inception" .

    :2 dbo:director "Christopher Nolan" ;
       dbo:year dbr:2014 ;
       foaf:name "Interstellar" .

    :1 dbo:director "Christopher Nolan" ;
       dbo:year dbr:2017 ;
       foaf:name "Dunkirk" .

4 Conclusions

In this work, we have presented ShExML, a language for mapping and merging heterogeneous data into its RDF counterpart. This tool helps the migration of semi-structured data to a semantic data format, improving its interoperability and searchability. With this solution, integrating data into the Semantic Web becomes an easier task that can be adapted to different scenarios. We plan to include some extra features in future versions, such as the unification of URIs between different representations, the matching between generated URIs and existing ones in the Linked Open Data cloud, and the conversion of streaming sources.

Acknowledgments

This work has been partially funded by the Vicerectorate for Research of the University of Oviedo under the call of "Plan de Apoyo y Promoción de la Investigación" and by the Ministerio de Economía, Industria y Competitividad under the call of "Programa Estatal de I+D+i Orientada a los Retos de la Sociedad" (project TIN2017-88877-R).

References

1. Bischof, S., Decker, S., Krennwallner, T., Lopes, N., Polleres, A.: Mapping between RDF and XML with XSPARQL. Journal on Data Semantics 1(3), 147–185 (2012)
2. Das, S., Sundara, S., Cyganiak, R.: R2RML: RDB to RDF Mapping Language. https://www.w3.org/TR/r2rml/ (2012), W3C Recommendation 27 September 2012
3. Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., Van de Walle, R.: RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In: LDOW. Seoul, Korea (2014)
4. Ermilov, I., Auer, S., Stadler, C.: CSV2RDF: User-driven CSV to RDF mass conversion framework. In: Proceedings of the ISEM. vol. 13, pp. 04–06. Graz, Austria (2013)
5. Heyvaert, P., De Meester, B., Dimou, A., Verborgh, R.: Declarative Rules for Linked Data Generation at your Fingertips! In: Proceedings of the 15th ESWC: Posters and Demos. Heraklion, Greece (2018)
6. Prud’hommeaux, E., Labra Gayo, J.E., Solbrig, H.: Shape Expressions: An RDF Validation and Transformation Language. In: Proceedings of the 10th International Conference on Semantic Systems. pp. 32–40. SEM ’14, ACM, New York, NY, USA (2014)
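
As a rough illustration of the mapping described in Section 3, the following Python sketch mimics, for the two film files, the steps that the ShExML document declares: querying each source, taking the union of the results, and emitting one subject per film id with its three predicates. It is not the ShExML engine and makes no claim about how the engine is implemented. The helper names (query_json, query_xml, to_turtle), the inlined copies of the two files, and the ElementTree lookups standing in for the XML queries are all assumptions made for this example.

```python
import json
import xml.etree.ElementTree as ET

# Inlined copies of the JSON and XML film files (assumption: the real
# sources would be fetched from the URLs given in the SOURCE declarations).
FILMS_JSON = """
{"films": [
  {"id": 3, "name": "Inception", "year": "2010", "director": "Christopher Nolan"},
  {"id": 4, "name": "The Prestige", "year": "2006", "director": "Christopher Nolan"}
]}
"""

FILMS_XML = """
<films>
  <film id="1"><name>Dunkirk</name><year>2017</year><director>Christopher Nolan</director></film>
  <film id="2"><name>Interstellar</name><year>2014</year><director>Christopher Nolan</director></film>
</films>
"""

FIELDS = ("name", "year", "director")


def query_json(text):
    """Hypothetical stand-in for the JSON queries: one record per film."""
    films = json.loads(text)["films"]
    return [
        {"id": str(film["id"]), **{field: str(film[field]) for field in FIELDS}}
        for film in films
    ]


def query_xml(text):
    """Hypothetical stand-in for the XML queries: one record per <film>."""
    root = ET.fromstring(text.strip())
    return [
        {"id": film.get("id"), **{field: film.findtext(field) for field in FIELDS}}
        for film in root.findall("film")
    ]


def to_turtle(records):
    """Emit one Turtle block per record: literals for name and director,
    a dbr:-prefixed node for the year, and the id as the subject."""
    blocks = []
    for rec in records:
        lines = [
            ':{} dbo:director "{}" ;'.format(rec["id"], rec["director"]),
            '    dbo:year dbr:{} ;'.format(rec["year"]),
            '    foaf:name "{}" .'.format(rec["name"]),
        ]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


# The UNION step: results from both sources end up in one collection.
merged = query_xml(FILMS_XML) + query_json(FILMS_JSON)
print(to_turtle(merged))
```

The sketch keeps each film's fields together in one record, which is a simplification: in ShExML the queries are declared per field, and the engine is responsible for relating each result to its entity.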