Mapping Spreadsheets to RDF: Supporting Excel in RML Markus Schröder, Christian Jilek, and Andreas Dengel 1 Smart Data & Knowledge Services Dept., DFKI GmbH, Kaiserslautern, Germany 2 Computer Science Dept., TU Kaiserslautern, Germany {markus.schroeder, christian.jilek, andreas.dengel}@dfki.de Abstract. The RDF Mapping Language (RML) enables, among other formats, the mapping of tabular data as Comma-Separated Values (CSV) files to RDF graphs. Unfortunately, the widely used spreadsheet format is currently neglected by its specification and well-known implementa- tions. Therefore, we extended RML Mapper to support Microsoft Excel spreadsheet files and demonstrate its capabilities in an interactive online demo. Our approach allows to access various meta data of spreadsheet cells in typical RML maps. Some experimental features for more specific use cases are also provided. The implementation code is publicly available in a GitHub fork. Keywords: Spreadsheet · Excel · RML · RDF · Knowledge Graph 1 Introduction As soon as knowledge graphs have to be constructed from given data, the use of mapping languages is a common practice. Such languages let users define declarative rules to map various input data formats to a complex interconnected graph. In the Semantic Web community such graphs are usually modeled with the Resource Description Framework (RDF) [10] and (semi-)structured data is mapped with the RDF Mapping Language (RML) [2] – a superset of the W3C-recommended mapping language R2RML [9]. The specification of RML [1] describes how semi-structured data, like XML or JSON, and tables in form of Comma-Separated Values (CSV) can be mapped properly. However, tabular data can also be found in other file types, prominently in spreadsheets. Since the spreadsheet methodology enables a well understood, easy and fast possibility to enter data, they are widely used by knowledge workers, especially in the industrial sector. In contrast to simply structured CSV files, spreadsheets can model complex workbooks containing multiple sheets with cells having rich meta data. Besides its content, a single cell may store additional information such as its appearance (colors, styles and borders) or cell comments. How these cells should Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Markus Schröder, Christian Jilek, and Andreas Dengel be filled or styled by users is not predetermined by spreadsheet applications in general. As a consequence, in practice, cells can contain inconsistent and unstructured content which can be arbitrarily arranged in a sheet. That is why a mapping of such data to RDF can become a challenging task. Although spreadsheets are frequently used in industry, well-known RML-based mappers such as RML Mapper3 , CARML4 , RocketRML [8] or SDM-RDFizer [4] currently do not support them natively. To fix this issue, in this demo paper we extended RML Mapper to support Microsoft Excel spreadsheet files. We have chosen RML Mapper because its code structure lets us easily integrate our new code. Natively supporting spreadsheets in RML has several advantages. There is no need anymore to preprocess and transform spreadsheets in a format that can be handled by a mapping tool (e.g. CSV). Compared to CSV-based processing tools (like csvtk5 ), we are able to exploit all aspects of spreadsheets in a mapping. This also eliminates extraneous conversion efforts for mapping experts and let them focus on defining proper rules. Having RML rules that directly refer to spreadsheet data eases the communication with data providers about how their spreadsheets will be mapped to a knowledge graph. Additionally, practitioners of RML who already learned the language become able to map spreadsheets with almost no extra effort. In summary, this demo paper presents the following contributions to RML: – an implementation to access Excel spreadsheets via an RML logical source in order to iterate (and also filter) cells in sheets – a mechanism to retrieve cell meta data by relative and absolute distances in an RML reference formulation – two experimental features to map from one cell (a) information of different properties and (b) multiple complex entities – a prototypical implementation of the features in RML Mapper In the next section, we describe in detail how we realized the Excel support in RML. 2 Supporting Excel in RML Our approach is implemented as an extension to the RML Mapper tool in a separate GitHub fork6 . For demonstration purpose, we additionally provide an interactive demo page7 where visitors can try various mapping examples. At its core, our component utilizes the Apache POI8 library to read Microsoft Excel spreadsheets as a first step, but it is conceivable in future to support also 3 https://github.com/RMLio/rmlmapper-java 4 https://github.com/carml/carml 5 https://github.com/shenwei356/csvtk 6 https://github.com/mschroeder-github/rmlmapper-java/tree/mschroeder- features 7 http://www.dfki.uni-kl.de/~mschroeder/demo/excel-rml 8 https://poi.apache.org Mapping Spreadsheets to RDF: Supporting Excel in RML other spreadsheets such as LibreOffice or Google Spreadsheets by using appropri- ate APIs. In order to refer to spreadsheet contents as a logical source, a small spreadsheet ontology9 (usually prefixed with ss) was designed. As demonstrated in Listing 1.1, using a spreadsheet reference formulation (Line 2), a workbook source should name a spreadsheet file (Line 5), a referring sheet by name (Line 6) and a range of cells in the sheet (Line 7). A triples map that uses this logical source will iterate over single cells in the given range. Listing 1.1: Exemplary RML definition of a spreadsheet file as a logical source. 1 [ a rml : L o g i c a l S o u r c e ; 2 rml : r e f e r e n c e F o r m u l a t i o n q l : S p r e a d s h e e t ; 3 rml : s o u r c e [ 4 a s s : Workbook ; 5 s s : u r l ” workbook . x l s x ” ; 6 s s : sheetName ” P a p e r s ” ; 7 s s : r a n g e ”A2 : A5” ; 8 s s : j a v a S c r i p t F i l t e r ”/Know\\w∗ / . t e s t ( v a l u e S t r i n g ) ” # o p t i o n a l 9 ] 10 ] Optionally, a filter can be added that picks certain cells based on a JavaScript program (Line 8). Regarding our example, a regular expression is used to iterate over only those cells which contain the phrase “Know”. Using the expressive script language and variables that represent cell meta data, appropriate filter procedures can be realized. However, it is also conceivable to implement it as an Fn O function instead, which receives cell meta data and returns a boolean value. Listing 1.2: Demonstration of how cell meta data can be accessed in RML maps. 1 r r : subjectMap [ 2 r r : t e m p l a t e ” h t t p : / / example . o r g /{ a d d r e s s }” 3 ] ; 4 r r : predicateObjectMap [ 5 r r : predicateMap [ 6 r r : t e m p l a t e ” h t t p : / / example . o r g / { [ 2 , 0 ] . v a l u e S t r i n g }” 7 ] ; 8 r r : objectMap [ 9 rml : r e f e r e n c e ” ( 2 , 0 ) . v a l u e N u m e r i c ” 10 ] 11 ] In various RML maps, meta data of the currently iterated cell can be retrieved, as demonstrated in Listing 1.2. For example in Line 2, a template expression inserts the cell’s address (like “A2”) to form a unique URI for a subject resource. However, often it is required to refer to nearby cells in order to construct appropriate statements. To refer to other cells relatively from the current cell, we introduce a parenthesis notation, like (column,row). A column shift (x-axis of the sheet) and a row shift (y-axis of the sheet) allows to reach any cell relative to the cell representing the subject resource. Regarding Line 9 of our example, numeric values which are located two columns away on the right are used as objects in the mapped statements. In a similar way, cells can be referenced absolutely by using square brackets (Line 6). We use relative and absolute distances, because 9 http://www.dfki.uni-kl.de/~mschroeder/ld/ss Markus Schröder, Christian Jilek, and Andreas Dengel spreadsheets in general can contain arbitrarily structured tables without any anchor points like column names (in contrast to CSVs). As already mentioned, spreadsheet cells have several meta data values that need to be retrievable in a mapping through variables (written in the following with monospaced font). A cell’s location in a sheet can be accessed either as a usual spreadsheet address or as column and row indices. Since a cell (if not empty) stores either a string value or a numeric value, various possibilities are given to access its content: valueNumeric retrieves a floating point value (valueInt an integer value), valueBoolean a boolean value, valueFormula if a cell contains a formula (valueError to get its possible error code) and valueString retrieves its text content. Besides content, one could also be interested in the appearance of a cell in a sheet: since it can be colored, backgroundColor and foregroundColor can be queried as a hexadecimal RGB value. Regarding the used font, fontColor, fontName and fontSize are available. On a more fine-grained level, cells can also store formatted text which is returned by valueRichText. The formatted text is represented in an HTML syntax to enable usual XML-parsers to access tags and attributes, for example, “ăbąăiąăfont face=’Arial’ color=’#ff0000’ąred, italic and boldă/fontąă/iąă/bą”. If spreadsheets were completed in an inconsistent manner, it happens that cells were unintentionally filled with different data types. For such cases, one can use the value variable to always obtain a string representation of a cell’s content regardless of its cell type. However, if this is not sufficient because more details are needed, the json variable could be used to retrieve a JavaScript Object Notation (JSON) representation of a cell. The JSON object contains, besides the cell type, various data types mentioned above. In our demo implementation, we do not retrieve all aspects of a spreadsheet (e.g. border thickness), yet they could easily be added to our code if desired. 2.1 Experimental Features We also would like to propose some experimental features that we found useful in our use cases. However, the official introduction of these extensions would require to change the RML specification on some points. Multiple Different Properties in a Cell. In our use cases, we frequently experienced that users record multiple information in one cell. Each piece of information ij P I then corresponded to a different property pj P P . Following the current RML specification, one has to define for each pj P P a separate predicate- object map. Instead, we propose a shortcut such that only one predicate-object map is needed. This is done by allowing that an RDF list of properties pj P P can be passed in a predicate map. By implementing a suitable Function Ontology (Fn O) [6] procedure, we extract information pieces ij P I from a cell’s content and return them as an object list oj P O. Usually, RML provides that a Cartesian product is formed between predicates and objects which is P ˆ O. However, in our case we need to zip predicate and object lists such that the following set is made: tppj , oj qu. Thus, instead of all possible pairs of the P ˆ O matrix, Mapping Spreadsheets to RDF: Supporting Excel in RML only the diagonal ones are selected. This new behavior is activated by adding a ss:zip true statement to a predicate-object map. Multiple Complex Entities in a Cell. Similarly to the previous observation, we often discovered that users mention several entities in a single cell. A prominent example is a list of persons having first and last names, for instance book authors. Represented as an RDF graph, such complex entities potentially require several statements to be fully expressed. An Fn O function that is able to perform entity extraction would need to return an arbitrary large RDF graph in a certain serialization format (e.g. Turtle) instead of a single value. In order to integrate such a return value in an object map, we define a new term type ss:Graph. Once this term type is chosen, the returning RDF graph is parsed and added to the emerging knowledge graph. To prevent that all returning resources are mapped in the object map, a special ss:SelectedObjects resource together with rr:object statements should be returned by the function, too. This way, all resources are added to the knowledge graph, yet only selected ones are mapped. 3 Related Work In the past, several similar approaches were implemented that map spreadsheets to RDF using a language. Domain specific languages (DSL) other than RML are provided to let knowledge engineers express how input data shall be mapped to a graph structure. Spread2RDF10 uses a Ruby-internal language while M 2 (Mapping Master) [7] builds upon a compact syntax for OWL ontologies (Manchester syntax). Sheet2RDF [3] uses a special ProjEction of Annotations Rule Language (PEARL), whereas XLWrap [5] utilizes template graphs together with a special expression language to refer to contents of sheets. A different approach is followed by TabLinker11 which requires that input spreadsheets are annotated in advance with certain styles. With our implementation, we combine the feature of mapping spreadsheets with the advantages of the well specified RML approach. Those who already used the language and need to process spreadsheets do not have to look for an alternative anymore. 4 Conclusion and Outlook In this paper we discussed that spreadsheets, despite being widely used, are still not supported by the RDF Mapping Language (RML). We therefore proposed a first solution which extends the already existing RML Mapper tool with necessary components. Additionally, experimental features were introduced too. A publicly available web page12 demonstrates the new capabilities. 10 https://github.com/marcelotto/spread2rdf 11 https://github.com/Data2Semantics/TabLinker 12 http://www.dfki.uni-kl.de/~mschroeder/demo/excel-rml Markus Schröder, Christian Jilek, and Andreas Dengel In the future, we plan to integrate our proposed solution in the RML specifica- tion [1] so that other RML tools may support spreadsheets too. Further, we intend to run several performance tests (e.g. memory consumption) and compliance tests with our current reference implementation. This could be included in the R2RML implementation report13 . Acknowledgements This work was funded by the BMBF project SensAI (grant no. 01IW20007). References 1. Dimou, A., Sande, M.V.: RDF Mapping Language (RML). https://rml.io/specs/ rml/ (2020), accessed: 2021-02-01 2. Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., Van de Walle, R.: RML: a generic language for integrated RDF mappings of heterogeneous data. In: Proceedings of the 7th Workshop on Linked Data on the Web. CEUR Workshop Proceedings, vol. 1184 (Apr 2014) 3. Fiorelli, M., Lorenzetti, T., Pazienza, M.T., Stellato, A., Turbati, A.: Sheet2rdf: a flexible and dynamic spreadsheet import&lifting framework for RDF. In: Current Approaches in Applied Artificial Intelligence - 28th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2015, Seoul, South Korea, June 10-12, 2015, Proceedings. Lecture Notes in Computer Science, vol. 9101, pp. 131–140. Springer (2015) 4. Iglesias, E., Jozashoori, S., Chaves-Fraga, D., Collarana, D., Vidal, M.: Sdm- rdfizer: An RML interpreter for the efficient creation of RDF knowledge graphs. In: CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Ireland, October 19-23, 2020. pp. 3039–3046. ACM (2020) 5. Langegger, A., Wöß, W.: Xlwrap - querying and integrating arbitrary spreadsheets with SPARQL. In: The Semantic Web - ISWC 2009, 8th International Semantic Web Conference, ISWC 2009, Chantilly, VA, USA, October 25-29, 2009. Proceedings. Lecture Notes in Computer Science, vol. 5823, pp. 359–374. Springer (2009) 6. Meester, B.D., Dimou, A., Verborgh, R., Mannens, E.: An ontology to semantically declare and describe functions. In: The Semantic Web - ESWC 2016 Satellite Events, Heraklion, Crete, Greece, May 29 - June 2, 2016, Revised Selected Papers. Lecture Notes in Computer Science, vol. 9989, pp. 46–49 (2016) 7. O’Connor, M.J., Halaschek-Wiener, C., Musen, M.A.: M2: A language for map- ping spreadsheets to OWL. OWL: Experiences and Directions (OWLED), Sixth International Workshop (2010) 8. Simsek, U., Kärle, E., Fensel, D.: Rocketrml - A nodejs implementation of a use case specific RML mapper. In: Joint Proceedings of the 1st International Workshop on Knowledge Graph Building and 1st International Workshop on Large Scale RDF Analytics (ESWC 2019), Portorož, Slovenia, June 3, 2019. CEUR Workshop Proceedings, vol. 2489, pp. 46–53. CEUR-WS.org (2019) 9. World Wide Web Consortium: R2RML: RDB to RDF mapping language. https: //www.w3.org/TR/r2rml/ (2012), accessed: 2021-02-01 10. World Wide Web Consortium: RDF 1.1 Primer. https://www.w3.org/TR/rdf11- primer/ (2014), accessed: 2021-02-01 13 https://github.com/kg-construct/r2rml-implementation-report