Linked Edit Rules: A Web Friendly Way of Checking Quality of RDF Data Cubes

Albert Meroño-Peñuela (1,2), Christophe Guéret (2), and Stefan Schlobach (1)

(1) Department of Computer Science, VU University Amsterdam, NL
    albert.merono@vu.nl
(2) Data Archiving and Networked Services, KNAW, NL

Abstract. Statistical data often come with inconsistencies: records of people with negative ages, pregnant males, and underage car drivers often populate statistical databases. National Statistical Offices (NSOs) encode knowledge to detect these inconsistencies in so-called edit rules. These days, an increasing amount of statistical data is being published and linked on the Web using the RDF Data Cube vocabulary. However, edit rules are hardly ever published together with data cubes on the Web. This causes two important problems: (a) the quality of RDF Data Cube data cannot be assessed; and (b) the reusability of edit rules is hampered. In this paper we present Linked Edit Rules (LER), a method that makes edit rules Web friendly and reusable as Linked Data. We show that LER can be easily linked, retrieved, reused, combined and executed to check the quality and consistency of RDF Data Cubes, opening up the internal NSO validation processes to the Web.

Keywords: RDF Data Cube, Edit rules, Statistical consistency

1 Introduction

More and more statistical datasets are being published on the Web using the RDF Data Cube vocabulary (QB) [12,5], the W3C recommendation for publishing and linking multidimensional data, such as statistics, on the Web. Like much Web data, these data cubes often come with inconsistencies that data consumers need to identify and fix by themselves before running their workflows. In statistics, data mining and data management this process is called data cleansing.

Data cleansing is an arduous and expensive task, mainly due to the heterogeneous nature of these errors and inconsistencies. National Statistical Offices (NSOs) set up procedures to validate data cubes before releasing them into the public domain. One such validation consists of automatically identifying so-called obvious inconsistencies. An obvious inconsistency occurs when the cube contains a value or combination of values that cannot correspond to a real-world situation. For example, a person's age cannot be negative, a man cannot be pregnant and an underage person cannot possess a driving license. In order to validate data cubes against obvious inconsistencies, statisticians express this knowledge as rules, known in the data editing literature as edit rules or edits.

Edit rules are used to automatically detect inconsistent data points in statistical datasets, and can be divided into micro-edits and macro-edits. Micro-edits check the obvious consistency of a single data record (or individual, row, observation). E.g., in a record of a demography dataset, the field age group cannot be "children" if the field age is 31. Macro-edits check the obvious consistency of an entire field (or variable, column, dimension). E.g., population heights should be normally distributed; therefore, the field height is expected to follow a normal distribution. Current implementations only support the specification of micro-edits.

Edit rules currently exist only within NSOs' closed validation systems, hard-coded into source code, or serialized in a variety of models, syntaxes and formats. In addition, edit rules are rarely published on the Web together with the datasets to which they apply.
If published at all, edit rules are only available as 1-star Linked Data, meaning that they are merely available on the Web, but with poor structure, in non-standard formats, in a non-uniquely identifiable way, and not linked with any other Web resource. These non-Web friendly practices hamper the use of these rules to validate statistical data. Consequently, edit rules cannot be easily located, retrieved, shared, reused or combined on the Web to validate RDF Data Cubes. Statisticians need to constantly reimplement them offline.

The contribution of this paper is to overcome these pitfalls by applying the principles of Linked Data to edit rules. Concretely:

– We survey existing work on Semantic Web languages for constraint checking.
– We provide a framework, a data model and an implementation to express micro- and macro-edits on the Web as Linked Edit Rules (LER). The resulting LER are linked to RDF Data Cubes via QB dimensions, and check their per-record and inter-record consistency.
– We build an automatic consistency checker using LER, QBsistent (see http://www.linkededitrules.org/), which accurately finds all expected inconsistencies of various RDF Data Cubes on the Web and generates accurate provenance reports using PROV.

The rest of the paper is organised as follows. In Section 2 we define our problem and list our requirements, linking them to existing work in Section 3. In Section 4 we describe our Linked Edit Rules approach, implementing it in Section 5. In Section 6 we evaluate Linked Edit Rules, and we conclude in Section 7.

2 Background and Problem Definition

Micro-edits are constraints that must be met at the record level, i.e., by each individual row independently of the others. For instance, in statistical frameworks like R we can write micro-edits (see Listing 1.1) to check the obvious consistency of individual records (e.g., those shown in Table 1). Typically, variable names in the edits of Listing 1.1, like age and height, must match the column names of the data to be checked (see Table 1). The edit rules may constrain the values of a variable and the dependency between its values and the values of other variables. Therefore, in general, micro-edits can decide if an individual record is consistent or not by considering just one record at a time.

  dat1 : ageGroup %in% c('adult', 'child', 'elderly')
  dat7 : maritalStatus %in% c('married', 'single', 'widowed')
  num1 : 0 <= age
  num2 : 0 < height
  num3 : age <= 150
  num4 : yearsMarried < age
  cat5 : if(ageGroup == 'child') maritalStatus != 'married'
  mix6 : if(age < yearsMarried + 17) !(maritalStatus %in% c('married', 'widowed'))
  mix7 : if(ageGroup %in% c('adult', 'elderly')) age >= 18
  mix8 : if(ageGroup %in% c('child', 'elderly') & 18 <= age) age >= 65
  mix9 : if(ageGroup %in% c('adult', 'child')) 65 > age

Listing 1.1: Examples of micro-edits in the R editrules package.

In the examples in Table 1 and Listing 1.1, record #2 is inconsistent with edits cat5 and mix6; record #3 with edits num4 and mix6; record #4 with edit num3; and record #5 with edits num2, cat5 and mix8.

       age  ageGroup  height  maritalStatus  yearsMarried
  #1    21  adult        6.0  single                   -1
  #2     2  child        3.0  married                   0
  #3    18  adult        5.7  married                  20
  #4   221  elderly      5.0  widowed                   2
  #5    34  child       -7.0  married                   3

Table 1: Example dataset to be validated against obvious inconsistencies.

On the other hand, macro-edits are rules aimed at discovering inconsistencies through the analysis of multiple records. These rules can have a very different nature.
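(As an aside, the record-at-a-time micro-edit check above can be reproduced locally with the R editrules package named in the caption of Listing 1.1. The following is a minimal, illustrative sketch, assuming the package is installed; it re-enters the data of Table 1 by hand, uses a subset of the rules of Listing 1.1, and lets the package assign edit names automatically.)

  library(editrules)

  # Data of Table 1, re-entered by hand for this sketch
  dat <- data.frame(
    age           = c(21, 2, 18, 221, 34),
    ageGroup      = c("adult", "child", "adult", "elderly", "child"),
    height        = c(6.0, 3, 5.7, 5, -7),
    maritalStatus = c("single", "married", "married", "widowed", "married"),
    yearsMarried  = c(-1, 0, 20, 2, 3)
  )

  # A subset of the micro-edits of Listing 1.1
  E <- editset(c(
    "ageGroup %in% c('adult', 'child', 'elderly')",
    "maritalStatus %in% c('married', 'single', 'widowed')",
    "0 < height",
    "age <= 150",
    "yearsMarried < age",
    "if (ageGroup == 'child') maritalStatus != 'married'"
  ))

  # Logical matrix: which record violates which edit
  violatedEdits(E, dat)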
Among such methods, the aggregation method [6] consists of checking whether aggregates on the data (e.g., the total population count of a country) match their logical decomposition into individual records (e.g., adding up the population totals of every individual municipality). The distribution method [1] asserts knowledge about the distribution of a certain variable (e.g., "population counts must be log-normally distributed") and identifies all individual records that do not fit that distribution. Macro-edits can only decide if an individual record is consistent or not by performing a double pass on the dataset. Hence, it is clear that languages tailored to micro-edits (see Listing 1.1) are insufficient to express the semantics of macro-edits. We investigate whether Semantic Web rule languages are adequate for representing micro- and macro-edits.

Edit rules are not currently published on the Web in a structured way. They are neither linked to the cubes and observations they intend to validate, nor to the dimensions they constrain. Users cannot uniquely reference, retrieve or combine them to validate RDF Data Cubes. Validation workflows are therefore kept closed within NSOs' systems or private user frameworks, limiting their openness, referenceability, reusability, linkage and exchange. To overcome these pitfalls, we define a set of requirements:

– Expressing edit rules in a Web friendly, accessible and standard way:
  • [WebDistrib] Edit rules should be published on the Web in a distributed way, separately from (but linked to) the statistical data they apply to.
  • [WebStructured] Edit rules need to be expressed in a Web standard structured data format to ensure their machine-readability.
  • [UniqID] Edit rules, and their components, need to be uniquely identified in order to be unambiguously referenceable.
– Edit rules as Semantic Web rules enriched with metadata:
  • [WebRules] Edit rules should be expressed in currently available and adequate formal Semantic Web rule languages.
  • [ConstrainedDims] Edit rules need to be explicit about, and uniquely reference, the statistical dimensions they constrain.
  • [Scope] Edit rules need to be explicit about their scope and aggregation level (micro/macro).
– Checking obvious consistency combining rules and reasoning:
  • [Reasoning] Edit rule checking should be transparently integrated into standard reasoning and established consistency checking mechanisms in currently available triplestores and inference engines, in order to preserve the interoperability of current infrastructure.
  • [CubeCheck] Users need to be able to express which edit rules they want to be checked against which data cubes.
– Creating interoperable consistency reports using Web standards:
  • [Prov] Provenance of the execution of edit rules should be explicit, traceable, and accurately represent the consistency check workflow.
  • [Annotation] Edit rules must annotate inconsistent data points for posterior expert correction, preserving the raw data. Statisticians are interested in concrete inconsistent data points rather than a consistent/inconsistent response over the entire dataset.

Consequently, our problem definition is to identify obvious inconsistencies in arbitrary RDF Data Cubes by executing (micro and macro) edit rules following Linked Data principles in an open Web environment.

3 Related Work

Data editing is "the activity aimed at detecting and correcting errors (logical inconsistencies) in data" [18].
Traditionally, edits have been considered at two levels: micro-edits, aimed at finding inconsistencies in individual records; and macro-edits, "based upon analysis of an aggregate rather than an individual record" [4]. More recently, the automatic processing of these edits has gained importance [10]. For example, the R packages editrules (validation through micro-edits) and validate (validation through cross-record and cross-dataset macro-edits) are useful tools to automatically validate locally stored statistical datasets. However, these rule formats are neither Web standards ([WebStructured]) nor suitable for efficient Web distribution ([WebDistrib]). Eurostat proposes the eDAMIS Validation Engine (EVE) to validate statistical datasets by defining intra-record (micro) and inter-record (macro) rules, but these are kept internal to the system. Finally, the SDMX initiative proposes the Validation and Transformation Language (VTL). However, VTL (1) has a strong emphasis on transformation instead of validation ([CubeCheck]); and (2) defines no mechanism to publish or link edit rules on the Web ([WebDistrib], [WebStructured], [UniqID]). No existing approach explicitly tackles requirement [Scope], nor generates interoperable validation reporting ([Prov], [Annotation]).

The related work on the Semantic Web can be divided into (a) rule languages in which edit rules could be expressed ([WebRules]); and (b) initiatives and systems for customized validation of linked and semantic data ([Reasoning]). On rule languages, OWL 2 [17] allows the definition of Description Logics (DL) safe rules. The Semantic Web Rule Language [9] (SWRL) extends DL rules by allowing Horn clauses, although limiting the creation of new individuals in the ABox to avoid infinite rule loops. SPARQL [7] can also be used to express rules as CONSTRUCT queries, although these are lost after execution ([UniqID]). The Rule Interchange Format Basic Logic Dialect [2] (RIF-BLD) partially overcomes this by allowing the expression of rules in RDF, and in particular the serialization of SPARQL as RDF. The SPARQL Inference Notation [11] (SPIN) also uses SPARQL to express and store a variety of business rules [13]. Shape Expressions [14] associate RDF graphs with labeled patterns called shapes, and can be used for validation, documentation and transformation of RDF data. Similarly, Resource Shapes [15] specify the properties that are allowed or required in a resource, their value types, and cardinality.

With respect to validation systems, OWLIM Profiles allow the customization of rules for RDF data validation, although these custom rules must be hard-coded, hampering their reuse ([WebStructured], [UniqID]). TopBraid [16] allows the customization of business rules via SPIN. Finally, Stardog [3] follows a polyglot approach: users can write SPARQL queries, OWL axioms, or SWRL rules for integrity constraint validation against RDF data. The RDF Data Cube vocabulary [5] (QB) defines 22 integrity constraints that Web-linked cubes must meet, implemented as SPARQL ASK queries, but these are meant to preserve the constraints of QB itself and are not useful for modelling domain-dependent rules.

4 Approach

4.1 Linked Edit Rules and RDF Data Cube

Meeting requirements [WebStructured] and [UniqID] of Section 2 is straightforward when applying Linked Data principles.
Concretely, we propose (1) to represent edit rules in RDF, in order to publish and link edit rules using a Web standard structured-data format; and (2) to use URIs to denote edit rules and their components in RDF, so as to uniquely identify and de-reference them.

To meet requirements [ConstrainedDims] and [Scope], we analyse the RDF Data Cube (QB) data model and propose the minimal set of LER extensions shown in Figure 1. The LER vocabulary can be found at http://bit.ly/linked-edit-rules#.

Fig. 1: RDF Data Cube data model (straight lines) and our proposed LER extensions (dashed lines) to define Linked Edit Rules.

The class ler:EditRule represents an edit rule and can be subclassed by any rule model. A ler:EditRule has three fundamental components: a ler:body, a ler:scope, and one or many ler:components. The ler:body represents the rule itself, encoded according to a specific rule language. The ler:scope represents the scope of the edit, and we use it to distinguish micro-edits and macro-edits. Since micro-edits are defined at the individual record level, and individual records are represented as qb:Observations in QB (see Figure 1), we define micro-edits in our model as any ler:EditRule with qb:Observation as ler:scope. Likewise, QB allows the representation of groups of observations (records) according to some criteria as slices (qb:Slice), as well as the complete set of observations of the cube (qb:DataSet) (see Figure 1). We define macro-edits as any ler:EditRule with qb:Slice, qb:ObservationGroup or qb:DataSet as ler:scope. Finally, we link the ler:EditRule to all the statistical variables it constrains through ler:component. In QB, statistical variables are represented using the subclasses of qb:ComponentProperty (typically qb:DimensionProperty).

4.2 From edit rules to Linked Edit Rules

In order to meet requirement [WebRules], we study the expressiveness of micro-edits and macro-edits.

Body of Linked Micro-Edits. Definite Horn clauses are clauses (disjunctions of literals) with exactly one unnegated literal, ¬p ∨ ¬q ∨ ... ∨ ¬t ∨ u, usually written in implication form as p ∧ q ∧ ... ∧ t → u. This includes the case of no negative literals, also called facts (e.g., u). Definite Horn clauses are known to be tractable in Semantic Web rule languages. Micro-edits (see Listing 1.1) fit definite Horn clauses if we consider each literal p, q, ..., u as an inequality. An inequality is an expression that involves variables, numeric constants and algebraic operators, and has exactly one of the symbols [>, <, ≤, ≥, =, ≠]. For example, (age = currentYear − bornYear) and (height ≥ 11.5) are inequalities. Micro-edits num1 to num4 in Listing 1.1 are facts, composed of one inequality (literal). Rules cat5 and mix7 to mix9 do have negative literals and consequently a rule body. We can express the special construct v %in% (l1, ..., ln) as the clause v = l1 ∨ ... ∨ v = ln. This makes rule mix6 problematic because of a positive conjunction of literals in the rule's head. Since variables in the rule's body are stable during execution, we solve this using normalization, splitting mix6 into several rules with the same body as mix6 and only one of the positive literals in the head.

To express micro-rules as Linked Edit Rules, we convert all variables and values of Listing 1.1 to RDF URIs and literals. We substitute each variable name by the URI of a qb:DimensionProperty (e.g., replacing age by sdmx-dimension:age).
We substitute string literals by their equivalents in known skos:ConceptSchemes (e.g., married and widowed become sdmx-code:status-M and sdmx-code:status-W). Finally, we replace all numeric values by RDF literals.

Extensions for Macro-Edits. Macro-edits need specific statistical terms to be used in the rules. For instance, X ∼ N(µ, σ²) (where X represents the heights in the cube of Table 1) is a valid macro-rule stating that this dimension must follow a normal distribution with mean µ and variance σ². To encode such constraints, we extend the above rules for micro-edits to macro-edits by adding the following function as a valid member of literals (i.e., allowing it in inequalities): statistic(t, P, S, c), where:

– t is a test statistic to assess the constraint (e.g., the z.test normality test)
– P = {p1, ..., pn} are the parameters of t (e.g., mean µ and variance σ²)
– S = {s1, ..., sm} are the sets of observations that must adhere to the rule (e.g., all observations of the cube; two particular slices)
– c is the constrained dimension of those observations (e.g., eg:height)

We use the function statistic to describe statistical properties of observation groups in macro-edits, and we embed this function in micro-edit-like clauses (as described above) to express macro-rules as Linked Edit Rules. We use the output of this function (often a p-value) to express the meaning of the macro-edit. For example, we rewrite the macro-edit that checks whether the heights of all persons in Table 1 follow a normal distribution as follows, assuming eg:all to be the URI of all cube observations, and the normal distribution of heights as the null hypothesis:

statistic(z.test, {µ, σ²}, eg:all, eg:height) > 0.05

4.3 LER Architecture

Edit rules and data cubes may be published in different locations on the Web, but both types of resources need to be combined to validate cubes against obvious inconsistency. To achieve this, we propose a two-component architecture: (a) a node with an arbitrary RDF Data Cube; and (b) a node with Linked Edit Rules. The workflow is:

1. The user sends a query Q to (a), asking which data cube entity e ∈ E of (a) (instances of qb:Observation, qb:Slice or qb:DataSet) is inconsistent with which rule r ∈ R (instances of ler:EditRule) (see Section 4.1).
2. The cube triplestore (a) retrieves the rules R from (b) as indicated in Q. If the rules R are distributed among more nodes, (a) proceeds iteratively.
3. The cube triplestore (a) updates its inference engine with R.
4. The cube triplestore (a) performs custom inference as specified by R to decide which e ∈ E is inconsistent with which r ∈ R.
5. The cube triplestore (a) returns to the user: (1) the data cube entities e ∈ E annotated with the r ∈ R they are inconsistent with; and (2) the provenance graph of the execution.

5 Implementation

We implement the Linked Edit Rules (LER) method described in Section 4 using Stardog, relying on its custom reasoning capabilities to check the obvious consistency of Web RDF Data Cubes. Stardog allows this customization by supporting rules in multiple formats and models, including SPARQL, OWL axioms, and SWRL. We implement Linked Edit Rules (ler:EditRule) with micro-edits as Stardog Rules (rule:SPARQLRule). In Stardog, rules are defined using SPARQL Basic Graph Patterns (BGPs), plus the IF and THEN decorators, denoting the body and the head of a rule. Listing 1.2 shows how rule mix6 of Listing 1.1 translates into a Stardog Rule (other rules are published at http://www.linkededitrules.org/).
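Since the macro-edit in Listing 1.2 below delegates its statistical test to R, it may help to sketch what the R side of the statistic function of Section 4.2 conceptually amounts to: run a test over the values of one dimension for a set of observations and return a p-value. The following is a minimal, illustrative approximation only; the names and interface are ours, we use R's built-in shapiro.test for normality instead of the z.test mentioned above, and the actual stardog-r wrapper may differ.

  # Illustrative approximation of statistic(t, P, S, c) from Section 4.2:
  # run test 't' with parameters 'P' on dimension 'c' of observation set 'S'
  # and return the resulting p-value (assumed interface, not the real wrapper).
  statistic <- function(test, params, observations, dimension) {
    values <- observations[[dimension]]              # values of the constrained dimension
    result <- do.call(test, c(list(values), params))
    result$p.value
  }

  # Example: are the heights of Table 1 plausibly normally distributed?
  all_obs <- data.frame(height = c(6.0, 3, 5.7, 5, -7))
  statistic("shapiro.test", list(), all_obs, "height") > 0.05  # macro-edit holds if TRUE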
To match the data to be validated in the RDF Data Cube, the triple patterns in the IF clause must select the appropriate rule scope and components. To obtain commonly used URIs for such components we use LSD Dimensions (http://lsd-dimensions.org/) [12]. When the knowledge base is queried in SL reasoning mode and the IF clause holds, the THEN clause is executed and violation triples are inferred. Finally, we include ler:scope and ler:component triples to fit the LER extensions (see Section 4.1). Other metadata describe the original rule form, its author and the creation date.

We also implement macro-edits as Linked Edit Rules, preserving the rule format of micro-edits. Listing 1.2 shows a macro-edit that checks whether the heights of adults and non-adults in Table 1 follow different statistical distributions. To implement the statistic function of Section 4.2 and make statistical tests available in Stardog Rules, we develop a Stardog extension (https://github.com/albertmeronyo/stardog-r) that wraps R function calls using SPARQL Extensible Value Testing (EVT). All Linked Edit Rules of this paper (e.g., the translations of the edit rules in Listing 1.1) are published on the Web as Linked Data at http://www.linkededitrules.org/.

To implement the LER Architecture of Section 4.3, we read the data cubes to be validated, retrieve their associated LER via SPARQL, and add these LER to Stardog's rule base; the rules are then triggered whenever Stardog is queried. Stardog's query rewriting and rule normalization mechanisms split rules when multiple triples are found in the THEN clause. This hampers the generation of provenance and annotation graphs, since new instances (e.g., of class prov:Activity) cannot reference the specific data points and rules used for consistency checking. To solve this, we use rules that create only one ler:inconsistentWith statement, as shown in Listing 1.2, and we trigger these rules using a SPARQL INSERT query, as shown at the bottom of Listing 1.2. Using this query, we materialize provenance and annotation reporting using PROV and the Open Annotation Data Model (OA), which avoids rewriting the BGP for provenance and annotation generation. Finally, to generate the output we use a SPARQL CONSTRUCT query that creates user reports with provenance and annotation graphs describing all inconsistencies found (see Figure 2). We bundle the entire consistency checking procedure in QBsistent (see http://www.linkededitrules.org/), a customizable LER-QB consistency checking tool.

# Micro-edit
leri:mix6 a rule:SPARQLRule, ler:EditRule ;
  rule:content """ # PREFIX definitions
    IF {
      ?obs a qb:Observation .
      ?obs sdmx-dimension:civilStatus ?civilStatus .
      ?obs eg:yearsMarried ?yearsMarried .
      ?obs sdmx-dimension:age ?age .
      FILTER ((?age < ?yearsMarried + 17) &&
              (?civilStatus = sdmx-code:status-M || ?civilStatus = sdmx-code:status-W))
    } THEN {
      ?obs ler:inconsistentWith leri:mix6 .
    } """ ;
  ler:scope qb:Observation ;
  ler:component sdmx-dimension:age, sdmx-dimension:civilStatus, eg:yearsMarried ;
  rdfs:label "if(age < yearsMarried + 17) !(maritalStatus %in% c('married', 'widowed'))" ;
  rdfs:comment "An underage person cannot be married or widowed" ;
  dc:creator ;
  dc:date "2015-01-08T16:03:40+01:00"^^xsd:dateTime .

# Macro-edit
leri:macro1 a rule:SPARQLRule, ler:EditRule ;
  rule:content """ # PREFIX definitions
    IF {
      ?x a qb:Slice .
      ?x qb:sliceStructure eg:sliceByAdults .
      ?y a qb:Slice .
      ?y qb:sliceStructure eg:sliceByNonAdults .
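      # The stardog:R call below is the custom SPARQL function contributed by the
      # stardog-r extension (an assumption based on Section 4.2: it runs the named R
      # test, here wilcox.test, on the eg:height values of slices ?x and ?y and
      # returns its p-value). The rule therefore fires, marking both slices as
      # inconsistent, when the two height distributions differ significantly.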
      FILTER (stardog:R('wilcox.test', ?x, ?y, eg:height) <= 0.05)
    } THEN {
      ?x ler:inconsistentWith leri:macro1 .
      ?y ler:inconsistentWith leri:macro1 .
    } """ ;
  ler:scope qb:Slice ;
  ler:component eg:height ;
  rdfs:label "dist(X) != dist(Y), X heights of adults, Y heights of non-adults" ;
  rdfs:comment "Heights of adults and non-adults follow different distribs." ;
  dc:creator ;
  dc:date "2015-01-08T16:03:40+01:00"^^xsd:dateTime .

# Generating PROV and OA
INSERT {
  ?act a prov:Activity ;
    rdfs:label "Consistency check" ;
    prov:wasAssociatedWith ;
    prov:startedAtTime ?now ;
    prov:used ?dp ;
    prov:used ?rule .
  ?ann a oa:Annotation ;
    rdfs:label "Inconsistency annotation" ;
    prov:wasGeneratedBy ?act ;
    prov:generatedAtTime ?now ;
    oa:hasBody ?body ;
    oa:hasTarget ?dp .
  ?body a rdfs:Resource ;
    ler:inconsistentWith ?rule .
} WHERE {
  ?dp ler:inconsistentWith ?rule .
  BIND (UUID() AS ?act)
  BIND (UUID() AS ?ann)
  BIND (UUID() AS ?body)
  BIND (NOW() AS ?now)
}

Listing 1.2: Linked Edit Rules as Stardog Rules. The predicate rule:content describes the content of the rule, while the other triples describe the rule metadata. For macro-edits, statistical tests are accessed via SPARQL custom functions through an R wrapper we implement as a Stardog extension. The bottom INSERT query triggers Linked Micro- and Macro-Edit Rules in Stardog, generating PROV and OA for each inferred inconsistency.

6 Evaluation

We validate our implementation by studying: (1) the semantic equivalence of edit rules and Linked Edit Rules; (2) their precision and recall at identifying inconsistencies; and (3) insights provided by the implementation of edit rules as Linked Data. We use toy and synthetic datasets, and the QB representations of the Dutch historical censuses (http://lod.cedar-project.nl/cedar/) and the 2010/2011 French and Australian censuses (http://www.datalift.org/en/event/semstats2013/challenge). Full details on the results are available at http://www.linkededitrules.org/.

To assess a correct translation between the original edit rules and Linked Edit Rules and show their equivalence (1), we provide translation mappings between edit rules and SPARQL in Table 2a. We stress-test our Linked Edit Rules implementation on correctly identifying inconsistencies (2) in two different datasets: (a) the toy data in Table 1; and (b) a synthetic dataset with artificial data. In both cases we know beforehand where the inconsistencies are: they are trivial in (a) (see Section 2), and we introduce them intentionally in (b). In both cases, our implementation correctly identifies all expected inconsistencies, as shown in Table 2b (p = r = 1.0). Interestingly, the consistency check process takes only a small fraction (0.5%) of the total runtime; the remainder is mostly overhead due to initialization and data fetching.

We also showcase interesting Linked Data features of LER. We run our approach on a real-world dataset, the Dutch historical censuses. Our goal is to identify potential inconsistencies to aid the curation process of the dataset maintainers. We use a set of census-domain LER: (i) each population observation must have, at most, one occupation position (occupation positions distinguish, e.g., business managers from ordinary labour) [micro-edit]; (ii) population counts must be integer numbers [micro-edit]; (iii) population counts must be positive [micro-edit]; and (iv) population counts must meet Benford's Law [macro-edit].
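Rule (iv) is a distributional macro-edit. As a rough local prototype of such a check (our own illustrative sketch, not the published LER, which expresses the constraint through the statistic function instead), one can compare the leading digits of the counts against the Benford distribution with a chi-squared goodness-of-fit test in R:

  # Sketch of a Benford's Law macro-edit check on a vector of population counts.
  benford_ok <- function(counts, alpha = 0.05) {
    digits   <- as.integer(substr(as.character(counts[counts > 0]), 1, 1))
    observed <- tabulate(digits, nbins = 9)
    expected <- log10(1 + 1 / (1:9))     # Benford probabilities for leading digits 1..9
    p <- chisq.test(observed, p = expected)$p.value
    p > alpha                            # TRUE: counts are consistent with Benford's Law
  }

  # Toy call; a real check would use many more counts than this.
  benford_ok(c(102, 215, 33, 1210, 57, 140, 19, 3004, 86, 121))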
We find that, out of 5,429,104 observations, 244,406 (4.50%) are inconsistent with respect to (i), accounting for two or more occupational positions; 30,317 (0.56%) count population with non-integer numbers (ii); and 909 (0.02%) do so with negative numbers (iii). Since these LER apply to any census data, we use their Linked Data features to test (ii), (iii) and (iv) against the French (4,848,096 observations) and Australian (12,650 observations) census editions of 2010-2011 (we skip (i) since these contain no labour position data). In both cases, no observation is inconsistent with (iii), although all of them are inconsistent with (ii), most probably due to data anonymization and normalization. All Dutch, French and Australian datasets fit Benford's Law (iv). Figure 2 shows the PROV graph of a consistency check and the generated inconsistency annotations using the Open Annotation Data Model (OA).

  Edit rule                SPARQL
  x > t, t ∈ ℝ             FILTER(?x > t)
  x %in% (v1, ..., vn)     FILTER(?x = v1 || ... || ?x = vn)
  if(x) y                  FILTER(¬?x || ?y)
  Xs ∼ N(µ, σ²)            FILTER(statistic(t, P, S, c))

(a) Translating edit rules into SPARQL LER. The ¬ operation in SPARQL means inverting the inequality contained in the literal.

  Dataset      Size     Inc.   p    r    Reas.  Overh.
  editrules        5       8   1.0  1.0  0.005  0.995
  Synthetic   70,700     131   1.0  1.0  0.005  0.995

(b) Size of the tested datasets (number of observations), number of found inconsistencies, precision, recall, and proportion of execution time devoted to consistency checking (Reas.) and remaining overhead (Overh.) using LER.

Table 2: LER correctness validation: equivalence of edit rules and SPARQL LER, and stress-testing.

(a) PROV-O-Viz [8] diagram showing a consistency check of observation leri:o2 and LER leri:cat5 to produce an annotation. (b) Generated OA annotation of a detected inconsistency in a qb:Observation, linking observation leri:o2 with the LER leri:cat5.

Fig. 2: Reporting obvious inconsistency using Web standards for provenance and annotations.

7 Discussion and Future Work

In this paper we present Linked Edit Rules (LER), a method to express edit rules as Linked Data and to use them to validate arbitrary RDF Data Cubes on the Web. Our proposal and implementation, which use currently available Semantic Web rule standards, fulfil the requirements of Section 2. We show that expressing micro-edits using Semantic Web rule languages is possible, by using SPARQL syntax and Stardog's Rule Reasoning. We design a generic statistic function that provides a statistical vocabulary of multi-observation tests, enabling the expression of Linked Macro-edits. Expressing edit rules as Linked Data provides: (a) extensive and concise provenance (PROV) reports that annotate (OA) detected inconsistencies without modifying the source data, enabling interoperable consistency-check reporting; and (b) the ability to link edit rules with the constrained Web dimensions and measures, enabling an easy reuse of edit rules, as shown with diverse datasets in Section 6. Arguably, a simpler implementation could be designed using SPARQL only (e.g., by leveraging CONSTRUCT and query federation), but important requirements such as [UniqID] would break, hampering the reusability of rules. Conversely, we combine Linked Data, rules on the Web, custom reasoning, SPARQL querying, R functionality and provenance generation to check the quality of data cubes.
A limitation of our macro-rule implementation is that observations cannot be arbitrarily selected using the rule's BGP body. To do so, it would be necessary to implement the function statistic (see Section 4.2) as a custom SPARQL aggregation function. This is problematic, as (1) such functions are part of SPARQL's grammar, and (2) their customization is currently not supported in Stardog. We solve this by using the links of qb:Slice and qb:DataSet to the observations they contain, transparently processing these observations without custom aggregation functions. Arbitrary selections of observations must be explicitly asserted in the graph as qb:Slices.

We plan to extend LER in several aspects. First, we are interested in checking the consistency of rule sets by themselves, before running consistency checks against data. Second, we intend to automatically retrieve relevant rules given an RDF Data Cube Data Structure Definition (DSD). Third, we plan on publishing our QBsistent tool as a web service. Fourth, we want to provide a LER editor to facilitate rule editing and publishing. Last, we will study the generality of our approach by implementing it in other domains.

Acknowledgements

This work has been supported by the Computational Humanities Programme (http://ehumanities.nl) of the Royal Netherlands Academy of Arts and Sciences and the Dutch national program COMMIT. Special thanks go to Frank van Harmelen, Andrea Scharnhorst, Veruska Zamborlini, Wouter Beek, Steven de Rooij and Rinke Hoekstra.

References

1. Bethlehem, J.: Applied Survey Methods: A Statistical Perspective. Wiley (2009)
2. Boley, H., Kifer, M.: RIF Basic Logic Dialect. Tech. rep., World Wide Web Consortium (2013), http://www.w3.org/TR/rif-bld/
3. Complexible, Inc.: Stardog 3.1.4. http://stardog.com/ (2015)
4. Cox, N., Croot, D.: Data editing in a mixed DBMS environment. Statistical Journal of the United Nations Economic Commission for Europe 8(2), 117–136 (1991)
5. Cyganiak, R., Reynolds, D., Tennison, J.: The RDF Data Cube Vocabulary. Tech. rep., World Wide Web Consortium (2013), http://www.w3.org/TR/vocab-data-cube/
6. Granquist, L.: Macro-editing – A review of some methods for rationalizing the editing of survey data. Statistical Journal of the United Nations Economic Commission for Europe 8(2), 137–154 (1991)
7. Harris, S., Seaborne, A.: SPARQL 1.1 Query Language. Tech. rep., World Wide Web Consortium (2013), http://www.w3.org/TR/sparql11-query/
8. Hoekstra, R., Groth, P.: PROV-O-Viz – Understanding the Role of Activities in Provenance. In: 5th International Provenance and Annotation Workshop (IPAW 2014). LNCS, Springer-Verlag, Berlin, Heidelberg (2014)
9. Horrocks, I., Patel-Schneider, P.F., Boley, H., Tabet, S., Grosof, B., Dean, M.: SWRL: A Semantic Web Rule Language Combining OWL and RuleML. Tech. rep., World Wide Web Consortium (2004), http://www.w3.org/Submission/SWRL/
10. de Jonge, E., van der Loo, M.: An introduction to data cleaning with R. Tech. rep. (discussion paper), Statistics Netherlands (2013)
11. Knublauch, H.: SPIN – Modeling Vocabulary. Tech. rep., World Wide Web Consortium (2011), http://www.w3.org/Submission/spin-modeling/
12. Meroño-Peñuela, A.: LSD Dimensions: Use and Reuse of Linked Statistical Data. In: Proceedings of the 19th International Conference on Knowledge Engineering and Knowledge Management, EKAW 2014 (2014)
13. O'Riain, S., McCrae, J., Cimiano, P., Spohr, D.: Using SPIN to formalise XBRL Accounting Regulations on the Semantic Web.
In: Proceedings of the First International Workshop on Finance and Economics on the Semantic Web (FEOSW 2012). Extended Semantic Web Conference (ESWC) (2012)
14. Prud'hommeaux, E.: Shape Expressions 1.0 Primer. Tech. rep., World Wide Web Consortium (2014), http://www.w3.org/Submission/2014/SUBM-shex-primer-20140602/
15. Ryman, A.: Resource Shape 2.0. Tech. rep., World Wide Web Consortium (2014), http://www.w3.org/Submission/2014/SUBM-shapes-20140211/
16. TopQuadrant, US: TopBraid Composer. Features and Getting Started Guide, Version 1.0 (2007), http://www.topbraidcomposer.com/
17. W3C OWL Working Group: OWL 2 Web Ontology Language. Tech. rep., World Wide Web Consortium (2012), http://www.w3.org/TR/owl2-overview/
18. de Waal, T., Pannekoek, J., Scholtus, S.: Handbook of Statistical Data Editing and Imputation. Wiley (2011)