SICRaS: a semantic big data platform for fighting
tax evasion and supporting social policy making
Giovanni Adinolfi, giovanni.adinolfi@eng.it, Engineering Tributi SPA, Via G.B. Trener 8, 38121 Trento, Italy

Paolo Bouquet, bouquet@okkam.it, Okkam SRL, Via Segantini 23, 38121 Trento, Italy

Stefano Bortoli, bortoli@okkam.it, Okkam SRL, Via Segantini 23, 38121, Trento, Italy

Lorenzo Zeni, lorenzo.zeni@alysso.it, Alysso SRL, Via G.B. Trener 8, 38121 Trento, Italy

Introduction: the business needs
In recent years, Italy has been dealing with serious economic and social issues, which have been aggravated by the recent global economic crisis. These issues include, among others: a high fiscal burden, widespread tax evasion, a rising unemployment rate and the progressive aging of the population. In this context, there are two major needs that public administrations have to meet. On the one hand, it is mandatory to fight tax evasion; on the other hand, there is the need to deliver social services in a more efficient, fair and effective way.

Our industrial need is to support governance and policy making in achieving these goals through the integrated analysis of the large amount of information collected by public administrations and by other public officers and organizations (e.g. notaries, public utilities and so on). This information is typically scattered over several heterogeneous and decoupled data sources. Moreover, it may also be partially outdated, unreliable and redundant.

The adoption of semantic technologies enables the construction of an integrated, trustworthy and accurate knowledge base, providing a picture of the fiscal and social situation of each single citizen and of the community of a local municipality, and overcoming the limitations of legacy systems. The cornerstones of the solution are: 1) a set of domain ontologies aimed at improving data integration and at producing useful inferences from explicit information; 2) a scalable system to reconcile the identities referring to the same real-world entity across datasets, associating a unique and persistent name to each single entity; and 3) geo-spatial technologies to achieve a deeper understanding of the observed districts by means of spatial analysis and reasoning.

Towards Linked Data for the tax domain
Tax information systems typically work with an extraordinary amount of data concerning many different aspects of a taxpayer’s life: personal and company details, cadastral information, job positions and so on. The role of ontologies in data exchange and integration, so as to retrace tax positions from all this information, is definitely invaluable [1]. In fact, since data are collected at various times by different partner administrations, there is the need to link all this tax-related information coming from distributed source streams. Looking at the domain, we recognize that tributes are generally imposed taking into account specific circumstances or events that happen during the taxpayer’s life: taking up residency in a new town, joining a family unit, buying a house, etc.

Given these assumptions, we chose to follow an entity/event based ontological modelling approach. The goal is to support an integration pipeline producing a unified view of the relevant fiscal facts scattered across the datasets supplied by diverse public institutions. The entity/event based modelling approach then allows us to materialize, at a given instant of time, the deduced tax position of each single taxpayer.
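As a minimal illustration of the entity/event based modelling style, the sketch below represents a house purchase as an event linking a taxpayer entity, a cadastral parcel and a date. It uses the rdflib library, and the namespace and property names are hypothetical placeholders rather than the actual SICRaS domain ontology:

# Minimal sketch of entity/event based modelling with rdflib.
# The namespace and property names are illustrative assumptions,
# not the actual SICRaS domain ontology.
from rdflib import Graph, Namespace, Literal, RDF
from rdflib.namespace import XSD

EX = Namespace("http://example.org/tax-ontology/")

g = Graph()
g.bind("ex", EX)

taxpayer = EX["taxpayer/IT-RSSMRA80A01H501U"]   # entity: a taxpayer
parcel   = EX["parcel/TN-0042-0007"]            # entity: a cadastral parcel
event    = EX["event/purchase-2014-03-15-001"]  # event: a house purchase

g.add((event, RDF.type, EX.PropertyPurchase))
g.add((event, EX.buyer, taxpayer))
g.add((event, EX.purchasedParcel, parcel))
g.add((event, EX.eventDate, Literal("2014-03-15", datatype=XSD.date)))

print(g.serialize(format="turtle"))

Modelling the purchase as a first-class event, rather than as a direct property of the taxpayer, is what lets the pipeline later reconstruct the tax position valid at any given instant of time.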

The Pipeline from Data to (Big) Knowledge
One of the known limits of semantic technologies is scalability, both in reasoning and in data management. This has a twofold justification: on the one hand, there are limits related to the computational complexity of reasoning based on model-theoretic semantics; on the other hand, existing technologies for large-scale data management are still immature. To overcome these limits, we organized a pipeline relying on an ensemble of scalable, state-of-the-art technologies to define a Semantic ETL suitable for creating a Semantic Big Data Pool.

We rely on a customized and optimized version of the OpenRefine tool to perform data cleaning operations, including syntactical validation and transformation of the original data coming from the public institutions. The formal and syntactical validations, expressed according to a specific rule language, are the result of many years of experience in the field. At this stage, the issues related to semantic and structural heterogeneity affecting the original data are resolved by relying on a set of maintainable contextual ontology mappings towards the defined domain ontology. Each record is then analysed to extract information about the involved entities and to reconcile their identities relying on the Okkam Entity Name System [2]. Once the identity of every entity involved in each record has been disambiguated, the dataset is exported in RDF and stored as entity-centric named graphs in the Hadoop Distributed File System (HDFS).
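A condensed sketch of the record-level step of this Semantic ETL is given below. The resolve_okkam_id helper, the record layout and the property names are hypothetical placeholders standing in for the actual ENS client and source formats:

# Sketch: take a cleaned record, reconcile the entity it describes against the
# Okkam Entity Name System, and emit an entity-centric named graph.
# The ENS lookup and the record fields are illustrative assumptions.
from rdflib import Dataset, Namespace, Literal, RDF, URIRef

EX = Namespace("http://example.org/tax-ontology/")

def resolve_okkam_id(name, birth_date):
    """Placeholder for a lookup against the Okkam ENS service."""
    return URIRef("http://www.okkam.org/ens/id-" + name.lower().replace(" ", "-"))

def record_to_named_graph(dataset, record):
    entity_id = resolve_okkam_id(record["name"], record["birth_date"])
    g = dataset.graph(entity_id)          # one named graph per reconciled entity
    g.add((entity_id, RDF.type, EX.Taxpayer))
    g.add((entity_id, EX.fullName, Literal(record["name"])))
    g.add((entity_id, EX.residentIn, URIRef(record["municipality_uri"])))
    return g

ds = Dataset()
record_to_named_graph(ds, {
    "name": "Mario Rossi",
    "birth_date": "1980-01-01",
    "municipality_uri": "http://example.org/municipality/Trento",
})
# The resulting quads can then be serialized (e.g. as N-Quads) and written to HDFS.
print(ds.serialize(format="nquads"))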

The result of this first part of the process is a physically distributed and logically integrated large RDF graph that can be manipulated and processed relying on emerging big data technology. In particular, we rely on tools such as Apache HBase, Apache Spark, Apache Flink, Apache Hive, and Apache Pig to define (complex) big data shuffling processes producing any view, analysis and mash-up necessary to support tax assessment domain applications. In fact, it is possible to select subsets of the giant RDF graph and store them in application-specific data management systems. For example, we sink data into a triple store such as OpenRDF Sesame and enable scalable rule-based reasoning tasks using SPRINGLES. Another example is to build sub-graphs to support seamless real-time navigation of the knowledge, relying on effective indexing tools such as Apache SIREn, or to perform graph-based analysis by sinking data into a graph database (e.g. Neo Technology's Neo4j). Finally, it is possible to integrate semantic technologies in the core of the SpagoBI suite in order to enable novel Business Intelligence tools and techniques [3].
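To illustrate how such application-specific subsets might be carved out of the integrated graph, the following sketch filters an N-Triples dump on HDFS by predicate using Spark before sinking it into a downstream store; the HDFS paths and the predicate URI are hypothetical:

# Sketch: select a sub-graph of the integrated RDF data by predicate with Spark,
# so it can be sunk into an application-specific store (triple store, SIREn
# index, graph database). Paths and predicate URI are illustrative assumptions.
from pyspark import SparkContext

sc = SparkContext(appName="sicras-subgraph-selection")

triples = sc.textFile("hdfs:///sicras/integrated-graph/*.nt")

# Keep only the statements describing property purchases.
purchase_triples = triples.filter(
    lambda line: "<http://example.org/tax-ontology/purchasedParcel>" in line
)

purchase_triples.saveAsTextFile("hdfs:///sicras/subgraphs/property-purchases")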

Geographic technologies for spatial analysis and reasoning
The integration of geo-spatial technologies adds an important analytic dimension to SICRaS. Taking advantage of this kind of information, we firstly intend to exploit the notion of territory, seen as a spatial region in our ontological model. This enables new ways of extracting, observing and analysing data about real-world entities and the spatial relations among them. Secondly, we develop techniques to match entities based on their geo-spatial features, linking our knowledge base to external sources (e.g. urban development plans) in order to discover new information valuable for tax assessment and for other fiscal and social purposes.
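A minimal example of the kind of spatial check involved is sketched below, with the Shapely library standing in for the actual geo-spatial stack and purely hypothetical coordinates: it tests whether a cadastral parcel falls inside a zone of an urban development plan.

# Sketch of a basic spatial check: does a cadastral parcel (reduced to a point
# for simplicity) fall inside a zone of the urban development plan?
# Coordinates and the use of Shapely are illustrative assumptions.
from shapely.geometry import Point, Polygon

# A building-zone polygon from a (hypothetical) urban development plan.
building_zone = Polygon([
    (11.119, 46.067), (11.125, 46.067),
    (11.125, 46.072), (11.119, 46.072),
])

# The (hypothetical) centroid of a cadastral parcel.
parcel_centroid = Point(11.121, 46.069)

if building_zone.contains(parcel_centroid):
    print("Parcel lies inside a building zone: relevant for tax assessment")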

Concluding remarks
In SICRaS we define a scalable and efficient data processing pipeline, capable of overcoming the limits of current semantic technologies by riding the wave of emerging big data processing tools. A wise combination of semantic and big data technologies, tempered with deep domain knowledge and sophisticated geo-spatial tools, creates new opportunities to define tax assessment applications. By exploiting this wealth of data in an efficient and effective way, we aim to define the next generation of tools for policy makers and to help the Italian institutions in overcoming the challenges of the 21st century.

References
[1] Isabella Distinto, Nicola Guarino, and Claudio Masolo. 2013. A well-founded ontological framework for modeling
personal income tax. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Law
(ICAIL '13). ACM, New York, NY, USA, 33-42.

[2] Paolo Bouquet, Heiko Stoermer, Claudia Niederee, and Antonio Maña. 2008. Entity Name System: The Back-Bone of an Open and Scalable Web of Data. In Proceedings of the 2008 IEEE International Conference on Semantic Computing (ICSC '08). IEEE Computer Society, Washington, DC, USA, 554-561.

[3] Matteo Golfarelli. 2009. Open Source BI Platforms: A Functional and Architectural Comparison. In Proceedings of the 2009 International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2009), Linz, Austria, August 31-September 2, 2009. Springer Berlin Heidelberg, LNCS 5691, 287-297.