<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Personalised Exploration Graphs on top of Data Lakes</article-title>
        <subtitle>(Discussion Paper)</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Devis Bianchini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valeria De Antonellis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Massimiliano Garda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Brescia, Dept. of Information Engineering</institution>,
          <addr-line>Via Branze 38, 25123 Brescia</addr-line>,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The volume, velocity and uncontrolled variety of Big Data are changing the way data exploration for data-driven decision making is performed on top of Data Lakes. As data grows, novel methods are needed for data aggregation by means of indicators and multi-dimensional analysis of Data Lakes content, enabling exploration of data according to various dimensions, thus empowering users with diverse roles and competencies to capitalise on the available information. In this paper, we present a computer-aided approach (named PERSEUS, PERSonalised Exploration by User Support) for data exploration on top of a Data Lake. The approach is structured over three phases: (i) the construction of a semantic metadata catalog on top of the Data Lake; (ii) the creation of an Exploration Graph, based on metadata contained in the catalog, containing the semantic representation of indicators and analysis dimensions; (iii) the enrichment of the definition of indicators with personalisation aspects (based on users' profiles and preferences) to identify Exploration Contexts, in turn delimiting portions of the Exploration Graph for a personalised and interactive exploration of indicators. Results of an experimental evaluation in the Smart City domain are presented with the aim of demonstrating the feasibility of the approach.</p>
      </abstract>
      <kwd-group>
        <kwd>semantic data lake</kwd>
        <kwd>personalised data exploration</kwd>
        <kwd>OLAP</kwd>
        <kwd>Big Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In dynamic and rapidly evolving environments permeated by the volume, velocity and
uncontrolled variety of Big Data, Data Lakes have been proposed as ground-breaking solutions to
develop applications for data-driven decision making. Data Lakes ensure a suitable degree of
flexibility for managing different types and formats of data sources, since data is loaded “as is”
and transformed only when it becomes necessary [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, as data grows, novel methods
are needed to extract value from Data Lakes content, aggregating data into indicators according
to various dimensions, thus empowering users with diverse roles and competencies to explore
available information. In this paper, we present a computer-aided approach (named PERSEUS,
PERSonalised Exploration by User Support) for data exploration on top of a Semantic Data
Lake. The approach is structured over three phases: (i) the construction of a semantic metadata
catalog on top of the Data Lake; (ii) the creation of an Exploration Graph, based on the metadata
catalog, containing the semantic representation of indicators and analysis dimensions; (iii) the
enrichment of the definition of indicators with personalisation aspects (based on users’ profiles
and preferences) to identify Exploration Contexts, in turn delimiting portions of the Exploration
Graph for a personalised and interactive exploration of indicators. An extended version of this
work has been presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], where we validated the approach in the scope of the Brescia
Smart Living project [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The aim of the project was to enable citizens, energy providers and
Public Administration to explore heterogeneous information available in the context of a Smart
City, at different levels of aggregation, for making decisions and promoting virtuous behaviour
in using private and public resources. The paper is organised as follows. Sections 2–4 describe
the phases of the PERSEUS approach. An excerpt of the implementation details and of the
experimental evaluation is reported in Section 5. Section 6 reviews the state of the art. Finally,
Section 7 closes the paper, sketching future research directions.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Semantic Data Lake construction</title>
      <p>We model a Data Lake as a set of data sources, each one modelled as a triple composed of:
(i) a set of attributes; (ii) a collection of data sets, representing the content
of the data source regardless of its nature (i.e., structured, semi-structured, unstructured); (iii) a set ℳ
of attribute-value pairs containing the metadata needed to access the source (e.g., username,
password) and other source-specific metadata. Each data set in the collection is defined over a subset
of the attributes of its data source. An attribute can be either: (i) a simple attribute or (ii) an attribute
referencing another data set in the same data source (nesting).</p>
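      <p>The data source model above can be sketched in a few lines of Python. This is only an illustrative sketch: the class names, the field names and the data set names ("cities", "coordinates") are assumptions introduced here, not part of the PERSEUS implementation. The example instance mirrors the JSON source of Figure 1, where the coords attribute nests a data set holding lat and long.</p>
      <preformat>
```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    """A data set defined over a subset of the attributes of its source."""
    name: str
    attributes: set[str]
    # attribute name -> name of the nested data set it references
    nested: dict[str, str] = field(default_factory=dict)


@dataclass
class DataSource:
    """A data source: attributes, data sets and access metadata."""
    attributes: set[str]
    data_sets: list[DataSet]
    metadata: dict[str, str]  # e.g. credentials, source type

    def is_consistent(self) -> bool:
        # Every data set may use only attributes declared by the source, and
        # nested references must point to data sets of the same source.
        names = {d.name for d in self.data_sets}
        return all(
            d.attributes.issubset(self.attributes)
            and set(d.nested.values()).issubset(names)
            for d in self.data_sets
        )


# JSON source of Figure 1 (data set names are hypothetical).
json_source = DataSource(
    attributes={"city", "coords", "lat", "long"},
    data_sets=[
        DataSet("cities", {"city", "coords"}, nested={"coords": "coordinates"}),
        DataSet("coordinates", {"lat", "long"}),
    ],
    metadata={"format": "json"},
)
data_lake = [json_source]  # a Data Lake is a set of such sources
```
      </preformat>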
      <p>
        The domain expert is in charge of creating the semantic metadata catalog by means of a
web-based tool supporting basic annotation tasks. The annotation procedure concerns only
attribute names and not their values, thus reducing the annotation burden. The steps for the
creation of the catalog are performed incrementally, as new data sources are added.
Lexical enrichment of data source attributes. Each attribute of a data source is associated
with a label referred to as Entity Property, to reduce the gap between the attribute name and
names of concepts used for semantic annotation. To this aim, domain experts are supported
by two external linguistic APIs, conceived to complement each other: (i) an Abbreviations
API [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], providing a dictionary of acronyms and their expansion, and (ii) WordNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the
widely adopted lexical database.
      </p>
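      <p>As a rough illustration of lexical enrichment, the sketch below derives an Entity Property label from an attribute name. The two lookup tables are hypothetical local stand-ins for the Abbreviations API and for WordNet; no call to the real services is made, and their actual interfaces are not reproduced here.</p>
      <preformat>
```python
from typing import Optional

# Hypothetical dictionary standing in for the Abbreviations API.
ABBREVIATIONS = {"lat": "latitude", "long": "longitude", "comm": "community"}
# Hypothetical vocabulary standing in for a WordNet lookup.
KNOWN_WORDS = {"city", "latitude", "longitude", "community"}


def entity_property(attribute: str) -> Optional[str]:
    """Return a candidate Entity Property label for an attribute name."""
    name = attribute.strip().lower()
    if name in KNOWN_WORDS:
        # The attribute name is already a dictionary word.
        return name
    # Otherwise try to expand it as an abbreviation (None if unknown).
    return ABBREVIATIONS.get(name)


labels = {a: entity_property(a) for a in ["city", "lat", "long", "comm", "xyz"]}
```
      </preformat>
      <p>Attributes for which no label is found (such as "xyz" here) would be left to the domain expert.</p>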
      <p>
        Semantic annotation of data source attributes. Starting from the Entity Property, the
web-based tool retrieves a suitable concept describing the meaning of the attribute. To this
aim, a set of domain ontologies stored within an open access repository (LOV - Linked Open
Vocabularies [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) is accessed through a proper API to search for semantic concepts whose names
match the Entity Property label. The top-ranked concept is automatically proposed for the
annotation to the domain expert, who may revise the annotation.
      </p>
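      <p>The proposal of a top-ranked concept can be illustrated as follows. The sketch ranks candidate concept names by plain string similarity to the Entity Property label; this is an assumption made for illustration and not necessarily the ranking applied by the LOV search service. The candidate list is hypothetical and stands in for the result of a vocabulary search.</p>
      <preformat>
```python
from difflib import SequenceMatcher


def rank_concepts(label: str, candidates: list[str]) -> list[str]:
    """Order candidate concept names by string similarity to the label."""
    def score(concept: str) -> float:
        local_name = concept.split(":")[-1].lower()  # drop the prefix
        return SequenceMatcher(None, label.lower(), local_name).ratio()
    return sorted(candidates, key=score, reverse=True)


# Hypothetical candidates, standing in for a vocabulary search result.
candidates = ["geo:SpatialThing", "place:City", "osadm:Community"]
proposed = rank_concepts("city", candidates)[0]
```
      </preformat>
      <p>Here the concept proposed to the domain expert is place:City, which the expert may accept or revise.</p>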
      <p>Semantic metadata catalog population. The semantic metadata catalog constructed over
the Data Lake contains: (i) the set of concepts annotating attributes of data sources; (ii)
equivalence relationships between pairs of concepts, either associated with the same data
source or different data sources, which are suggested relying on the metadata set ℳ (e.g.,
when the concepts annotate attributes belonging to two tables of a relational database) or
manually defined by the domain expert (e.g., when involved attributes belong to different data sources).</p>
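      <p>A minimal in-memory sketch of such a catalog is given below, assuming that each annotated attribute is identified by a (data source, attribute) pair. The Catalog class and its method names are hypothetical; the annotations reproduce the example of Figure 1.</p>
      <preformat>
```python
from dataclasses import dataclass, field

Node = tuple[str, str]  # (data source, attribute)


@dataclass
class Catalog:
    concepts: dict = field(default_factory=dict)    # Node -> concept name
    equivalences: set = field(default_factory=set)  # frozensets of two Nodes

    def annotate(self, node: Node, concept: str) -> None:
        self.concepts[node] = concept

    def add_equivalence(self, a: Node, b: Node) -> None:
        # Both attributes must already be annotated with a concept.
        if a not in self.concepts or b not in self.concepts:
            raise KeyError("annotate both attributes first")
        self.equivalences.add(frozenset((a, b)))

    def equivalent(self, a: Node, b: Node) -> bool:
        # Equivalence is stored as an unordered pair, hence symmetric.
        return a == b or frozenset((a, b)) in self.equivalences


catalog = Catalog()
catalog.annotate(("json", "city"), "place:City")
catalog.annotate(("csv", "city"), "place:City")
catalog.annotate(("csv", "comm"), "osadm:Community")
# Equivalence across different data sources, defined by the domain expert.
catalog.add_equivalence(("json", "city"), ("csv", "city"))
```
      </preformat>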
      <p>[Figure 1: two example Smart City data sources (a JSON file with the attributes city and coords, the latter nesting lat and long; a CSV file with the attributes comm and city), their representation as attributes and data sets, and the corresponding semantic metadata catalog, organised into Attributes, Entity Properties and Concepts. Legend: domain ontology concept; specialised concept; rdfs:subClassOf (concept specialisation); equivalence relationship.]</p>
      <p>Example. The left side of Figure 1 illustrates examples of two Smart City data sources and their
representation as attributes and data sets. The two sources contain geospatial information
of cities and related administrative areas. The right side of Figure 1 shows the semantic
representation of the sources in the semantic metadata catalog. The Entity Properties are
retrieved from WordNet (e.g., for country attribute) and the Abbreviations API (e.g., lat and
long attributes). To find suitable concepts for semantic annotation, the LOV Search Term API
is invoked using the Entity Properties as query parameters. Two concepts (Latitude and
Longitude) have been obtained as a specialisation of the ones extracted from LOV ontologies
(through the rdfs:subClassOf semantic relationship). In the figure, blue arrows denote
equivalence relationships between concepts.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Creation of the Exploration Graph</title>
      <p>
        Indicators are modelled by data analysts starting from the knowledge retained in the
semantic metadata catalog and through the specialisation of concepts and relationships of a
Multi-Dimensional Ontology (MDO), containing the conceptual elements that must be taken into
account to model indicators. In the design of the MDO, pivotal concepts from available
foundation ontologies have been exploited to: (i) represent users’ activities (Schema.org ontology), (ii)
characterise indicators and dimensions as analytical data entities (Data Cube ontology) and (iii)
model units of measure for indicators (OM ontology). Further details regarding the conceptual
elements of the MDO can be found in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The result of this phase is an Exploration Graph
(an example is given in Figure 2), whose construction follows the steps reported below and is
accomplished with the support of the Protégé tool [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Creation of indicator concept. In this step the Indicator concept of the MDO is specialised
to extend the indicators hierarchy. Using the takesDataFrom semantic relationship, composite
indicators can be defined starting from other fine-grained indicators. For a newly created
indicator, a Formula (that, for a composite indicator, reports how to calculate it in terms of its
component indicators), the UnitOfMeasure and the AggregationFunction are specified.</p>
      <p>[Figure 2: example of an Exploration Graph. The Citizen and BuildingAdministrator user categories are linked through hasPracticableActivity to the MonitoringPollutionLevels activity, which involves the AirPollutionIndicator (hasFormula SummationFormula, hasAggregationFunction Sum, hasUnitOfMeasure ppm, belongsTo Environment). The indicators hierarchy includes NOXIndicator, CO2Indicator, HouseholdCO2 and CO2Heaters (linked by takesDataFrom); indicators are bound through hasDimension to the SpatialDimension, whose levels Apartment, Building, District and City are connected by rollUp. An Exploration Context is highlighted. Legend: personalisation concept; indicator concept; dimension concept; rdfs:subClassOf.]</p>
      <p>Link to dimensional hierarchies. Once an indicator has been modelled, it must be bound to
one or more dimensional hierarchies. The data analyst may reuse previously created hierarchies
or define new ones, relying on the pivotal concepts Dimension and Level from the MDO.</p>
      <p>Definition of personalisation concepts. The semantic representation of indicators is further
enriched by associating them with their target domains (e.g., environment, health) through
the belongsTo relationship. Personalisation concepts derived from the MDO are employed to
state that the awareness of certain indicators impacts particular tasks, requiring end-users
to base their decisions on these indicators (e.g., building monitoring, air pollution checking).
This is achieved by binding the indicator to a UserCategory and an Activity (or one of
their sub-concepts) from the MDO. In particular, the hasPracticableActivity relationship
binds a UserCategory to an Activity. Finally, the involves semantic relationship links an
Activity to one or more Indicators.</p>
      <p>Validation of the created indicator. To assist the data analyst in the modelling task, several
constraints are checked through validation rules defined in the MDO: (i) a valid activity involves
at least one indicator; (ii) a valid dimension hierarchy, being associated with an indicator, must
gather at least one dimension level; (iii) a valid indicator belongs to at least one domain, is
explorable according to at least one dimension hierarchy, possibly has a unit of measure and is
involved in at least one activity. The interested reader can find the formulation of the validation
rules in [<xref ref-type="bibr" rid="ref8">8</xref>].</p>
      <p>Example. In Figure 2, the AirPollutionIndicator is described as a sum of other indicators,
has ppm as unit of measure and is linked with the Environment domain. HouseholdCO2 is an
example of a composite indicator, specialised from CO2Indicator and computed starting from
the CO2Heaters indicator. All the indicators are associated with SpatialDimension, articulated
over the Apartment, Building, District and City levels, which are connected to each other by
rollUp relationships. Similarly, indicators are associated with the TimeDimension (not shown
here). Lastly, indicators can be explored by both citizens and building administrators (modelled
through corresponding concepts) while performing the MonitoringPollutionLevels activity.</p>
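      <p>To make the three validation rules concrete, the following sketch encodes a fragment of the Exploration Graph of Figure 2 as a set of (subject, predicate, object) triples and checks the rules with plain Python functions. This is an assumption-laden illustration: the rules are actually defined in the MDO, and the helper names used here are hypothetical.</p>
      <preformat>
```python
Triple = tuple[str, str, str]

graph: set[Triple] = {
    ("AirPollutionIndicator", "belongsTo", "Environment"),
    ("AirPollutionIndicator", "hasDimension", "SpatialDimension"),
    ("AirPollutionIndicator", "hasUnitOfMeasure", "ppm"),
    ("MonitoringPollutionLevels", "involves", "AirPollutionIndicator"),
    ("SpatialDimension", "hasLevel", "District"),
    ("SpatialDimension", "hasLevel", "City"),
    ("District", "rollUp", "City"),
    ("Citizen", "hasPracticableActivity", "MonitoringPollutionLevels"),
}


def objects(g: set, subject: str, predicate: str) -> set:
    return {o for s, p, o in g if s == subject and p == predicate}


def subjects(g: set, predicate: str, obj: str) -> set:
    return {s for s, p, o in g if p == predicate and o == obj}


def valid_activity(g: set, activity: str) -> bool:
    # Rule (i): a valid activity involves at least one indicator.
    return len(objects(g, activity, "involves")) >= 1


def valid_dimension(g: set, dimension: str) -> bool:
    # Rule (ii): a valid hierarchy gathers at least one level.
    return len(objects(g, dimension, "hasLevel")) >= 1


def valid_indicator(g: set, indicator: str) -> bool:
    # Rule (iii): at least one domain, one valid dimension hierarchy and
    # one activity; the unit of measure is optional, so it is not checked.
    has_domain = bool(objects(g, indicator, "belongsTo"))
    has_dimension = any(
        valid_dimension(g, d) for d in objects(g, indicator, "hasDimension")
    )
    has_activity = bool(subjects(g, "involves", indicator))
    return has_domain and has_dimension and has_activity
```
      </preformat>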
    </sec>
    <sec id="sec-4">
      <title>4. Identification of Personalised Exploration Contexts</title>
      <p>Once the Exploration Graph has been created, the Data Lake can be explored by relying
on: (i) Multi-Dimensional Descriptors, which model the basic multi-dimensional elements on which
exploration is performed; (ii) Exploration Contexts, which identify portions of the Exploration
Graph containing indicators compliant with users’ activities; (iii) contextual preferences, which
suggest to the user the most promising indicators from which to start the exploration within an
Exploration Context.</p>
      <p>Multi-Dimensional Descriptors. Navigating across the Exploration Graph may become impractical
as the number of nodes and edges grows. Hence, to explore the indicators in the graph, we adopt a
strategy grounded on the assumption that users inherently explore data according to a
multi-dimensional organisation. In this respect, we defined Multi-Dimensional Descriptors
(MDDs) over the Exploration Graph, to provide a compact representation of indicators and their dimensional levels.
Figure 2 highlights two examples of MDDs for the CO2Heaters indicator.</p>
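      <p>One way to picture MDDs is as combinations of an indicator with one level for each of its dimensions, such as CO2Indicator (District, Year). The sketch below enumerates them under this assumption; the exact definition of an MDD is given in the extended version of the work, and the function name is hypothetical.</p>
      <preformat>
```python
from itertools import product


def build_mdds(indicator: str, dimensions: dict[str, list[str]]) -> list[tuple]:
    """Pair the indicator with every combination of one level per dimension."""
    level_lists = list(dimensions.values())
    return [(indicator, levels) for levels in product(*level_lists)]


dimensions = {
    "SpatialDimension": ["Apartment", "Building", "District", "City"],
    "TimeDimension": ["Day", "Month", "Quarter", "Year"],
}
mdds = build_mdds("CO2Indicator", dimensions)
```
      </preformat>
      <p>With four spatial and four temporal levels, a single indicator yields sixteen descriptors, which gives an idea of why a ranking mechanism is needed once hundreds of indicators are modelled.</p>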
      <p>Exploration contexts. Personalised exploration of MDDs is modelled through a set of soft
constraints contained in the profile of each user. Soft constraints
are modelled as preferences, organised according to Exploration Contexts, which represent the
situations in which the user explores the MDDs, influenced by both his/her roles and goals. An
Exploration Context is used to delimit a portion of the Exploration Graph that is explorable
by the user. Available contexts are derived by considering all the distinct pairs of
UserCategory and Activity (sub-)concepts. At exploration time, a context can be
bound to one or more users’ profiles. Users can manage their profile by selecting/changing the
context of interest, choosing it from the ones compliant with their role(s).</p>
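      <p>The derivation of contexts from pairs of user category and activity can be sketched over a triple-based encoding of the graph. The encoding and the function names are assumptions made for illustration, sub-concepts are ignored for brevity, and the BuildingMonitoring activity is a hypothetical addition.</p>
      <preformat>
```python
graph = {
    ("Citizen", "hasPracticableActivity", "MonitoringPollutionLevels"),
    ("BuildingAdministrator", "hasPracticableActivity", "MonitoringPollutionLevels"),
    ("MonitoringPollutionLevels", "involves", "AirPollutionIndicator"),
    ("MonitoringPollutionLevels", "involves", "CO2Indicator"),
    ("BuildingMonitoring", "involves", "CO2Heaters"),
}


def exploration_contexts(g: set) -> list:
    """All distinct (user category, activity) pairs found in the graph."""
    return sorted((s, o) for s, p, o in g if p == "hasPracticableActivity")


def delimited_indicators(g: set, context: tuple) -> set:
    """Indicators in the portion of the graph delimited by a context."""
    _, activity = context
    return {o for s, p, o in g if s == activity and p == "involves"}


contexts = exploration_contexts(graph)
alice_context = ("Citizen", "MonitoringPollutionLevels")
alice_indicators = delimited_indicators(graph, alice_context)
```
      </preformat>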
      <p>
        Contextual preferences. The portion of the Exploration Graph delimited by an Exploration Context may
contain a high number of indicators, especially when considering a generic activity (such
as “pollution monitoring”). To cope with this issue, contextual preferences help suggest to
the user the indicators which best fit his/her demands. Contextual preferences can be either:
(a) short-term preferences, expressed by the user at exploration time, representing imminent
exploration needs; (b) long-term preferences, stored in the user’s profile, which are assumed to be
static or to change slowly over time. Contextual preferences are expressed on the set of MDDs
through different constructors that rank indicators based on: (i) the distance that
indicators have in the hierarchy induced by rdfs:subClassOf relationships (IND constructor);
(ii) the distance that dimensional levels have in the hierarchy induced by rollUp relationships,
focusing on a specific dimension (LEV constructor); (iii) the fact that an indicator belongs to
a given domain (DOM constructor). Formalisation details of the constructors are available
in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The rationale behind these constructors is that the user can express his/her preferences
on MDDs by relying on the relationships between MDDs and other concepts within his/her
Exploration Contexts. Base constructors can in turn be combined using the Pareto composition
operator (⊗), which composes two preferences with equal priority, and the prioritization operator (▷) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
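      <p>The interplay of base constructors and composition operators can be illustrated with a small sketch. It follows the usual definitions of Pareto and prioritized composition for strict partial order preferences, and it simplifies the base constructors: each one is reduced to a numeric score (lower is better), IND only distinguishes the requested indicator from all others rather than measuring a hierarchy distance, and MDDs are plain dictionaries. All class and function names, as well as the sample MDDs, are hypothetical.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import Callable

MDD = dict  # simplified: {"indicator": ..., "level": ..., "domain": ...}


@dataclass(frozen=True)
class Preference:
    better: Callable[[MDD, MDD], bool]      # a is strictly preferred to b
    equivalent: Callable[[MDD, MDD], bool]  # a and b are equally good


def from_score(score: Callable[[MDD], int]) -> Preference:
    """Base preference induced by a score (lower scores are preferred)."""
    return Preference(
        better=lambda a, b: score(b) > score(a),
        equivalent=lambda a, b: score(a) == score(b),
    )


def pareto(p: Preference, q: Preference) -> Preference:
    """Equal priority: at least as good on both, strictly better on one."""
    return Preference(
        better=lambda a, b: (
            (p.better(a, b) and (q.better(a, b) or q.equivalent(a, b)))
            or (q.better(a, b) and (p.better(a, b) or p.equivalent(a, b)))
        ),
        equivalent=lambda a, b: p.equivalent(a, b) and q.equivalent(a, b),
    )


def prioritized(p: Preference, q: Preference) -> Preference:
    """p dominates; q only breaks ties among items equivalent for p."""
    return Preference(
        better=lambda a, b: p.better(a, b) or (p.equivalent(a, b) and q.better(a, b)),
        equivalent=lambda a, b: p.equivalent(a, b) and q.equivalent(a, b),
    )


def best(items: list, pref: Preference) -> list:
    """Items to which no other item is strictly preferred."""
    return [a for a in items if not any(pref.better(b, a) for b in items)]


LEVELS = ["Apartment", "Building", "District", "City"]  # rollUp chain


def IND(indicator: str) -> Preference:
    return from_score(lambda m: 0 if m["indicator"] == indicator else 1)


def DOM(domain: str) -> Preference:
    return from_score(lambda m: 0 if m["domain"] == domain else 1)


def LEV(level: str) -> Preference:
    # Distance between levels along the rollUp chain.
    return from_score(lambda m: abs(LEVELS.index(m["level"]) - LEVELS.index(level)))


mdds = [
    {"indicator": "CO2Indicator", "level": "District", "domain": "Environment"},
    {"indicator": "CO2Indicator", "level": "City", "domain": "Environment"},
    {"indicator": "NOXIndicator", "level": "District", "domain": "Environment"},
    {"indicator": "NOXIndicator", "level": "Apartment", "domain": "Health"},
]
long_term = pareto(DOM("Environment"), LEV("District"))
compound = prioritized(IND("CO2Indicator"), long_term)  # short-term first
suggested = best(mdds, compound)
```
      </preformat>
      <p>With these sample MDDs, only the CO2Indicator descriptor at the District level is suggested: the short-term preference on the indicator takes precedence, and the long-term preferences break the tie between the two CO2Indicator descriptors.</p>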
      <p>[Figure 3: prototype web-based GUI of the Smart City Indicators Exploration Engine. (a) Exploration Context selection (Role: Citizen; Activity: MonitoringPollutionLevels) and request formulation (desired domain, desired indicator set to CO2Indicator, desired dimension), with the “Edit Long-Term Preferences” and “Find Indicators” buttons. (b) Results visualisation for the current Exploration Context, showing the evaluated preference expression built from IND(CO2Indicator), DOM(Environment) and LEV(SpatialDimension, District), and the returned indicators: CO2Indicator (District, Year), CO2Indicator (District, Quarter), CO2Indicator (District, Month), CO2Indicator (District, Day), with the “Browse selected indicator” button.]</p>
    </sec>
    <sec id="sec-5">
      <title>5. Implementation and experimental evaluation</title>
      <p>
        In this section, we present an excerpt of the implementation and the experimental evaluation
conducted in the scope of the Brescia Smart Living project, wherein three different types
of end-users have been identified as targets for the personalised exploration of indicators:
(i) citizens, willing to explore aggregated data related to their neighbourhood (for example,
average energy consumption, air quality, neighbourhood safety); (ii) property managers,
administering one or more apartment buildings; (iii) technical user plant managers, responsible
for heat distribution in buildings. In particular, we focus here on presenting the procedure
for personalised exploration of indicators, achieved through a prototype web-based GUI
(Figure 3). The usability of the GUI (in terms of ease of finding and exploring indicators)
was tested by a representative group of 10 users, belonging to the three aforementioned
types, who completed a standard System Usability Scale (SUS) questionnaire [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Averaged
results from SUS questionnaires positioned the prototype in the 90th-95th percentile range of
the SUS score curve. Personalised indicators exploration has been performed on top of a
Data Lake infrastructure, relying on the Hadoop Distributed File System (HDFS). The Data Lake
internally adheres to a zone-based organisation [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], where each zone represents a stage of
data processing. However, since the PERSEUS approach is meant to be employed on top of
the Data Lake infrastructure, it is agnostic with respect to the inner organisation strategy of
the Data Lake, which depends on the non-functional requirements of the application domain. For
instance, when data quality issues have to be considered, a Bronze-Silver-Gold organisation may
be more suitable than a zone-based organisation.
      </p>
      <p>
        GUI for personalised indicators exploration. Personalised indicators exploration is
articulated over four main steps executed through a web-based GUI (Figure 3). To present the four
steps (S1-S4), we consider Alice, a citizen interested in monitoring air pollution levels to decide
whether or not to practise outdoor activities, since pollution affects this kind of activity.
      </p>
      <p>(S1) Exploration context selection – The exploration platform prompts Alice to select one of
the Exploration Contexts available for her profile. Figure 3(a) shows the selection of the
AirPollutionMonitoring activity.</p>
      <p>(S2) Short-term preferences formulation – In this step, Alice chooses the desired indicators,
domains and dimensional levels, and the corresponding concepts are mapped by the platform
to DOM, IND and LEV base preference constructors. For instance, in Figure 3(a), when Alice
selects the CO2Indicator, the corresponding IND preference constructor is automatically
included in the request. The obtained constructors constitute the short-term preferences.</p>
      <p>(S3) Short-term and long-term preferences combination – Short-term preferences in the request
are combined with the long-term preferences in the user’s profile that hold within the
Exploration Context, thus leading to the compound preference P. Long-term preferences are
automatically combined using the Pareto composition operator, since they all have equal
importance for Alice. Short-term preferences are combined with long-term ones according to
the prioritization operator (▷), as they address an immediate need. After the request formulation
has been finalised, Alice confirms her choices by clicking the “Find Indicators” button.</p>
      <p>(S4) Preference evaluation and indicators exploration – The compound preference P from the
previous step undergoes an evaluation process to identify the set of best (optimal) MDDs
according to P. Such MDDs are proposed to the user, who can select any of them to explore
indicator values. For example, Alice’s preference evaluation result is displayed on the first page
of the list in Figure 3(b). Finally, Alice selects one of the MDDs (by clicking on the “Browse
selected indicator” button) and the multi-dimensional query that retrieves indicator values is
issued over the underlying Semantic Data Lake. As detailed in [<xref ref-type="bibr" rid="ref2">2</xref>], the query is automatically
generated from the selected MDD and contains: (i) a projection clause, with target indicator and
analysis dimensions; (ii) the aggregation function; (iii) a selection clause (to restrict data access,
according to the user’s profile); (iv) the calculation formula. A set of mappings associated with
the MDD (defined by the data analyst) makes it possible to circumscribe the portion of the catalog
covering the concepts that annotate the attributes involved in the query. The approach proposed in [<xref ref-type="bibr" rid="ref12">12</xref>] is
leveraged to create a query plan aimed at retrieving indicator values from Data Lake sources.</p>
      <p>Experiments on personalisation effectiveness. To demonstrate the benefits (effectiveness) of
personalisation in suggesting relevant indicators (MDDs) to the user, we used Top-k precision
and Top-k recall, two metrics widely used for the
evaluation of retrieval systems. The effectiveness of a personalised search for indicators depends
on users’ profiles and, more specifically, on the preferences contained within. In this respect,
two types of profiles, differing in the number of preferences, have been considered for ranking
≈ 3000 MDDs generated from 223 indicators: (i) a first profile, containing only a single preference, and (ii)
a second, richer profile containing three preferences. Results for different values of Top-k MDDs are
reported on the right side of Figure 3. In particular, the Top-k recall increases as the value
of k increases and, for the same value of k, the profile with more preferences achieves a higher
recall. Thus, a richer profile (i.e., with more personalisation elements) enables a more effective
retrieval of relevant indicators (higher Top-k recall) and, as witnessed by the experiments
conducted in [<xref ref-type="bibr" rid="ref2">2</xref>], delivers a higher selectivity of MDDs.</p>
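      <p>For reference, Top-k precision and Top-k recall can be computed as in the sketch below, assuming a ranked list of MDD identifiers and a set of MDDs judged relevant. The identifiers and the relevance judgements are made up for illustration and do not come from the experiments.</p>
      <preformat>
```python
def precision_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if 1 > k:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    if 1 > k or not relevant:
        raise ValueError("k must be positive and relevant must be non-empty")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


ranked = ["m1", "m2", "m3", "m4", "m5"]  # hypothetical ranking of MDDs
relevant = {"m1", "m3", "m5"}            # hypothetical relevance judgements
p2 = precision_at_k(ranked, relevant, 2)
r2 = recall_at_k(ranked, relevant, 2)
r4 = recall_at_k(ranked, relevant, 4)
```
      </preformat>
      <p>Since the top-k list only grows with k, recall cannot decrease as k increases, which is consistent with the trend reported above.</p>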
    </sec>
    <sec id="sec-6">
      <title>6. Related Work</title>
      <p>
        In this section, we analyse an excerpt of the literature based on the requirements demanded
by each phase of the PERSEUS approach. Regarding Semantic Data Lake modelling research,
the focus in recent years has been on the formalisation of models for supporting knowledge
extraction from Data Lakes, building a semantic overlay with different techniques (e.g., by
grouping similar attributes to ease the querying of data sources [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] or by building thematic views
on the data sources, annotating their attributes [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]). Concerning the design of indicators,
ontologies have been widely used due to their shared and machine-understandable
conceptualisation. Recent efforts propose ontology-driven approaches to model KPIs, emphasising the
importance of correlation between indicator values [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], possibly including personalisation
concepts to drive the exploration of indicators (e.g., in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], to explore sensors network data).
Shifting towards data exploration issues, the usage of qualitative preferences yields higher
expressiveness with respect to quantitative ones in assuring a (strict) partial order of search
results. In [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], SPARQL qualitative preference queries are translated into queries over relational
database systems, whereas in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] preferences are formulated over aggregation levels of facts
in a Data Warehouse ecosystem.
      </p>
      <p>
        Novel contributions. PERSEUS aims at proposing a combined engineering of different
techniques for addressing Semantic Data Lake exploration. With respect to [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ], PERSEUS
fosters a preliminary lexical enrichment of data sources using both a lexical database and an
abbreviation dictionary for building the semantic metadata catalog. In the second phase, the
approach supports the definition of indicators also considering the activities performed by
users while exploring data. These personalisation aspects in indicators modelling are only
partially treated in [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ]. In the third phase, with respect to [
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ], PERSEUS exploits users’
preferences to rank indicators relying on their semantic definition, instead of actual values,
which are retrieved only at a later time, thus saving the cost and resources needed to query data sources.
      </p>
    </sec>
    <sec id="sec-7">
      <title>7. Concluding remarks</title>
      <p>
        In this paper, we presented PERSEUS, a computer-aided approach for data exploration on top of
a Semantic Data Lake. The approach is structured over three phases: (i) the construction of a
semantic metadata catalog on top of the Data Lake; (ii) the creation of an Exploration Graph,
based on the metadata catalog, containing the semantic representation of indicators and analysis
dimensions; (iii) the enrichment of the definition of indicators with personalisation aspects (based
on users’ profiles and preferences) to identify Exploration Contexts, in turn delimiting portions
of the Exploration Graph for a personalised and interactive exploration of indicators. Results
of an experimental evaluation in the scope of the Brescia Smart Living project are presented
with the aim of demonstrating the feasibility of the approach. Each phase of the PERSEUS
approach paves the way to further investigation. For instance, regarding preference-based
indicators exploration, we will enhance the preference model by considering the propagation of
preferences across Exploration Contexts, as proposed by [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], thus establishing how preferences
holding in a more generic Exploration Context are propagated to a more specific context.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>F.</given-names> <surname>Nargesian</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Zhu</surname></string-name>,
          <string-name><given-names>R. J.</given-names> <surname>Miller</surname></string-name>,
          <string-name><given-names>K. Q.</given-names> <surname>Pu</surname></string-name>,
          <string-name><given-names>P. C.</given-names> <surname>Arocena</surname></string-name>,
          <article-title>Data Lake Management: Challenges and Opportunities</article-title>,
          <source>Proceedings of the VLDB Endowment</source>
          <volume>12</volume>
          (<year>2019</year>)
          <fpage>1986</fpage>-<lpage>1989</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>D.</given-names> <surname>Bianchini</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>De Antonellis</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Garda</surname></string-name>,
          <article-title>A semantics-enabled approach for personalised data lake exploration</article-title>,
          <source>Knowledge and Information Systems</source>
          <volume>66</volume>
          (<year>2024</year>)
          <fpage>1469</fpage>-<lpage>1502</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bianchini</surname>
          </string-name>
          et al.,
          <article-title>Data Management Challenges for Smart Living</article-title>
          ,
          <source>in: Proc. of Cloud Infrastructures, Services, and IoT Systems for Smart Cities (IISSC 2017)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>131</fpage>
          -
          <lpage>137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <source>STANDS4 Web Services: Abbreviations API</source>
          ,
          <year>2024</year>
          . URL: https://www.abbreviations.com/abbr_api.php, accessed on March
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>WordNet: a lexical database for English</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>38</volume>
          (
          <year>1995</year>
          )
          <fpage>39</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.-Y.</given-names>
            <surname>Vandenbussche</surname>
          </string-name>
          et al.,
          <article-title>Linked Open Vocabularies (LOV): a gateway to reusable semantic vocabularies on the Web</article-title>
          ,
          <source>Semantic Web</source>
          <volume>8</volume>
          (
          <year>2017</year>
          )
          <fpage>437</fpage>
          -
          <lpage>452</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <source>Protégé: a free, open-source ontology editor and framework for building intelligent systems</source>
          ,
          <year>2024</year>
          . URL: https://protege.stanford.edu/, accessed on March
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Garda</surname>
          </string-name>
          ,
          <article-title>A Semantics-Enabled Approach for Personalised Data Lake Exploration</article-title>
          ,
          <source>Ph.D. thesis</source>
          , University of Brescia - Italy,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kießling</surname>
          </string-name>
          ,
          <article-title>Foundations of Preferences in Database Systems</article-title>
          ,
          <source>in: Proceedings of the 28th International Conference on Very Large Data Bases (VLDB 2002)</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>322</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bangor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. T.</given-names>
            <surname>Kortum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>An Empirical Evaluation of the System Usability Scale</article-title>
          ,
          <source>Intl. Journal of Human-Computer Interaction</source>
          <volume>24</volume>
          (
          <year>2008</year>
          )
          <fpage>574</fpage>
          -
          <lpage>594</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Giebler</surname>
          </string-name>
          , et al.,
          <article-title>A zone reference model for enterprise-grade data lake management</article-title>
          ,
          <source>in: 2020 IEEE 24th Int. Enterprise Distributed Object Computing Conf.</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H. B.</given-names>
            <surname>Hamadou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gallinucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          ,
          <article-title>Answering GPSJ Queries in a Polystore: A Dataspace-Based Approach</article-title>
          ,
          <source>in: Proceedings of the International Conference on Conceptual Modeling (ER 2019)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>189</fpage>
          -
          <lpage>203</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Mami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Graux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Scerri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jabeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <article-title>Squerall: Virtual Ontology-Based Access to Heterogeneous and Large Data Sources</article-title>
          ,
          <source>in: Proceedings of the 18th International Semantic Web Conference (ISWC 2019)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>229</fpage>
          -
          <lpage>245</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Diamantini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lo Giudice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Potena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Storti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ursino</surname>
          </string-name>
          ,
          <article-title>An Approach to Extracting Topic-guided Views from the Sources of a Data Lake</article-title>
          ,
          <source>Information Systems Frontiers</source>
          <volume>23</volume>
          (
          <year>2021</year>
          )
          <fpage>243</fpage>
          -
          <lpage>262</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M. del Mar</given-names>
            <surname>Roldán-García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>García-Nieto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maté</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Trujillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Aldana-Montes</surname>
          </string-name>
          ,
          <article-title>Ontology-driven approach for KPI meta-modelling, selection and reasoning</article-title>
          ,
          <source>International Journal of Information Management</source>
          <volume>58</volume>
          (
          <year>2019</year>
          )
          <fpage>102018</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Hippolyte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rezgui</surname>
          </string-name>
          ,
          <article-title>The UDSA ontology: An ontology to support real time urban sustainability assessment</article-title>
          ,
          <source>Advances in Engineering Software</source>
          <volume>140</volume>
          (
          <year>2020</year>
          )
          <fpage>102731</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Goncalves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <article-title>Handling qualitative preferences in SPARQL over virtual ontology-based data access</article-title>
          ,
          <source>Semantic Web</source>
          <volume>13</volume>
          (
          <year>2022</year>
          )
          <fpage>659</fpage>
          -
          <lpage>682</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Biondi</surname>
          </string-name>
          ,
          <article-title>myOLAP: An Approach to Express and Evaluate OLAP Preferences</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>23</volume>
          (
          <year>2010</year>
          )
          <fpage>1050</fpage>
          -
          <lpage>1064</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ciaccia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Martinenghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Torlone</surname>
          </string-name>
          ,
          <article-title>Foundations of Context-aware Preference Propagation</article-title>
          ,
          <source>Journal of the ACM (JACM)</source>
          <volume>67</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>