=Paper=
{{Paper
|id=Vol-1962/paper_4
|storemode=property
|title=Enabling Data Analytics from Knowledge Graphs
|pdfUrl=https://ceur-ws.org/Vol-1962/paper_4.pdf
|volume=Vol-1962
|authors=Henrique Santos
|dblpUrl=https://dblp.org/rec/conf/semweb/Santos17
}}
==Enabling Data Analytics from Knowledge Graphs==
Henrique Santos, Universidade de Fortaleza, Fortaleza, CE, Brazil (hos@edu.unifor.br)

Abstract. Scientific data is being acquired in high volumes in support of studies in many knowledge areas. Regular data analytics processes make use of datasets that often lack enough knowledge to facilitate the work of data scientists. By relying on knowledge graphs (KGs), those difficulties can be mitigated. This research focuses on enabling data analytics over scientific data in light of the knowledge available in KGs, providing query-based access to scientific data points in KGs while making use of the available knowledge to facilitate data analytics activities.

Keywords: knowledge graphs, data analytics, data access

===1 Problem statement===

Scientific data (observation/simulation data) is being generated and acquired in high volumes in support of studies in many knowledge areas and industry sectors. Datasets containing raw or curated scientific data points are often used as input for data analytics pipelines. For example, in the Smart City domain, a recent study [13] aims to identify potential public transportation users who are having an unsatisfactory experience while using buses in populous metropolitan areas. As there is no straightforward way to know whether a user is riding a bus at a moment when it is overcrowded, late, or likely to maximize the number of needed transfers, the study makes use of ticket validation, bus stop, bus route, bus schedule and traffic information datasets, which need to undergo data cleansing, data mining and visualization techniques before actually being used in support of the desired objective.

In this scenario, the problem arises when the knowledge behind the data is lost during ordinary data acquisition and preparation activities [21] and is therefore not available for field specialists and scientists (i.e. data users) to work with, leading those professionals to perform burdensome tasks [15] in order to make sure the data they are working with is suitable for the desired goals. Moreover, the lack of metadata usually restricts data users who have no prior background knowledge about the data, preventing potential applications of the data from being leveraged. Preliminarily, we have identified the following problems a data user faces when trying to perform analytics over scientific datasets:

* How to successfully find all the relevant data among massive data collections?
* How to compare or combine two (or more) variables that measure the same characteristic but were acquired using different instruments, each with its own resolution, precision and accuracy?
* How to allow data users with no prior knowledge about the data to successfully use it and leverage new applications?

Recently, the use of Knowledge Graphs (KGs) has been on the rise as a way of building large knowledge bases with a graph structure. These graphs represent knowledge as a series of statements expressed as subject-predicate-object triples. Until now, common KG usages include enhancing search [1] and performing AI tasks such as Question Answering [17], Natural Language Processing [7] and Machine Learning [18]. In contrast, there is an increasing number of approaches for building domain science [2, 6, 9], Internet of Things (IoT) [16] and city [23] KGs (in which scientific data are present), with the intent of encoding provenance, context and further knowledge behind each scientific data point. Nevertheless, when confronted with the problems listed above, state-of-the-art approaches do not perform well, as they are focused either on annotation [2], general use [9] or real-time querying [16]. As a consequence, data users still rely on the aforementioned scientific datasets, which usually lack enough knowledge to facilitate data understanding and preparation.

This Ph.D. research proposal addresses the problem of enabling data analytics over scientific data in light of the knowledge available in KGs that describe the studies generating that data. More specifically, it aims to provide query-based access to scientific data points in KGs while making use of the available knowledge to facilitate data analytics activities. This objective poses a number of challenges, among them:

* Domain modeling: The development of domain ontologies for the Semantic Web has historically been use-case driven [4, 11], but the data analytics use case has not yet been fully explored.
* Provenance, contextual knowledge and uncertainty: Instrument characteristics (resolution, accuracy, precision), agent interventions over instruments (deployments, calibrations, configurations) and detector faults are examples of what can directly affect scientific data values. This knowledge needs to be tracked and explored to provide data users with trustworthy results.
* Knowledge Graph data access: Routine data tools (R, Python, Weka, Gephi, Business Intelligence software, etc.) used in support of data analytics activities often expect tabular data as input and do not cope properly with Semantic Web technologies and formats (a sketch illustrating this gap follows this list).
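To make the data access challenge concrete, here is a minimal sketch, not part of the proposal itself, of how a data user might currently flatten scientific data points from a KG into the tabular form routine tools expect. The endpoint URL, the SOSA-style observation modeling and the query shape are assumptions made for illustration; the sketch relies on the SPARQLWrapper and pandas libraries.

<syntaxhighlight lang="python">
# Minimal sketch: pulling scientific data points out of a KG into the tabular
# form that routine data tools expect. The endpoint URL and the SOSA-style
# modeling of observations are illustrative assumptions, not the proposal's API.
import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://example.org/sparql"  # hypothetical scientific KG endpoint

QUERY = """
PREFIX sosa: <http://www.w3.org/ns/sosa/>
SELECT ?sensor ?property ?value ?time WHERE {
  ?obs a sosa:Observation ;
       sosa:madeBySensor     ?sensor ;
       sosa:observedProperty ?property ;
       sosa:hasSimpleResult  ?value ;
       sosa:resultTime       ?time .
}
LIMIT 1000
"""

def fetch_observations(endpoint: str = ENDPOINT) -> pd.DataFrame:
    """Run a SELECT query and flatten the JSON bindings into a DataFrame."""
    client = SPARQLWrapper(endpoint)
    client.setQuery(QUERY)
    client.setReturnFormat(JSON)
    bindings = client.query().convert()["results"]["bindings"]
    rows = [{var: cell["value"] for var, cell in b.items()} for b in bindings]
    return pd.DataFrame(rows)

if __name__ == "__main__":
    print(fetch_observations().head())
</syntaxhighlight>

Writing such a query already presupposes familiarity with the ontologies used in the KG, which is precisely the kind of burden this research aims to relieve.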
===2 Relevancy===

Data preparation is estimated to take around 80% of the whole analytical pipeline [19], with tasks like data understanding and data cleaning requiring great effort. Hence, the actual analysis activities, which indeed extract new knowledge from the data, are delayed and/or shortened, directly impacting outcome quality and project deadlines. We aim to simplify this process by providing specifications and tools. Given this, we expect this research to bring direct benefits to data scientists and field specialists. Interoperability between scientific KGs and existing non-semantic tools should broaden the use of KGs to even more knowledge areas, as working with the data they contain will be made easier.

===3 Related work===

The term "Knowledge Graph" gained popularity with the announcement of the Knowledge Graph by Google (https://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html), an effort to merge Freebase [5], Wikipedia and the CIA World Factbook (https://www.cia.gov/library/publications/the-world-factbook), augmented with their search engine's queries and results. Since then, a number of existing projects have been categorized as KGs. For instance, YAGO [25], DBpedia [3] and Wikidata [26] are free general-purpose KGs, while the Gene Ontology [2], Bio2RDF [6] and KnowLife [9] are aimed toward the life sciences. All of those approaches have focused on the problem of encoding knowledge and building KGs, and are not particularly concerned with how to cope with data analytics activities.

The Graph of Things [16] is a proposed KG for integrating heterogeneous IoT data sources that enables querying and visualization through a SPARQL endpoint. This approach makes use of the SSN ontology (https://www.w3.org/TR/vocab-ssn/) to describe physical sensing instruments and their observed data, with some metadata including sensor configuration and measured characteristic.
However, the only ways to work with the data contained in this KG are its SPARQL endpoint or a stream subscription channel that provides continuous queries over RDF stream data, which makes the approach unsuitable for data analytics since it lacks interoperability with existing data tools. The work in [8] describes an approach that integrates heterogeneous data sources into an RDF KG for predictive analysis. The presented system is capable of providing a SPARQL query interface for preparing datasets for different tools in the context of predictive analysis.

A number of approaches tackle data analytics challenges related to the Smart City context using city-published data. CityPulse [22] is a framework that enables the development of applications in support of cities by providing integration mechanisms for urban data streams. ISO 37120:2014 [14] is a standard that defines 100 indicators across 17 themes, evaluated to be a precise way to measure a city's performance in terms of its services and quality of life. The themes span areas including Economy, Education, Health and Safety. The main goal of this standard is to provide a concise set of well-defined global indicators that any city can use to measure itself. Moreover, cities that adhere to the standard are able to compare themselves and evaluate how well they are doing in comparison to others. Relying on the RDF model and making use of the ISO standard, the PolisGnosis Project [10] is the final goal of an ongoing effort by the University of Toronto. The project aims at the following:

* To provide a description of all 100 ISO indicators in terms of ontologies for the Semantic Web;
* To develop an engine capable of performing analysis in order to discover the root causes of why indicators change over time for a given city and why they differ between cities.

At the time of writing, the PolisGnosis Project has focused largely on the engineering of the GCI ontology [11] as a standard way to publish the ISO indicator values.

===4 Research questions===

Given the identified challenges and limitations presented in the previous sections, we have formulated the following research questions that we intend to answer:

* Q1: Can ontologies be used to successfully bridge the knowledge gap between acquired scientific data and data users? If so, how? Existing scientific domain ontologies are not aligned with data analytics requirements. For instance, when calculating indicators, related concepts may suggest that a certain data point should be taken into account, and these relations are not always present because the ontology was developed for another purpose.
* Q2: Will data users and applications benefit from the use of the knowledge behind each scientific data point? Common search mechanisms only index dataset metadata, returning complete datasets that may come in a plethora of different formats. (The sketch after this list illustrates what such per-data-point knowledge can look like.)
* Q3: How can data access to scientific KGs be provided in a way that can be consumed by routine data tools while making use of the attached knowledge to facilitate analytics? Current RDF serialization formats include Turtle, JSON-LD and RDF/XML, which are not suited for most data tools, while SPARQL querying requires prior knowledge of the ontologies used in the KG.
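To illustrate what the knowledge behind a scientific data point (Q2) can look like, the following toy sketch, again not taken from the proposal, uses rdflib to describe a single bus GPS reading together with its producing instrument, observed characteristic and a PROV-O provenance statement; the instance URIs and the SOSA/PROV-O modeling choices are assumptions made here for illustration only.

<syntaxhighlight lang="python">
# Toy example: a single scientific data point together with the knowledge
# behind it (producing instrument, observed characteristic, provenance).
# The URIs and the SOSA/PROV-O modeling choices are illustrative assumptions.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

SOSA = Namespace("http://www.w3.org/ns/sosa/")
PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/transit/")  # hypothetical KG namespace

g = Graph()
g.bind("sosa", SOSA)
g.bind("prov", PROV)
g.bind("ex", EX)

obs = EX["obs/42"]
g.add((obs, RDF.type, SOSA.Observation))
g.add((obs, SOSA.madeBySensor, EX["gps/bus-301"]))       # which instrument
g.add((obs, SOSA.observedProperty, EX.vehicleSpeed))     # what was measured
g.add((obs, SOSA.hasSimpleResult, Literal(27.4, datatype=XSD.double)))
g.add((obs, PROV.wasGeneratedBy, EX["acquisition/2017-04-02"]))  # provenance

print(g.serialize(format="turtle"))
</syntaxhighlight>

Statements of this kind are exactly what dataset-level metadata search does not expose; the questions above ask whether surfacing them at the data-point level pays off for analytics.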
===5 Hypotheses===

Our hypotheses derive directly from the questions above:

* H1: The reuse of scientific data ontologies, with proper extensions and alignments to domain ontologies, can mitigate the loss of knowledge during data acquisition.
* H2: Providing data points together with their knowledge (e.g. provenance, contextual knowledge) to data users and applications can facilitate data analytics.
* H3: A hybrid RDF serialization format that suits the needs of existing data tools while still conveying knowledge can be used to serialize data from KGs together with its associated metadata.
* H4: A query API for scientific KGs can also be used to output data together with its associated metadata, facilitating data analytics.

===6 Preliminary results===

In [24], we described our first approach to a process of data acquisition and KG building in the context of urban mass transportation, where data was produced by GPS devices deployed on buses in the city of Fortaleza, Brazil. The resulting KG was suited for metadata-driven faceted search over the data, which enabled a better understanding of the data contained in the KG through explicit information about context and provenance. This work was our first attempt at putting enough relevant metadata in place, accomplished through our HAScO ontology (http://hadatac.org/ont/hasco#), which evolved from HASNetO [21], as a way of describing the content and context of the acquired data.

Subsequently, in [23], we presented an operational description of a KG that supports automatic generation of dashboards, along with an indicator ontology that supports data visualization techniques. This work extends the previous one by providing a first data analytics use case, where data in the KG is used to produce rich visualizations in dashboards that are built automatically from the knowledge we have put in place in the KG and from the indicator ontology.

Both works make use of the proposed CCSV (Contextualized CSV) format, which we designed to support not only raw files from data acquisition instruments before they are turned into knowledge in the KG, but also to serve as an output format for data in KGs. A CCSV file is a regular CSV file with a Turtle preamble on top of it, which links the file contents (records and columns) to a domain ontology, thus preserving the semantics associated with the data. The CCSV format has shown promising results as a way to bridge the gap between KGs and data tools (a reading sketch follows below).
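Since the exact CCSV syntax is not specified here, the following is only a reading sketch under an assumed convention, namely that the Turtle preamble and the CSV body are separated by the first blank line; the file name and the separator are assumptions. It parses the preamble with rdflib and the tabular part with pandas, which is the kind of dual consumption the format is meant to enable.

<syntaxhighlight lang="python">
# Reading sketch for a CCSV-like file (Turtle preamble + regular CSV body).
# ASSUMPTION: the preamble and the body are separated by the first blank line;
# the actual CCSV convention may differ.
import io
import pandas as pd
from rdflib import Graph

def read_ccsv(path: str):
    """Return (metadata_graph, dataframe) for a CCSV-like file."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    preamble, _, body = text.partition("\n\n")  # assumed separator
    metadata = Graph().parse(data=preamble, format="turtle")
    table = pd.read_csv(io.StringIO(body))
    return metadata, table

if __name__ == "__main__":
    meta, df = read_ccsv("observations.ccsv")  # hypothetical file name
    print(len(meta), "metadata triples;", len(df), "data rows")
</syntaxhighlight>

A knowledge-aware tool could inspect the metadata graph, for example to check whether two columns measure the same characteristic, before handing the table to its regular pipeline.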
===7 Approach===

The main idea behind our approach is to provide a specification for the construction of a scientific KG, along with processes in support of data analytics. The key innovation and novel contribution is the ability to use knowledge in the KG to provide data access and to prepare datasets for data analytics based on user queries.

In order to tackle Q1, we are gathering scientific data analytics requirements by working in conjunction with data users in two domain areas: environmental and urban. Based on these requirements, we are reusing and extending ontologies that we believe will be capable of composing a base knowledge layer that can be exploited by processes aiming to facilitate data analytics. Currently, we are using HAScO as the base ontology for a scientific KG, as it makes use of PROV-O for provenance tracking alongside VSTO-I [12], with proper extensions for registering contextual knowledge.

For Q2, in turn, we are creating processes that retrieve the desired data points by using the knowledge in our scientific HAScO-based KGs, driven by user queries. With that, we intend data users to have direct access to data points instead of complete datasets, in order to produce more reliable data analytics.

Data in KGs is in triple format, which is good for representing knowledge but not for data analytics tools, which most of the time expect tabular data. For Q3, we are working with two distinct approaches. First, the CCSV format discussed in the previous section is able to serialize data from the KG together with its associated knowledge and to serve as an input format for intelligent applications that can take advantage of it, as demonstrated by our preliminary results. We are continuously expanding the format to handle new data analytics use cases. Secondly, we are also studying how to provide a programmatic way of accessing the desired data for analytics from tools that support this feature (a sketch of one possible shape for such an access layer follows).
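The programmatic access layer is still being designed, so the sketch below shows only one possible shape for it, reusing the same SPARQL access mechanics as the earlier sketch: data points come back as a table, and the knowledge attached to any returned resource can be fetched on demand, in the spirit of H4. The class name, endpoint and ontology terms are all hypothetical.

<syntaxhighlight lang="python">
# Sketch of one possible query API for a scientific KG: data points come back
# as a table, and the knowledge attached to any resource can be fetched on
# demand. Endpoint, class name and ontology terms are hypothetical.
import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON

class ScientificKG:
    def __init__(self, endpoint: str):
        self.client = SPARQLWrapper(endpoint)
        self.client.setReturnFormat(JSON)

    def _select(self, query: str) -> pd.DataFrame:
        """Run a SELECT query and return the bindings as a DataFrame."""
        self.client.setQuery(query)
        bindings = self.client.query().convert()["results"]["bindings"]
        return pd.DataFrame([{v: c["value"] for v, c in b.items()} for b in bindings])

    def datapoints(self, observed_property: str) -> pd.DataFrame:
        """Tabular values for one observed characteristic (SOSA-style modeling assumed)."""
        return self._select(f"""
            PREFIX sosa: <http://www.w3.org/ns/sosa/>
            SELECT ?sensor ?value ?time WHERE {{
              ?obs sosa:observedProperty <{observed_property}> ;
                   sosa:madeBySensor    ?sensor ;
                   sosa:hasSimpleResult ?value ;
                   sosa:resultTime      ?time .
            }}""")

    def context(self, resource: str) -> pd.DataFrame:
        """All statements attached to a resource, e.g. a sensor's characteristics."""
        return self._select(f"SELECT ?p ?o WHERE {{ <{resource}> ?p ?o . }}")

# Usage (hypothetical endpoint and property URI):
# kg = ScientificKG("http://example.org/sparql")
# speeds = kg.datapoints("http://example.org/transit/vehicleSpeed")
# sensor_info = kg.context(speeds.loc[0, "sensor"])
</syntaxhighlight>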
===8 Evaluation plan===

We intend to validate H1 using the state-of-the-art KG evaluation approaches discussed in [20]. For H2, we are gathering data analytics use cases and assessing how the associated metadata facilitates the use of the data. Ultimately, for H3 and H4, we intend to perform tests with data scientists and field specialists acting as users of our proposed KG and processes. Using their data (preferably from different studies and sources), we intend to build a scientific KG enriched with the relevant metadata and then provide them with tools for querying the data and preparing datasets for their routine data analytics. Questionnaires will then be applied to measure how much our approach has eased their tasks in contrast with their regular processes.

===9 Reflections===

To conclude, we have identified from the state-of-the-art approaches that the task of promoting data analytics from scientific data in KGs is still in its early stages. With this research, we intend to push this forward by proposing a KG specification that is not only capable of tracking the contextual knowledge that is lost during data acquisition activities but is also aligned with data analytics requirements. More than that, we intend to exploit knowledge in KGs to return data points directly related to user queries instead of complete datasets. Given this, we expect the outcome of this research to dramatically decrease data preparation efforts.

===Acknowledgments===

Advised by Prof. Vasco Furtado. Further thanks to Dr. Paulo Pinheiro and Prof. Deborah L. McGuinness for their cooperation and invaluable feedback on this work.

===References===

1. Arenas, M., Cuenca Grau, B., Kharlamov, E., Marciuska, S., Zheleznyakov, D.: Faceted search over RDF-based knowledge graphs. Web Semantics: Science, Services and Agents on the World Wide Web 37–38, 55–74 (Mar 2016)
2. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G.: Gene Ontology: tool for the unification of biology. Nature Genetics 25(1), 25–29 (May 2000)
3. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: A Nucleus for a Web of Open Data. In: The Semantic Web, pp. 722–735. No. 4825 in Lecture Notes in Computer Science (Jan 2007)
4. Beisswanger, E., Schulz, S., Stenzhorn, H., Hahn, U.: BioTop: An upper domain ontology for the life sciences. Applied Ontology 3(4), 205–212 (Jan 2008)
5. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. pp. 1247–1250 (Jun 2008)
6. Callahan, A., Cruz-Toledo, J., Ansell, P., Dumontier, M.: Bio2RDF Release 2: Improved Coverage, Interoperability and Provenance of Life Science Linked Data. In: The Semantic Web: Semantics and Big Data. pp. 200–212 (May 2013)
7. Chen, Y.N., Wang, W.Y., Rudnicky, A.: Jointly Modeling Inter-Slot Relations by Random Walk on Knowledge Graphs for Unsupervised Spoken Language Understanding. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 619–629 (Jun 2015)
8. Duan, W., Chiang, Y.Y.: Building Knowledge Graph from Public Data for Predictive Analysis: A Case Study on Predicting Technology Future in Space and Time. In: Proceedings of the 5th ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data. pp. 7–13 (Oct 2016)
9. Ernst, P., Siu, A., Weikum, G.: KnowLife: a versatile approach for constructing a large knowledge graph for biomedical sciences. BMC Bioinformatics 16, 157 (May 2015)
10. Fox, M.S.: PolisGnosis Project: Representing and Analysing City Indicators. Working Paper, Enterprise Integration Laboratory, University of Toronto (May 2015), http://eil.utoronto.ca/wp-content/uploads/smartcities/papers/PolisGnosis.pdf
11. Fox, M.S.: The role of ontologies in publishing and analyzing city indicators. Computers, Environment and Urban Systems 54, 266–279 (Nov 2015)
12. Fox, P., McGuinness, D.L., Cinquini, L., West, P., Garcia, J., Benedict, J.L., Middleton, D.: Ontology-supported scientific data frameworks: The Virtual Solar-Terrestrial Observatory experience. Computers & Geosciences 35(4), 724–738 (Apr 2009)
13. Furtado, V., Caminha, C., Furtado, E., Lopes, A., Dantas, V., Ponte, C., Cavalcante, S.: Increasing the Likelihood of Finding Public Transport Riders that Face Problems Through a Data-Driven Approach. arXiv:1705.03504 [cs] (Apr 2017)
14. ISO: Sustainable development of communities – Indicators for city services and quality of life. ISO 37120:2014, International Organization for Standardization (May 2014), http://www.iso.org/iso/catalogue_detail?csnumber=62436
15. Jeffery, S.R., Alonso, G., Franklin, M.J., Hong, W., Widom, J.: A Pipelined Framework for Online Cleaning of Sensor Data Streams. In: 22nd International Conference on Data Engineering (ICDE'06). pp. 140–140 (Apr 2006)
16. Le-Phuoc, D., Nguyen Mau Quoc, H., Ngo Quoc, H., Tran Nhat, T., Hauswirth, M.: The Graph of Things: A step towards the Live Knowledge Graph of connected things. Web Semantics: Science, Services and Agents on the World Wide Web 37–38, 25–35 (Mar 2016)
17. Lopez, V., Tommasi, P., Kotoulas, S., Wu, J.: QuerioDALI: Question Answering Over Dynamic and Linked Knowledge Graphs. In: The Semantic Web – ISWC 2016. pp. 363–382. Lecture Notes in Computer Science (Oct 2016)
18. Nickel, M., Murphy, K., Tresp, V., Gabrilovich, E.: A Review of Relational Machine Learning for Knowledge Graphs. Proceedings of the IEEE 104(1), 11–33 (Jan 2016)
19. Patil, D.J.: Data Jujitsu: The Art of Turning Data into Product. O'Reilly Media, 1 edn. (Nov 2012)
20. Paulheim, H.: Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8(3), 489–508 (Jan 2017)
21. Pinheiro, P., McGuinness, D.L., Santos, H.: Human-Aware Sensor Network Ontology: Semantic Support for Empirical Data Collection. In: Proceedings of the 5th Workshop on Linked Science. Bethlehem, PA, USA (Oct 2015)
22. Puiu, D., Barnaghi, P., Tönjes, R., Kümper, D., Ali, M.I., Mileo, A., Parreira, J.X., Fischer, M., Kolozali, S., Farajidavar, N., Gao, F., Iggena, T., Pham, T.L., Nechifor, C.S., Puschmann, D., Fernandes, J.: CityPulse: Large Scale Data Analytics Framework for Smart Cities. IEEE Access 4, 1086–1108 (2016)
23. Santos, H., Dantas, V., Furtado, V., Pinheiro, P., McGuinness, D.L.: From Data to City Indicators: A Knowledge Graph for Supporting Automatic Generation of Dashboards. In: The Semantic Web. pp. 94–108 (May 2017)
24. Santos, H., Furtado, V., Pinheiro, P., McGuinness, D.L.: Contextual Data Collection for Smart Cities. In: Proceedings of the Sixth Workshop on Semantics for Smarter Cities. Bethlehem, PA, USA (Oct 2015)
25. Suchanek, F.M., Kasneci, G., Weikum, G.: YAGO: A Large Ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web 6(3), 203–217 (Sep 2008)
26. Vrandečić, D., Krötzsch, M.: Wikidata: A Free Collaborative Knowledgebase. Commun. ACM 57(10), 78–85 (2014)