Instrumenting Continuous Knowledge Extraction, Sharing, and Benchmarking

Marco Brambilla, Emanuele Della Valle, Andrea Mauri, and Riccardo Tommasini

Politecnico di Milano, DEIB, Data Science Lab. Via Ponzio 34/5, I-20133, Milano, Italy
{name.surname}@polimi.it

Abstract. Keeping pace with the ever faster evolution of knowledge is becoming a challenge, especially for researchers and knowledge workers. We propose a vision towards a set of (possibly integrated) publicly available tools that can help with this. To this purpose, we call for tools that improve the effectiveness of knowledge extraction, storage, analysis, publishing, and experimental benchmarking. This could be extremely beneficial for the entire research community across fields and interests. We describe our vision in this direction and demonstrate its feasibility with some exemplary tools that we developed and shared as public resources to be used by the research community.

1 Introduction

Nanos gigantum humeris insidentes (Bernard of Chartres, ca. 1115 AD)

Science aims at creating new knowledge on top of the existing one, through the observation of physical phenomena, their modeling, and empirical validation. This combines the well-known motto "standing on the shoulders of giants" (attributed to Bernard of Chartres and subsequently rephrased by Isaac Newton) with the need to try and validate new experiments. However, knowledge in the world continuously evolves, at a pace that cannot be traced even by large crowdsourced bodies of knowledge such as Wikipedia. A large share of the generated data is never analysed and consolidated into exploitable information and knowledge [1]. In particular, the process of ontological knowledge discovery tends to focus on the most popular items, those which are most often quoted or referenced, and is less effective in discovering less popular items belonging to the so-called long tail, i.e., the portion of the entity distribution having fewer occurrences [2].

This becomes a challenge for practitioners, enterprises, and scholars/researchers, who need to stay up to date with innovation and emerging facts. The scientific community also needs to make sure there is a structured and formal way to represent, store, and access such knowledge, for instance as ontologies or linked data sources.

Our idea is to propose a vision towards a set of (possibly integrated) publicly available tools that can help scholars keep pace with the evolving knowledge. This implies the capability of integrating informal sources, such as social networks, blogs, and user-generated content in general. One can conjecture that somewhere, within the massive content shared by people online, any low-frequency, emerging concept or fact has left some traces. The challenge is to detect such traces, assess their relevance and trustworthiness, and transform them into formalized knowledge [5].

An appropriate set of tools that improve the effectiveness of knowledge extraction, storage, analysis, publishing, and experimental benchmarking could be extremely beneficial for the entire research community across fields and interests.

2 Our Vision towards Continuous Knowledge Extraction and Publishing

We foresee a paradigm where knowledge seeds can be planted and subsequently grown, finally leading to the generation and collection of new knowledge, as depicted in the exemplary process in Figure 1 [2].

Fig. 1. Exemplary process of knowledge seeding, growing, and harvesting for extracting concepts from social media.

We advocate for a set of tools that, once implemented and integrated, enable the following prospective reality:

– the possibility of selecting any kind of source of raw data, independently of its format, type, or semantics (spanning quantitative data, textual content, and multimedia content), covering both data streams and pull-based data sources;
– the possibility of applying different data cleaning and data analysis pipelines to the different sources, in order to increase data quality and the level of abstraction/aggregation;
– the possibility of integrating the selected sources;
– the possibility of running homogeneous knowledge extraction processes over the integrated sources;
– the possibility of publishing the results of the analysis and semantic enrichment as new and further (richer) data sources and streams, in a coherent, standard, and semantic way.

This enables the generation of new sources, which in turn can be used in subsequent knowledge extraction processes of the same kind; a minimal sketch of this feedback loop is given below. The results of this process must be available at any stage, to be shared for building an open, integrated, and continuously evolving body of knowledge for research, innovation, and dissemination purposes.
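To make the loop concrete, the following is a minimal sketch of how such a composable pipeline could be abstracted, assuming a simple stage-based design; all names here (Source, Stage, run_round) are hypothetical and do not refer to any existing implementation.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable

class Source(ABC):
    """A raw data source: a push-based stream or a pull-based dataset."""
    @abstractmethod
    def records(self) -> Iterable[Any]: ...

class Stage(ABC):
    """One processing step: cleaning, analysis, integration, or extraction."""
    @abstractmethod
    def apply(self, records: Iterable[Any]) -> Iterable[Any]: ...

class MaterializedSource(Source):
    """A published result, consumable again as a new source."""
    def __init__(self, items: list):
        self.items = items

    def records(self) -> Iterable[Any]:
        return iter(self.items)

def run_round(sources: list[Source], stages: list[Stage]) -> Source:
    """Integrate the sources, apply the stages in order, and publish the
    enriched output as a new Source that can feed later rounds."""
    merged: Iterable[Any] = (r for s in sources for r in s.records())
    for stage in stages:
        merged = stage.apply(merged)
    return MaterializedSource(list(merged))
```

The key design point is that the output of a round is itself a Source, so published results can directly feed subsequent extraction rounds, as the vision requires.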
3 A Preliminary Feasibility Perspective

While beneficial and powerful, the vision we propose is far from being achieved today. However, we are convinced that it is not out of reach in the mid term. To give a hint of this, we report here our experience with the research, design, and implementation of a few tools that point in the proposed direction:

1. Social Knowledge Extractor (SKE) is a publicly available tool for discovering emerging knowledge by extracting it from social content. Once instrumented by experts through a very simple initialization, the tool is capable of finding emerging entities by means of a mixed syntactic-semantic method. The method uses seeds, i.e., prototypes of emerging entities provided by experts, to generate candidates; it then associates each candidate with a feature vector, built from the terms occurring in its social content, and ranks the candidates by their distance from the centroid of the seeds, returning the top candidates as result (this ranking step is sketched in code after this list). The tool can run continuously or in periodic iterations, using the results as new seeds. Our research on this has been published in [3]; a simplified implementation is currently available online for demo purposes at http://datascience.deib.polimi.it/social-knowledge/, and the code is available as open source under an Apache 2.0 license on GitHub at https://github.com/DataSciencePolimi/social-knowledge-extractor.

2. TripleWave is a tool for disseminating and exchanging RDF streams on the Web. To support processing information streams in real time and at Web scale, TripleWave integrates nicely with RDF Stream Processing (RSP) and Stream Reasoning (SR) as solutions that combine semantic technologies with stream and event processing techniques. In particular, it integrates with an existing ecosystem of solutions to query, reason, and perform real-time processing over heterogeneous and distributed data streams. TripleWave can be fed with existing Web streams (e.g., Twitter and Wikipedia streams) or time-annotated RDF datasets (e.g., the Linked Sensor Data dataset), and it can be invoked through both pull- and push-based mechanisms, thus enabling RSP engines to automatically register and receive data from TripleWave (a pull-based consumer is also sketched below). The tool has been described in [4], and the code is available as open source on GitHub at https://github.com/streamreasoning/TripleWave/.

3. RSPlab enables the efficient design and execution of reproducible experiments, as well as the sharing of their results. It integrates two existing RSP benchmarks (LSBench and CityBench) and two RSP engines (the C-SPARQL engine and CQELS). It provides a programmatic environment to deploy RDF streams and RSP engines in the cloud, interact with them using TripleWave and RSP Services, continuously monitor their performance, and collect statistics (a hypothetical experiment driver in this spirit closes the sketches below). RSPlab is released as open source under an Apache 2.0 license; it is currently under submission to the ISWC Resources Track and is available on GitHub at https://github.com/streamreasoning/rsplab.
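The centroid-based ranking at the core of SKE (item 1) can be illustrated with a short sketch. This is only an illustration of the idea described in [3], assuming TF-IDF bag-of-words feature vectors; it is not the actual SKE implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def rank_candidates(seed_docs, candidate_docs, top_k=5):
    """seed_docs: one text per seed entity (its social content);
    candidate_docs: dict mapping candidate name -> its social content."""
    names = list(candidate_docs)
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(seed_docs + [candidate_docs[n] for n in names])
    seed_vecs, cand_vecs = X[:len(seed_docs)], X[len(seed_docs):]
    centroid = np.asarray(seed_vecs.mean(axis=0)).ravel()
    ranked = []
    for i, name in enumerate(names):
        v = cand_vecs[i].toarray().ravel()
        denom = np.linalg.norm(v) * np.linalg.norm(centroid)
        cosine = float(v @ centroid) / denom if denom else 0.0
        ranked.append((1.0 - cosine, name))  # cosine distance from seed centroid
    ranked.sort()
    return [name for _, name in ranked[:top_k]]
```

Candidates returned this way can be fed back as new seeds, which is what enables the continuous, iterative mode of the tool.

A pull-based consumer of a TripleWave stream (item 2) can be sketched as follows; the endpoint URL is hypothetical, and the assumption that each pull returns one JSON-LD document is made only for illustration.

```python
import json
import time
import urllib.request

STREAM_URL = "http://example.org/triplewave/stream"  # hypothetical endpoint

def poll_stream(url, interval=1.0, max_polls=10):
    """Pull-based consumption: repeatedly fetch the latest time-annotated
    graph. A push-based consumer would subscribe instead of polling."""
    for _ in range(max_polls):
        with urllib.request.urlopen(url) as resp:
            yield json.load(resp)  # assumed: one JSON-LD document per pull
        time.sleep(interval)

for graph in poll_stream(STREAM_URL):
    # hand each timestamped graph to an RSP engine or stream reasoner here
    print("received graph with", len(graph.get("@graph", [])), "nodes")
```

Finally, the kind of programmatic experiment control offered by RSPlab (item 3) can be conveyed by a hypothetical driver; none of the names below come from the actual RSPlab API.

```python
import time

class ExperimentDriver:
    """Hypothetical harness: deploy an engine against a stream, register
    continuous queries, then periodically sample performance statistics."""

    def __init__(self, engine_name, stream_url, queries):
        self.engine_name = engine_name
        self.stream_url = stream_url
        self.queries = queries
        self.samples = []

    def run(self, duration_s=30, sample_every_s=5):
        # deployment and query registration are stubbed out with prints
        print(f"deploying {self.engine_name} against {self.stream_url}")
        for q in self.queries:
            print(f"registering query: {q}")
        start = time.time()
        while time.time() - start < duration_s:
            # a real harness would record latency/throughput/memory here
            self.samples.append({"elapsed_s": time.time() - start})
            time.sleep(sample_every_s)
        return self.samples
```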
4 Conclusions

We believe that knowledge intake by scholars is going to become more and more time consuming and expensive, due to the amount of knowledge that is built and shared every day. We envision a comprehensive approach based on integrated tools for data collection, cleaning, integration, analysis, and semantic representation, which can run continuously to keep formalized knowledge bases aligned with the evolution of knowledge, at limited cost and with high recall on the facts and concepts that emerge or decay. These tools do not need to be implemented by the same vendor or provider; we instead advocate for open-source publishing of all the implementations, as well as for the definition of an agreed-upon integration platform that allows them all to integrate appropriately.

5 Outlook on Research Resource Sharing

As we envision an ecosystem that includes, but is not limited to, modules for extraction, sharing, and benchmarking, two research questions require investigation in the immediate future.

First, how can we design and publish new resources for such an ecosystem, and do some of them exist already? It is important to understand what else is available out there. Researchers commonly support their scientific studies with resources that, if released, can benefit the whole community. The release process must comply with a scientific method that ensures repeatability and reproducibility; however, a standard, agreed-upon methodology that guides this process does not exist yet.

Second, how should we combine these resources into shared research workflows? To investigate this question, we need a platform that enables researchers to deploy their resources and interact with the ecosystem. Therefore, we call for an open discussion about how this integration should be done.
References

1. Ackoff, R.L.: From data to wisdom. Journal of Applied Systems Analysis 16(1), 3–9 (1989)
2. Brambilla, M., Ceri, S., Daniel, F., Della Valle, E.: On the quest for changing knowledge. In: Proceedings of the Workshop on Data-Driven Innovation on the Web (DDI '16). ACM Press (2016), https://doi.org/10.1145/2911187.2914582
3. Brambilla, M., Ceri, S., Della Valle, E., Volonterio, R., Salazar, F.X.A.: Extracting Emerging Knowledge from Social Media. In: Proceedings of the 26th International Conference on World Wide Web (WWW '17). ACM Press (2017), https://doi.org/10.1145/3038912.3052697
4. Mauri, A., Calbimonte, J.P., Dell’Aglio, D., Balduini, M., Brambilla, M., Della Valle, E., Aberer, K.: TripleWave: Spreading RDF Streams on the Web. In: Lecture Notes in Computer Science, pp. 140–149. Springer International Publishing (2016), https://doi.org/10.1007/978-3-319-46547-0_15
5. Stieglitz, S., Dang-Xuan, L., Bruns, A., Neuberger, C.: Social Media Analytics. Business & Information Systems Engineering 6(2), 89–96 (2014), https://doi.org/10.1007/s12599-014-0315-7