<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Instrumenting Continuous Knowledge Extraction, Sharing, and Benchmarking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Brambilla</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuele Della Valle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Mauri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riccardo Tommasini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Milano, DEIB, Data Science Lab. Via Ponzio 34/5</institution>
          ,
          <addr-line>I-20133, Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Keeping pace with the ever faster evolution of knowledge is becoming a challenge, especially for researchers and knowledge workers. We propose a vision of a set of (possibly integrated) publicly available tools that can help with this. To this end, we call for tools that improve the effectiveness of knowledge extraction, storage, analysis, publishing and experimental benchmarking. Such tools could be extremely beneficial for the entire research community, across fields and interests. We describe our vision and demonstrate its feasibility with some exemplary tools that we developed and shared as public resources for the research community.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Nanos gigantum humeris insidentes
(Bernard of Chartres, 1115 AD ca.)</p>
      <p>Science aims at creating new knowledge on top of existing knowledge, through the
observation of physical phenomena, their modeling and empirical validation.
This combines the well-known motto "standing on the shoulders of giants"
(attributed to Bernard of Chartres and subsequently rephrased by Isaac Newton)
with the need to try and validate new experiments.</p>
      <p>
        However, knowledge in the world continuously evolves, at a pace that cannot
be tracked even by large crowdsourced bodies of knowledge such as Wikipedia.
A large share of the data being generated is not currently analysed and consolidated
into exploitable information and knowledge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In particular, the process of
ontological knowledge discovery tends to focus on the most popular items, those
which are most often quoted or referenced, and is less effective in discovering less
popular items belonging to the so-called long tail, i.e. the portion of the entity
distribution having fewer occurrences [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>This becomes a challenge for practitioners, enterprises, scholars and
researchers, who need to stay up to date with innovation and emerging facts. The
scientific community also needs to make sure there is a structured and formal
way to represent, store and access such knowledge, for instance as ontologies or
linked data sources.</p>
      <p>
        We propose a vision of a set of (possibly integrated)
publicly available tools that can help scholars keep pace with
evolving knowledge. This implies the capability of integrating informal
sources, such as social networks, blogs, and user-generated content in general.
One can conjecture that somewhere, within the massive content shared by people
online, any low-frequency, emerging concept or fact has left some traces. The
challenge is to detect such traces, assess their relevance and trustworthiness,
and transform them into formalized knowledge [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>An appropriate set of tools that can improve effectiveness of knowledge
extraction, storage, analysis, publishing and experimental benchmarking could be
extremely beneficial for the entire research community across fields and interests.</p>
    </sec>
    <sec id="sec-2">
      <title>Our Vision towards Continuous Knowledge Extraction and Publishing</title>
      <p>
        We foresee a paradigm where knowledge seeds can be planted, and subsequently
grow, finally leading to the generation and collection of new knowledge, as
depicted in the exemplary process in Figure 1 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>We advocate for a set of tools that, when implemented and integrated, enable
the following prospective scenario:
        <list list-type="bullet">
          <list-item>
            <p>selecting any kind of raw data source, independently of its
format, type or semantics (spanning quantitative data, textual content and
multimedia content), covering both data streams and pull-based data sources;</p>
          </list-item>
          <list-item>
            <p>applying different data cleaning and data analysis pipelines
to the different sources, in order to increase data quality and the level of
abstraction and aggregation;</p>
          </list-item>
          <list-item>
            <p>integrating the selected sources;</p>
          </list-item>
          <list-item>
            <p>running homogeneous knowledge extraction processes over the
integrated sources;</p>
          </list-item>
          <list-item>
            <p>publishing the results of the analysis and semantic enrichment
as new and richer data sources and streams, in a coherent, standard
and semantic way.</p>
          </list-item>
        </list>
      </p>
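      <p>The steps listed above can be read as a chain of composable stages. The following is a minimal sketch of that idea only, not an implementation of any of the tools discussed in this paper: the Record type, the stage functions and the toy "capitalised token" extraction rule are all hypothetical names and simplifications introduced for illustration.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


# Hypothetical record type: one item coming from any raw source.
@dataclass(frozen=True)
class Record:
    source: str
    text: str


def clean(records: Iterable[Record]) -> Iterator[Record]:
    """Data cleaning: normalise whitespace and drop empty items."""
    for r in records:
        text = " ".join(r.text.split())
        if text:
            yield Record(r.source, text)


def integrate(*sources: Iterable[Record]) -> Iterator[Record]:
    """Integration: merge several cleaned sources into a single stream."""
    for src in sources:
        yield from src


def extract(records: Iterable[Record]) -> Iterator[Record]:
    """Toy extraction (assumption): capitalised tokens are candidate entities."""
    for r in records:
        for token in r.text.split():
            if token[:1].isupper():
                yield Record(r.source, token.strip(".,;:"))


def publish(records: Iterable[Record]) -> List[Record]:
    """Publishing: expose the results as a new source (here, a plain list)."""
    return list(records)


# Two made-up sources, cleaned separately, then integrated and processed.
blog = [Record("blog", "  TripleWave   streams RDF. "), Record("blog", "   ")]
social = [Record("social", "trying RSPlab today")]
published = publish(extract(integrate(clean(blog), clean(social))))
print([r.text for r in published])
```
      </preformat>
      <p>Because the published result has the same shape as the inputs, it can be passed to clean, integrate or extract again, which mirrors the idea that generated sources feed subsequent extraction processes.</p>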
      <p>This enables the generation of new sources, which in turn can be used in
subsequent knowledge extraction processes of the same kind. The results of this
process must be available at any stage, so that they can be shared to build an open,
integrated and continuously evolving body of knowledge for research, innovation
and dissemination purposes.</p>
    </sec>
    <sec id="sec-3">
      <title>A Preliminary Feasibility Perspective</title>
      <p>
        Whilst beneficial and powerful, the vision we propose is far from being achieved
today. However, we are convinced that it is not out of reach in the
mid term. To give a hint of this, we report here our experience with the research,
design and implementation of a few tools that point in the proposed direction:
        <list list-type="order">
          <list-item>
            <p>Social Knowledge Extractor (SKE) is a publicly available tool for
discovering emerging knowledge by extracting it from social
content. Once instrumented by experts through a very simple initialization, the
tool is capable of finding emerging entities by means of a mixed
syntactic-semantic method. The method uses seeds, i.e. prototypes of emerging entities
provided by experts, for generating candidates; then, it associates candidates
with feature vectors, built from the terms occurring in their social content, and
ranks the candidates by their distance from the centroid of the seeds,
returning the top candidates as the result. The tool can run continuously or
with periodic iterations, using the results as new seeds. Our research on this
has been published in [
              <xref ref-type="bibr" rid="ref3">3</xref>
              ]; a simplified implementation is currently available
online for demo purposes at
http://datascience.deib.polimi.it/social-knowledge/,
and the code is available as open source under an Apache 2.0 license on
GitHub at
https://github.com/DataSciencePolimi/social-knowledge-extractor.</p>
          </list-item>
          <list-item>
            <p>TripleWave is a tool for disseminating and exchanging RDF streams
on the Web. For the purpose of processing information streams in
real time and at Web scale, TripleWave integrates with RDF Stream
Processing (RSP) and Stream Reasoning (SR) as solutions to combine
semantic technologies with stream and event processing techniques. In
particular, it integrates with an existing ecosystem of solutions to query, reason
and perform real-time processing over heterogeneous and distributed data
streams. TripleWave can be fed with existing Web streams (e.g. Twitter and
Wikipedia streams) or time-annotated RDF datasets (e.g. the Linked
Sensor Data dataset), and it can be invoked through both pull- and push-based
mechanisms, thus enabling RSP engines to automatically register and
receive data from TripleWave. The tool has been described in [
              <xref ref-type="bibr" rid="ref4">4</xref>
              ] and the code
is available as open source on GitHub at
https://github.com/streamreasoning/TripleWave/.</p>
          </list-item>
          <list-item>
            <p>RSPlab enables efficient design and execution of reproducible
experiments, as well as sharing of the results. It integrates two existing RSP
benchmarks (LSBench and CityBench) and two RSP engines (C-SPARQL
engine and CQELS). It provides a programmatic environment to: deploy
RDF streams and RSP engines in the cloud; interact with them using
TripleWave and RSP Services; continuously monitor their performance and collect
statistics. RSPlab is released as open source under an Apache 2.0 license, is
currently under submission at the ISWC Resources Track, and is available on
GitHub at
https://github.com/streamreasoning/rsplab.</p>
          </list-item>
        </list>
      </p>
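      <p>The seed-based ranking described for SKE above can be illustrated with a small sketch. It is not the SKE code: the text only states that candidates are ranked by distance from the centroid of the seeds, so the bag-of-words term counts, the cosine distance and all function and entity names below are assumptions made for illustration.</p>
      <preformat>
```python
import math
from collections import Counter


def feature_vector(posts):
    """Term counts over an entity's social content (assumed weighting)."""
    return Counter(word for post in posts for word in post.lower().split())


def centroid(vectors):
    """Component-wise mean of sparse vectors; expects a non-empty list."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine_distance(a, b):
    """1 - cosine similarity; the metric choice is an assumption."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def rank_candidates(seed_posts, candidate_posts, top_k):
    """Return the top_k candidate names closest to the centroid of the seeds."""
    seed_centroid = centroid([feature_vector(p) for p in seed_posts.values()])
    scored = [
        (cosine_distance(feature_vector(posts), seed_centroid), name)
        for name, posts in candidate_posts.items()
    ]
    return [name for _, name in sorted(scored)[:top_k]]


# Made-up seeds (expert-provided prototypes) and candidates.
seeds = {
    "designer_a": ["new fashion collection runway", "fashion week milan"],
    "designer_b": ["runway show fashion brand"],
}
candidates = {
    "designer_c": ["fashion runway debut collection"],
    "football_club": ["match goal league"],
    "designer_d": ["milan fashion"],
}
ranked = rank_candidates(seeds, candidates, top_k=2)
print(ranked)
```
      </preformat>
      <p>To mimic the periodic iterations mentioned above, the returned candidates and their content could be added to the seed dictionary before the next call.</p>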
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>We believe that keeping up with new knowledge is going to become more and
more time-consuming and expensive for scholars, due to the amount of knowledge that is
being built and shared every day. We envision a comprehensive approach based
on integrated tools for data collection, cleaning, integration, analysis and
semantic representation that can be run continuously to keep
formalized knowledge bases aligned with the evolution of knowledge, with
limited cost and high recall on the facts and concepts that emerge
or decay. These tools do not need to be implemented by the same vendor or
provider; we instead advocate for open-source publishing of all the
implementations, as well as for the definition of an agreed-upon integration platform that
allows them all to integrate appropriately.</p>
    </sec>
    <sec id="sec-5">
      <title>Outlook on Research Resource Sharing</title>
      <p>As we envisioned an ecosystem that includes, but is not limited to, modules for
extraction, sharing and benchmarking, two research questions require
investigation in the immediate future.</p>
      <p>First, how can we design and publish new resources for such an ecosystem?
Do they exist already? It is important to understand what else is already
available. Researchers commonly support their scientific studies with resources that
can benefit the whole community, if released. The release process must comply
with a scientific method that ensures repeatability and reproducibility. However,
a standard, agreed-upon methodology that guides this process does not exist yet.</p>
      <p>Second, how should we combine these resources towards shared research
workflows? To investigate this research question, we need a platform that enables
researchers to deploy their resources and interact with the ecosystem. Therefore,
we call for an open discussion about how this integration should be done.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ackoff</surname>
            ,
            <given-names>R.L.</given-names>
          </string-name>
          :
          <article-title>From data to wisdom</article-title>
          .
          <source>Journal of Applied Systems Analysis</source>
          <volume>16</volume>
          (
          <issue>1</issue>
          ),
          <fpage>3</fpage>
          –
          <lpage>9</lpage>
          (
          <year>1989</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Brambilla</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ceri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daniel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valle</surname>
            ,
            <given-names>E.D.</given-names>
          </string-name>
          :
          <article-title>On the quest for changing knowledge</article-title>
          .
          <source>In: Proceedings of the Workshop on Data-Driven Innovation on the Web - DDI '16</source>
          . ACM Press (
          <year>2016</year>
          ), https://doi.org/10.1145/2911187.2914582
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Brambilla</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ceri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valle</surname>
            ,
            <given-names>E.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Volonterio</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salazar</surname>
            ,
            <given-names>F.X.A.</given-names>
          </string-name>
          :
          <article-title>Extracting Emerging Knowledge from Social Media</article-title>
          .
          <source>In: Proceedings of the 26th International Conference on World Wide Web - WWW '17</source>
          . ACM Press (
          <year>2017</year>
          ), https://doi.org/10.1145/3038912.3052697
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mauri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calbimonte</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dell'Aglio</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balduini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brambilla</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valle</surname>
            ,
            <given-names>E.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aberer</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>TripleWave: Spreading RDF Streams on the Web</article-title>
          .
          <source>In: Lecture Notes in Computer Science</source>
          , pp.
          <fpage>140</fpage>
          –
          <lpage>149</lpage>
          . Springer International Publishing (
          <year>2016</year>
          ), https://doi.org/10.1007/978-3-319-46547-0_15
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Stieglitz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Xuan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bruns</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neuberger</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Social Media Analytics</article-title>
          .
          <source>Business &amp; Information Systems Engineering</source>
          <volume>6</volume>
          (
          <issue>2</issue>
          ),
          <fpage>89</fpage>
          –
          <lpage>96</lpage>
          (Feb
          <year>2014</year>
          ), https://doi.org/10.1007/s12599-014-0315-7
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>