<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Position Paper: Dataset profiling for un-Linked Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emilia Kacprzak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Koesten</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tom Heath</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jeni Tennison</string-name>
          <email>jeni.tennison@theodi.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The Open Data Institute</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Southampton</institution>
          ,
          <addr-line>Southampton</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The vast amount of data on the web creates a growing need to advance data search. Rich and meaningful metadata can enhance the discovery of datasets and establish connections between them. Where metadata is not comprehensive, it can be expanded through dataset profiling. The relative importance of different types of profiles varies depending on the user's context and the objective of the task. We discuss an approach to find un-Linked datasets and increase result relevance by offering related information. We propose generating rich profiles for datasets, counting the number and strength of relations between them, and showing a graph of profiles that represents connections between different datasets. We can thereby capture correlations between datasets that can then improve the efficiency and effectiveness of data search. If developed further, this would improve the discoverability and reusability of datasets.</p>
      </abstract>
      <kwd-group>
        <kwd>Profiles</kwd>
        <kwd>Metadata</kwd>
        <kwd>Data search</kwd>
        <kwd>Discoverability</kwd>
        <kwd>Dataset</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The availability of data is growing rapidly and with it, big data technologies and
services. The vast amount of information available presents unique challenges
around data integration, data ownership or assuring data quality to users [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
One of the key challenges is the hidden character of many published datasets.
This can be ascribed to a number of factors, one of which is the lack of useful
metadata. Data search techniques are strongly dependent on adequate
information about the datasets [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Searching is based on meaningfully indexed content,
hence profiles are needed that describe the data in a manner that is most useful
for discoverability.
      </p>
      <p>
        Value derived from data is generated to a great extent through
understanding relations between different datasets and between the entities they describe.
Most techniques for extracting information from datasets rely on the
statistical characteristics of the data or do not scale well [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Improving the quality
and comprehensiveness of profiles can enable the discovery of datasets as well
as the development of links between them, be they Linked or un-Linked
data.
      </p>
      <p>The remainder of this paper is structured as follows. Section 2 describes
related work that influenced the concept presented in section 4. Section 3 gives
a rationale for the focus on un-Linked data and presents use scenarios.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <sec id="sec-2-1">
        <title>Profiling techniques</title>
        <p>
          Data profiling comprises a broad range of methods to efficiently describe datasets
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Analysis of individual cells or columns is used to create summaries of data
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Relationships between values in the data are described as association rules,
which have been suggested as additional profiling activities.
        </p>
        <p>
          Naumann proposes joining two datasets based on their profiles, similar to
a reverse engineering process which reveals possible relations between datasets
based on constraints identified in the data [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Cell-level and column-level
analysis enables effective schema matching between datasets. Schemas that are initially
unrelated can thereby be mapped and semantically correct correspondences can
be revealed [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Integration of seemingly unconnected datasets can provide
additional insights on the original data.
        </p>
        <p>
          These profiling techniques are strongly focused on numerical data and
creating associations between dataset columns. On the other side of the spectrum,
Linked Data is used to improve profiling processes. Fetahu et al. propose
building a graph of linked datasets including the relative importance of each node,
each of which is an openly available dataset. This is done with the objective of
improving profiling accuracy [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. To this end, they developed an algorithm which
adapts PageRank, K-Step Markov, and HITS models along with topical
profiles generated with DBpedia. Profiles are enriched by the use of Linked Data
(ibid.). Our approach is complementary to this, attempting to create links
between datasets that are not already linked by use of their profiles, as can be
seen in section 4.
        </p>
        <p>
          Users should be involved in the process of searching for data and richer
profiles can support that search process. Ellefi et al. emphasise the usefulness of
explicit semantic information for the dataset discovery task to support the
interlinking of datasets [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Wagner et al. present an approach to improve Linked
Data search by entity-based data-source contextualization. They propose
involving the user in the dataset selection process and place emphasis on the
importance of not only offering exactly matching sources, but also offering
supplementary information from additional sources to provide context. Additional sources
are defined as being relevant to the user's information need, rather than to the
exact query. The framework proposed in section 4 follows a similar objective,
but is not limited to Linked Data.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Types of profiles</title>
        <p>
          The most basic type of profile is the structural profile, for which no external
resource analysis is needed. These are created by reporting the number of columns
and rows and the columns' data types, along with information about uniqueness and
completeness of the data [
          <xref ref-type="bibr" rid="ref3 ref4">4, 3</xref>
          ]. This leads to explicit single value analysis which is
then often defined as a numerical or statistical profile. To generate additional
insights, topical profiles provide connections to different resources, for example
DBpedia, by evaluating the dataset's topic coverage [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Spatial profiles describe
geographical data [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. If the geographic position can be expressed as a place, it
can be used in the same way as topical profiles.
        </p>
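        <p>As a minimal sketch of what a structural profile could contain, the following Python code reports row and column counts together with a coarse type, completeness and uniqueness for each column of a CSV table. The function names, the type labels and the toy table are assumptions made for illustration; they are not taken from the cited surveys.</p>
        <preformat preformat-type="code">
```python
import csv
import io


def infer_type(values):
    """Guess a coarse type label for a list of non-empty strings."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if not values:
        return "empty"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "string"


def structural_profile(csv_text):
    """Report row and column counts plus type, completeness and uniqueness per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    profile = {"n_rows": len(rows), "n_columns": len(columns), "columns": {}}
    for name in columns:
        values = [row[name] for row in rows]
        # Empty cells (and cells missing from short rows) count as absent.
        present = [v for v in values if v not in ("", None)]
        profile["columns"][name] = {
            "type": infer_type(present),
            "completeness": len(present) / len(values) if values else 0.0,
            "uniqueness": len(set(present)) / len(present) if present else 0.0,
        }
    return profile


# Toy table with made-up values, for illustration only.
SAMPLE = "city,year,count\nLondon,1995,100\nLondon,2005,200\nLeeds,1995,\n"
profile = structural_profile(SAMPLE)
print(profile["columns"]["count"])
```
        </preformat>
        <p>In this toy table the count column is complete for two of three rows, which a profile of this kind would make visible without opening the dataset.</p>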
        <p>Although these different types of profiles exist, they were created to solve
particular problems and are rarely used together. If combined meaningfully, these
profiles could enhance discoverability of datasets even further. Section 4 describes
how the availability of richer profiles could be used to understand connections
between datasets.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Finding connections in un-linked data</title>
      <sec id="sec-3-1">
        <title>Why un-linked data?</title>
        <p>
          Existing research on techniques for understanding the relationships between
datasets, such as [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], is focused on Linked Data. However, most data
that can be found on data portals is not published as Linked Data. An
analysis of open data available on data.gov.uk shows that, of all datasets
published under an Open Government Licence, fewer than 2% are in RDF (213 of
12,470 open datasets, 1.7%). This number excludes
non-machine-interpretable formats such as PDF and HTML [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Hence, we argue
that methods contributing to the improvement of profiles for all data formats
enable discoverability and ultimately linking of data.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Use scenarios</title>
        <p>We focus on three use scenarios that address different user needs. These
serve as high-level examples of the variety of issues users can encounter
when attempting to find datasets. Apart from the type of data, we note that the
context and the individual objective when searching for datasets influence the
relevance of a specific type of profile for a specific task.</p>
        <p>Scenario A: Posing a question that can be answered by consulting a single
dataset. This scenario implies limited time, and there should be no, or very
limited, cost associated. Accuracy and licensing can be of varying importance.
An example would be looking for the closest free wi-fi to the user's current location.
The profile should indicate whether or not the dataset contains
data about the user's location and how up to date it is.</p>
        <p>Scenario B: Looking for datasets to compare for relevance and quality to be
integrated into an application. Time and cost can vary in importance. Here,
information about granularity and reusability (licensing, format) of the data
can be more important. For example, if a user wants to evaluate the air quality
in London in 1995 in comparison to today and the metadata mentions the years
2000 to 2005, a naive search for the year 2004 might not provide results.</p>
        <p>Scenario C: Comparison of as many available datasets as possible on a topic
for research purposes. Here, more time is available and cost is less of an issue.
Archived data can be important; coverage and granularity are selection criteria.
For example, a user may want to compare all datasets available on a specific city
to evaluate which areas to support financially.</p>
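        <p>The temporal example in Scenario B can be illustrated with a short Python sketch that contrasts a plain text search with a lookup against a temporal coverage recorded in a profile. The metadata string, the coverage tuple and both function names are hypothetical and serve only to show the difference.</p>
        <preformat preformat-type="code">
```python
def naive_match(metadata_text, year):
    """Plain text search: the year must literally appear in the metadata."""
    return str(year) in metadata_text


def coverage_match(coverage, year):
    """Profile-based search: the year must fall inside the covered range."""
    start, end = coverage
    return year in range(start, end + 1)


metadata_text = "Air quality measurements, London, 2000 to 2005"  # made-up description
coverage = (2000, 2005)  # temporal coverage that a profile could record

print(naive_match(metadata_text, 2004))   # False
print(coverage_match(coverage, 2004))     # True
```
        </preformat>
        <p>The text search misses the dataset because 2004 is never written out, whereas the range stored in the profile covers it.</p>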
        <p>What is evident from these scenarios is that the search process and source
selection may vary with the context of the user and the objective of the task.
Furthermore, the presentation of results and the way different users might want
to interact with those results also depend on these factors. The
approach proposed in section 4 serves as an initial step to resolve the issues described in the
above scenarios, by allowing user interaction in the process of source selection.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Connecting un-linked data</title>
      <p>
        As we have discussed, structural and statistical profiles can enhance the discovery
of relations between datasets. Knowing the general structure of the dataset with
its data types and value ranges can be the first step in narrowing down the
number of columns that can relate to each other. This can be used to create
actual links between unlinked datasets, e.g. by creating foreign keys between
them [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
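      <p>One way to narrow down related columns, sketched here under simplifying assumptions, is to keep only column pairs with the same value types and then measure how many distinct values of one column also occur in a key-like column of the other table. The containment threshold, the function name and the toy tables are illustrative choices and not a method prescribed by the cited survey.</p>
      <preformat preformat-type="code">
```python
def candidate_links(table_a, table_b, min_containment=0.9):
    """Suggest (column_a, column_b, containment) pairs that could act as foreign keys.

    Each table is a dict mapping a column name to its list of values.
    """
    candidates = []
    for name_a, values_a in table_a.items():
        distinct_a = set(values_a)
        if not distinct_a:
            continue
        for name_b, values_b in table_b.items():
            distinct_b = set(values_b)
            # The referenced column must be key-like: no repeated values.
            if len(distinct_b) != len(values_b):
                continue
            # Structural filter: only compare columns holding the same value types.
            if {type(v) for v in distinct_a} != {type(v) for v in distinct_b}:
                continue
            containment = len(distinct_a.intersection(distinct_b)) / len(distinct_a)
            if containment >= min_containment:
                candidates.append((name_a, name_b, containment))
    return sorted(candidates, key=lambda c: c[2], reverse=True)


# Two made-up tables: hotspot counts per area code, and a lookup of area codes.
wifi = {"area_code": ["E01", "E02", "E01", "E03"], "hotspots": [4, 2, 7, 1]}
areas = {"code": ["E01", "E02", "E03", "E04"], "size_km2": [10, 20, 30, 40]}

links = candidate_links(wifi, areas)
print(links)
```
      </preformat>
      <p>For these toy tables the only suggestion is that area_code could reference code; a real system would also need to handle type coercion, value ranges and scale.</p>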
      <p>Other types of profiles can be used for linking datasets in a broader, less strict
sense. Topical and spatial summaries are descriptions of datasets on a semantic
level. Similar topics, coverage or data patterns can indicate relationships between
datasets that could inform search.</p>
      <p>We propose to build a weighted graph based on the profiles of un-Linked data.
By generating rich profiles for datasets and counting the number and strength of
relations between them, correlations between datasets could be captured, which
can then improve the efficiency and effectiveness of search within open data
portals such as data.gov.uk.</p>
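      <p>A weighted graph of this kind could be sketched as follows. The profile fields (topics, places, years), the use of Jaccard similarity and year overlap as relation strengths, and the unweighted sum are all assumptions chosen for illustration; the proposal itself does not fix these measures.</p>
      <preformat preformat-type="code">
```python
from itertools import combinations


def jaccard(a, b):
    """Share of common items between two sets (0.0 when both are empty)."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def year_overlap(a, b):
    """Fraction of the combined (inclusive) year span that both ranges cover."""
    shared = min(a[1], b[1]) - max(a[0], b[0]) + 1
    total = max(a[1], b[1]) - min(a[0], b[0]) + 1
    return max(shared, 0) / total


def build_graph(profiles):
    """Return {(id_a, id_b): edge} for every pair of profiles sharing a relation."""
    graph = {}
    for (id_a, a), (id_b, b) in combinations(profiles.items(), 2):
        relations = {
            "topical": jaccard(a["topics"], b["topics"]),
            "spatial": jaccard(a["places"], b["places"]),
            "temporal": year_overlap(a["years"], b["years"]),
        }
        # Keep only relation types that actually connect the two profiles.
        relations = {kind: s for kind, s in relations.items() if s > 0}
        if relations:
            graph[(id_a, id_b)] = {
                "count": len(relations),              # number of relations
                "strength": sum(relations.values()),  # unweighted sum of strengths
                "relations": relations,               # type of each connection
            }
    return graph


# Hypothetical profiles; identifiers and values are made up.
PROFILES = {
    "D1": {"topics": {"population", "census"}, "places": {"London"}, "years": (1991, 2001)},
    "D2": {"topics": {"population", "housing"}, "places": {"London"}, "years": (1995, 2005)},
    "D3": {"topics": {"air quality"}, "places": {"Leeds"}, "years": (2010, 2015)},
}
graph = build_graph(PROFILES)
for pair, edge in graph.items():
    print(pair, edge["count"], round(edge["strength"], 2))
```
      </preformat>
      <p>With these toy profiles only D1 and D2 are connected, through topical, spatial and temporal relations; D3 shares nothing with either and stays isolated. Per-relation weights set by the user could be multiplied into the sum.</p>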
      <p>
        In accordance with Wagner et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], we propose offering a broad range of
results including contextual information, as well as the possibility of user
involvement during source selection. The search process would be based on the number and
strength of connections between the profiles; after an initial search, the weighting
of these individual connections could be influenced by the user.
      </p>
      <p>In the first step, based on a query, keyword extraction can be used to generate
ranked keywords which, as well as informing an initial search, are presented to
the user in order of importance. The user can interact with the list of keywords,
by swapping them with each other, to achieve more relevant results.</p>
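      <p>The interaction with ranked keywords might look like the following sketch. It assumes that the ranked keyword list is already available, that a keyword's weight falls linearly with its position, and that datasets matching more keywords come first; these are illustrative choices, and all names and profile terms are hypothetical.</p>
      <preformat preformat-type="code">
```python
def rank_weights(keywords):
    """Position-based weights: the first keyword counts most (an assumption)."""
    n = len(keywords)
    return {kw: (n - i) / n for i, kw in enumerate(keywords)}


def swap(keywords, i, j):
    """Return a copy of the ranked list with positions i and j exchanged."""
    reordered = list(keywords)
    reordered[i], reordered[j] = reordered[j], reordered[i]
    return reordered


def rank_datasets(keywords, profiles):
    """Order dataset ids by (number of matched keywords, summed keyword weight)."""
    weights = rank_weights(keywords)
    scores = {}
    for dataset_id, terms in profiles.items():
        matched = [kw for kw in keywords if kw in terms]
        if matched:
            scores[dataset_id] = (len(matched), sum(weights[kw] for kw in matched))
    return sorted(scores, key=scores.get, reverse=True)


# Made-up profile terms for four hypothetical datasets.
profiles = {
    "D1": {"population", "london", "1995"},
    "D2": {"population", "london"},
    "D3": {"london", "1995"},
    "D4": {"transport"},
}
keywords = ["population", "london", "1995"]

ranked = rank_datasets(keywords, profiles)
reranked = rank_datasets(swap(keywords, 0, 2), profiles)
print(ranked)
print(reranked)
```
      </preformat>
      <p>Swapping the first and last keyword moves D3 ahead of D2, while D1, which matches every keyword, stays on top; D4 matches nothing and is left out.</p>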
      <p>The relevance of the keywords to the user influences the subset of datasets
that appear within this graph, since not all keywords are equally important to the
user. Results that match all keywords are of higher importance; datasets
that match only a few would be included if there were not enough results.
Fig. 1(a) shows an example query about the population of London in 1995 and
the datasets that best matched the query. It also illustrates the strength
of the connections between keywords and the profiles of the suggested datasets.
This increases the end user's awareness of the dataset content before opening it.</p>
      <p>As a second step, a graph of connections between related datasets is presented,
which is based on one dataset chosen from the results in Fig. 1(a). The weighting
of connections between datasets is based on the similarity of their profiles
and on the chosen datasets that matched the keywords from the original query,
taking into account the user's priorities. These initially identified datasets serve
as recommendations and aid exploration of related information.</p>
      <p>Fig. 1(b) presents results after the user chose the dataset D1 as the most
relevant for their information need. To improve the search process, other profiles
which have strong connections to the profile of D1 are presented along with
information about the type of the connections between them.</p>
      <p>
        The graph should present an adjustable number of results in a graphical
visualisation. The argument for the graph is to deepen the user's understanding
and the mental model of the search process. The availability of relevant data
depends on the topic as well as the specificity of the search. Hence the number,
the availability and the similarity of the datasets presented as results can vary
significantly. Data profiling is naturally a user-oriented task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and we argue
that, following on from that, the user should be involved in source selection to
provide more advanced data search.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and future work</title>
      <p>Profile features vary in their importance depending on the user's context and
objective. As the use scenarios in section 3 show, different factors can be
important for users. Regarding interaction with the system, these include
time, cost, computational power and trust. Beyond that, users have varying
requirements for the type of information covered in metadata, depending on their task
and context.</p>
      <p>The approach described here would improve discoverability and aid all of the
mentioned use scenarios, as the user can be involved in the process of weighting
the results. This can decrease the importance of the accuracy of ranking
algorithms and enhance the quality of results according to one's individual needs.
The possibility of detecting similarities between profiles provides the basis for
explorative search, engaging the user in source selection. To further develop this
approach and validate its applicability, we aim to develop profiles that support the
identification of connections between datasets and topics, and between datasets
and other datasets. Following from that, we aim to develop interfaces, such as
user-dependent result presentations, that support explorative data search. As the
proposed graph is created on an abstract level and the connections are formed
based on the datasets' profiles, the approach is not limited to one type of data but
can be used for Linked Data and un-Linked data at the same time, as the focus
is on improving the user experience.</p>
      <p>Copyright held by the authors</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Nasser</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tariq</surname>
            ,
            <given-names>R. S.</given-names>
          </string-name>
          :
          <article-title>Big Data Challenges</article-title>
          .
          <source>J. Comput. Eng. Inf. Technol</source>
          . vol.
          <volume>4</volume>
          :
          <issue>3</issue>
          ,
          <issue>1000135</issue>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Assaf</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troncy</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senart</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Roomba: An extensible framework to validate and build dataset profiles</article-title>
          .
          <source>In: The Semantic Web: ESWC 2015 Satellite Events</source>
          , pp.
          <fpage>325</fpage>
          -
          <lpage>339</lpage>
          . Springer International Publishing (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Data profiling revisited</article-title>
          .
          <source>ACM SIGMOD Record</source>
          <volume>42</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>49</lpage>
          . (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Abedjan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Golab</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Profiling relational data: a survey</article-title>
          .
          <source>The VLDB Journal</source>
          , vol
          <volume>24</volume>
          (
          <issue>4</issue>
          ),
          <fpage>125</fpage>
          . (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Rahm</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          :
          <article-title>A survey of approaches to automatic schema matching</article-title>
          .
          <source>The VLDB Journal</source>
          <volume>10</volume>
          :
          <fpage>334</fpage>
          -
          <lpage>350</lpage>
          . (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fetahu</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dietze</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nunes</surname>
            <given-names>B.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casanova</surname>
            <given-names>M. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taibi</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nejdl</surname>
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles</article-title>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ellefi</surname>
            ,
            <given-names>M. B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bellahsene</surname>
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scharffe</surname>
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Todorov</surname>
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Towards Semantic Dataset Profiling</article-title>
          .
          <source>In: PROFILES@ESWC</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Shekhar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Celik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>George</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levine</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>R. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohanty</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Spatial analysis of crime report datasets</article-title>
          .
          <source>National Science Foundation (NSF)</source>
          , Washington DC (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Wagner</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haase</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rettinger</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lamm</surname>
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Entity-based data source contextualization for searching the Web of data</article-title>
          .
          <source>In: The Semantic Web: ESWC 2014 Satellite Events</source>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>41</lpage>
          . Springer International Publishing (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Boehm</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lorey</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumann</surname>
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Creating voiD descriptions for Web-scale data</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>9</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>339</fpage>
          -
          <lpage>345</lpage>
          . (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          Data.gov.uk:
          <source>Beta: Datasets</source>
          . 15/03/16: https://data.gov.uk
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>