Efficient Search and Browsing of CSV Datasets

          Agata Filipowska1,2 , Krzysztof W˛ecel1,2 , and Dominik Filipiak1,2
                            1
                             Department of Information Systems
                      Faculty of Informatics and Electronic Economy
                             Poznań University of Economics
                   a.filipowska; k.wecel@kie.ue.poznan.pl
                      2
                        Instytut Informatyki Gospodarczej Sp. z o.o.
                        ul. Rubiez 12G/6, 61-612 Poznań, Poland


       Abstract. The paper presents an application developed within the FP7 LOD2
       project that supports efficient search and browsing of CSV files. The applica-
       tion indexes CSV files using DBpedia categories and sophisticated strategies for
       identification of the best descriptors. The paper introduces the application as well
       as presents its usage scenario. To the best of our knowledge no similar solution
       exists.


1     Introduction

The aim of the paper is to present the CSV Browser Application that may be used for
cataloguing CSV datasets that may be found on the Web.
    The research goal was to develop a solution for efficient search and browsing of
various datasets. Efficient in this case means, available through one platform, properly
indexed and easy to find. To fulfil this goal, we have developed a Java Web Application,
which reads CSV files and produces DBpedia-based annotations (indices) that are used,
while browsing resources with the Web application.
    The paper is structured as follows. Section 2 presents the CSV Browser Application
and describes how the tool works. Section 3 provides the usage example. Section 4
presents conclusions and outlines the future work.


2     CSV Browser Application

2.1   Data Sources

The application may catalogue datasets from every data source storing CSV files. In our
experiments, we processed datasets that are available at http://publicdata.eu.


2.2   How the application works?

The processing of a CSV file, in order to obtain annotations (indices), that are used,
while browsing datasets, is performed in the following steps:
                                          Posters & Demos Track @ SEMANTiCS2014          7

 1. An initial CSV file (retrieved from a source) is processed using a simple heuristics.
    The goal is to distinguish between columns storing the data and headers of these
    columns. We assume that columns’ headers should be on the top, the row headers
    (identifiers) should be included on the left side of the table with the data. Anything,
    but number, “N/A” or “-” is a field that describes a column or a row.
    The application reads the CSV file and sends headers together with the data from
    the first 100 rows to the DBpedia Spotlight API3 to annotate them with matching
    resources from DBpedia.
 2. For each resource retrieved, using SPARQL we retrieve its categories from DBpe-
    dia, then categories of these categories, etc. (up to three levels of hierarchy). Using
    the retrieved data, we create a graph of concepts, e.g. Figure 1 presents a graph
    retrieved for concepts:
    http://dbpedia.org/resource/Lyon, http://dbpedia.org/resource/Hamburg,
    http://dbpedia.org/resource/Rome, http://dbpedia.org/resource/Milan,
    http://dbpedia.org/resource/Berlin, http://dbpedia.org/resource/Paris.
 3. The next step concerns a reduction of a set of categories. The categories bringing
    little to a description of a dataset such as e.g. Main_topic_classification
    are removed using the blacklist4 . Then, we check, which from sub-graphs retrieved
    based on relations between DBpedia categories is the largest (taking into account
    the number of nodes) and we remove other sub-graphs from further analysis. At
    this stage, the categories’ graph describing the dataset is created.
 4. Having a categories’ graph, it is possible to apply algorithms known from the graph
    theory, mostly measuring the nodes centrality, but also PageRank and HITS, in
    order to define a leading category that best describes the content of a CSV file. In
    the prototype we used the Focused Betweenness Centrality measure. This way the
    importance (weight) of specific categories (nodes) as descriptors of a CSV resource
    is quantified.

    As a result, each CSV file gets indices enabling search and browsing of available
CSV resources using the DBpedia categories. The screenshot of the CSV Browser App
is presented in the Figure 2.
    The App menu contains three elements:

 – Category Search – a field where one may put a DBpedia category in order to find a
   dataset related to it (auto-completion is enabled),
 – Matching Algorithm – which refers to the category matching strategy applied in
   step 4 of the above mentioned procedure,
 – DBpedia Category Tree – enabling browsing datasets using the tree-like menu. It is
   important to note that the menu is dynamic, as the category structure in DBpedia is
   a lattice, what means that after choosing a subcategory of a category, one gets new
   supercategories of a chosen category.

   The main part of the screen presents a table with all datasets for a given category.
Each dataset is described using its original name and presenting the column, which
 3
     http://spotlight.dbpedia.org
 4
     http://uimr.deri.ie/sites/StopUris
8       Filipowska et al.


           Fig. 1: A graph of related concepts (after performing step no. 2).
                                Source: Gephi screenshot


                            Fig. 2: CSV Browser Application.
                                   Source: screenshot


stores data based on which a certain category has been assigned as a dataset index. It is
also possible to download a dataset or view it within the application.
                                     Posters & Demos Track @ SEMANTiCS2014            9

3   Usage example

The application usage example is presented in the Figure 3. It presents search results
for the category London. It may be noticed that there are 7 different datasets available
for this category. The datasets concern such categories as e.g. Areas_of_London or
London_boroughs. The tree-like menu also changed – was adjusted to the relations of
the London category to other categories in DBpedia.


                  Fig. 3: CSV Browser Application Usage Example.
                                   Source: screenshot


4   Conclusions and Future Work
The paper presents demonstration of the application enabling indexing of CSV datasets,
that makes searching for these datasets as well as browsing them more efficient. Effi-
cient in this case means, available through one platform, properly indexed and easy to
find. Each CSV file is indexed not taking into account a description of the dataset, but
its content. We have also developed a version of this application to process the Open
Data Support datasets. In this case however, we processed only titles and descriptions
of the datasets.
     Next steps concern assessing an accuracy of the indices created, integration with
CKAN as well as putting the indexing component in the pipeline of CSV to RDF trans-
lation.

Acknowledgement This work was supported by a grant from the EU 7th Framework
Programme for the project LOD2 Creating Knowledge out of Interlinked Data (GA no.
288176).