Proceedings of the 3rd International Workshop on Semantic Digital Archives (SDA 2013)


                    Semantic Retrieval Interface
                    for Statistical Research Data

                            Daniel Bahls, Klaus Tochtermann

           Leibniz Information Centre for Economics (ZBW), Kiel, Germany


        Abstract. Statistical research data is the foundation for empirical
        studies. Researchers in economics and the social sciences often obtain
        such data from external sources through specially designed retrieval
        interfaces offered by statistical offices, commercial data providers,
        data agencies and other archives. With the advancements in data
        cataloguing and the acquisition of long tail research data sets from
        individual scientists and institutes, there is now an opportunity to
        establish central services for a more holistic data search. In view of
        the rapid increase in the amount of available data and the retrieval
        problem that comes with it, retrieval interfaces must make effective
        use of the provided metadata in order to help users find relevant data
        sets efficiently.
        This paper presents a multi-step retrieval interface that aims to
        support the researchers’ natural approach to data search and
        composition. Starting with an idea of the concepts that are to be
        compared, users begin their search with thesaurus terms and
        successively specify requirements according to their priorities until
        suitable data can be selected easily from a manageable number of
        matching data sets. The prototype presented in this paper also provides
        means for convenient data harmonization, which is essential especially
        when combining statistical data from different sources.

        Keywords: Research Data Management, Semantic Digital Data Library,
        Linked Data, Statistics, Data Retrieval


1     Introduction

A significant number of scientific results are based on research data, since
research has become increasingly data-driven over the years [1]. Understanding
such publications in depth therefore requires documentation of the underlying
data. To provide transparency and ultimately enable replicability, the data
sets themselves must also be available, which requires a reliable
infrastructure. Scientific data needs to be maintained and organized in
archives.
    With the advancement of computer technology, scientific analyses are
increasingly carried out with the aid of machines, which allows large amounts
of data to be processed in a short amount of time, something that was not
possible before. While this is certainly one reason why science has become
significantly data-driven, it also means that most scientific data is already
maintained in digital form. This circumstance and the rise of the Web open up
possibilities for a powerful information infrastructure that supports the
aforementioned goals. Information resources can nowadays be delivered to any
place in the world within seconds, laying the ground for delivering the right
information to the right place at the right time, the precept of knowledge
management.
    The Web, together with its well-established Web 2.0 technologies, has
already been recognized as a powerful medium for promoting efficient exchange
and advancement in the scientific domain. In this regard, the Leibniz
Association has recently started the research alliance Science 2.0¹, which
currently comprises 30 associated institutes and is growing, to jointly
venture into a well-organized and integrated environment of Web-based tools
and services that supports rapid exchange and good scientific practice in the
scientific community.
    The vision of a well-designed research data infrastructure fits well into
this theme, and many initiatives have formed in recent years, amounting to a
whole movement to effectively enable the exchange, citation and preservation
of research data. However, this task has proven non-trivial, as it has opened
up extensive discussions on metadata schemes², organized preservation and
curation [2], responsibilities [3], data publication policies [4], and
solutions to overcome issues of data protection and usage rights, to mention
only a few. Yet, these efforts have already led to significant advancements
(TheDataHub³, DataCite⁴, and others).
    At present, efforts are being made to treat research data as bibliographic
artifacts for re-use, transparency and citation [5]. In view of the rapid
increase in the amount of available data and the retrieval problem that comes
with it, retrieval interfaces must make effective use of the provided metadata
in order to help users find relevant data sets efficiently.
    In this paper, we investigate how Semantic Web technologies can be used to
provide an efficient and novel approach to the retrieval of statistical data
sets, one that follows the natural way in which data is searched for in the
domain of statistics, particularly in the context of economics and the social
sciences. Section 2 elaborates on the practice of data acquisition in
empirical research to gain a clear picture of the purpose of our system.
Section 3 discusses related work, and Section 4 explains fundamental design
decisions and outlines the system architecture. Section 5 describes the user
interface itself and how the declared goals have been implemented as features.
Section 6 closes the paper with conclusions and an outlook.


¹ http://www.leibniz-science20.de
² particularly important, as in contrast to textual publications, data cannot
  be understood without documentation
³ http://datahub.io/
⁴ http://www.datacite.org/



2     Retrieving Statistical Data
In many cases, empirical researchers in economics and the social sciences need
to compile statistical indicators into large data tables. Typically, each
column represents one indicator, while the rows hold the respective data per
year, country or another so-called dimension. The data itself may be
self-produced through studies and surveys, acquired from external sources such
as statistical offices or affiliated institutes, or purchased from commercial
data providers. Common practice, however, is to combine several sources, since
some indicators may be obtained from one source while the data for other
indicators comes from another. In doing so, researchers have to take extra care
that the respective data represents the same or a sufficiently similar
statistical population.
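
The table layout described above can be illustrated with a minimal sketch. It
assumes that each record carries an indicator name, a source, and the two
dimensions country and year; the record layout, the function names and all
values are made up for illustration and are not taken from the prototype.

```python
from collections import defaultdict

# Hypothetical records: (indicator, source, country, year, value).
# The values are placeholders, not real statistics.
records = [
    ("unemployment_rate", "source_a", "DE", 2010, 1.0),
    ("unemployment_rate", "source_a", "DE", 2011, 2.0),
    ("gdp_growth", "source_b", "DE", 2010, 3.0),
    ("gdp_growth", "source_b", "DE", 2012, 4.0),
]


def build_table(records):
    """Return {(country, year): {indicator: value}}, one row per dimension key."""
    table = defaultdict(dict)
    for indicator, _source, country, year, value in records:
        table[(country, year)][indicator] = value
    return dict(table)


def incomplete_rows(table, indicators):
    """List the dimension keys for which at least one indicator has no value."""
    return sorted(
        key for key, row in table.items()
        if any(indicator not in row for indicator in indicators)
    )


table = build_table(records)
gaps = incomplete_rows(table, ["unemployment_rate", "gdp_growth"])
```

Here `gaps` lists the rows that only one of the two sources can fill, which is
where a mismatch between sources first becomes visible.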
    To understand the purpose of the system clearly, we conducted interviews
with economists, which gave us insights into their work with research data.
Empirical researchers typically start out with an idea of the concepts
relevant to their research (e.g. living standards, work conditions, economic
growth). In addition, they have further details in mind, for instance
reference periods, regions to be included and distinguished, or the frequency
of data acquisition in the case of time series data. The resulting data set
should be as consistent as possible with respect to acquisition method,
statistical universe and adjustments. To achieve user acceptance, the system
has to be practical in research settings [6], and we therefore aim to support
this data harmonization procedure in a light-weight manner.
    Consequently, the interaction with the user should follow the steps below:
 1. Prompt for a list of concepts that are to be compared

 2. Let the user specify additional requirements on the data

 3. Let the user explore and select matching data sets, allowing a return to
    Step 2

 4. Offer the selected data for download


    After Step 1 is finished, the data sets associated with the named concepts
should be presented to the user. The specification of additional requirements
should be based on the metadata available for the data sets found. As soon as
all relevant requirements are given, the user may inspect the data sets that
satisfy them, decide which ones to use, and finally proceed to download.
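
The four steps can be sketched as a small session object. This is only an
illustration under simplifying assumptions: the catalogue is an in-memory list,
each data set has a flat property-to-value metadata map, and the class and
method names (`RetrievalSession`, `require`, `relax`) are hypothetical rather
than part of the prototype.

```python
# Hypothetical catalogue; names, concepts and metadata are placeholders.
CATALOG = [
    {"name": "ds1", "concepts": {"unemployment"},
     "metadata": {"frequency": "annual", "area": "DE"}},
    {"name": "ds2", "concepts": {"unemployment"},
     "metadata": {"frequency": "monthly", "area": "DE"}},
    {"name": "ds3", "concepts": {"economic growth"},
     "metadata": {"frequency": "annual", "area": "DE"}},
]


class RetrievalSession:
    def __init__(self, catalog):
        self.catalog = catalog
        self.requirements = {}  # concept -> {property: required value}

    def add_concept(self, concept):
        """Step 1: register a concept that is to be compared."""
        self.requirements.setdefault(concept, {})

    def require(self, concept, prop, value):
        """Step 2: add a requirement for the data sets of one concept."""
        self.requirements[concept][prop] = value

    def relax(self, concept, prop):
        """Revisit Step 2: drop a requirement again."""
        self.requirements[concept].pop(prop, None)

    def candidates(self, concept):
        """Step 3: names of the data sets that match concept and requirements."""
        wanted = self.requirements[concept]
        return [
            d["name"] for d in self.catalog
            if concept in d["concepts"]
            and all(d["metadata"].get(p) == v for p, v in wanted.items())
        ]


session = RetrievalSession(CATALOG)
session.add_concept("unemployment")
before = session.candidates("unemployment")
session.require("unemployment", "frequency", "annual")
after = session.candidates("unemployment")
```

Keeping the requirements per concept makes it cheap to return from Step 3 to
Step 2, since candidates are always recomputed from the current requirements.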


3     Related Work
There are many repositories on the Web that provide statistical data. Some of
them are provided by statistical offices and data agencies (e.g. Federal
Statistical Office of Germany⁵, EuroStat⁶, World Bank⁷), some are associated
with commercial providers (e.g. Thomson Reuters Datastream⁸, Statista⁹), and
yet others are maintained by journals, archives, libraries or independent
organizations (e.g. GESIS¹⁰, The Data Hub¹¹, Dataverse repository of
Economists Online¹²). These portals are as heterogeneous as the kind and
spectrum of data they provide. Some of them offer interfaces for composing
customized data tables, where users pick indicators and data records according
to their needs. Such features are also provided by the Nesstar system¹³, one
of the most prominent systems for data publishing and online analysis, which
is used by a large number of institutes. The Social Science Variables Database
at ICPSR¹⁴ allows for direct comparison of indicators with respect to a
variety of metadata, giving intuitive means to understand differences between
data sets in universe, acquisition method and other characteristics. However,
users of these systems have to run keyword-based queries and browse through
category trees in order to find relevant data sets individually. Our approach
therefore follows a different paradigm, as presented in Section 2.
    Technical challenges in dealing with distributed sources and applying the
OLAP paradigm to the retrieval of statistical data from the Linked Data cloud
have been addressed in [7]. We view this work as a major contribution towards
building a scalable backend, whereas our work aims to provide a user interface
and communication design for data search and retrieval within the specific
setting of research data sharing.
    Other approaches are based on semantic links between data sets and research
articles [8], which give textual context for otherwise sparsely described data
content and therefore improve data search by means of established Information
Retrieval techniques. These data links, typically given by persistent
identifiers, however, point to entire data bundles as a whole, whereas our
approach aims to make individual indicators and values available for retrieval.


⁵ https://www.destatis.de
⁶ http://epp.eurostat.ec.europa.eu
⁷ http://data.worldbank.org
⁸ http://online.thomsonreuters.com/datastream/
⁹ http://de.statista.com
¹⁰ http://www.gesis.org/en/
¹¹ http://thedatahub.org
¹² http://dvn.iq.harvard.edu/dvn/dv/NEEO
¹³ http://www.nesstar.com
¹⁴ http://www.icpsr.umich.edu


4     System Architecture

Following the steps presented in Section 2, we elaborate on the system
architecture of our data retrieval system. To support Step 1, a thesaurus
should be used, so that data sets associated with a particular concept can be
found easily. To enable the specification of requirements, metadata must be
given in detail and in association with individual indicators and records
rather than as a separate metadata block for a zipped data bundle. This enables
the system to make sense of the data in depth and allows for requirement
specification, as explained later in Section 5.
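
How typed text might be resolved to thesaurus concepts for Step 1 can be
sketched as a case-insensitive prefix match over preferred and alternative
labels. The entries, the concept identifiers and the function name `complete`
are assumptions made for this example; they are not actual thesaurus
descriptors and do not reflect how the prototype queries its thesaurus.

```python
# Hypothetical thesaurus excerpt: concept id -> preferred and alternative labels.
THESAURUS = {
    "c1": {"pref": "Unemployment", "alt": ["Joblessness"]},
    "c2": {"pref": "Economic growth", "alt": ["Growth of the economy"]},
    "c3": {"pref": "Unit labour costs", "alt": []},
}


def complete(prefix, thesaurus=THESAURUS, limit=10):
    """Return sorted (label, concept id) pairs whose label starts with prefix."""
    needle = prefix.strip().lower()
    if not needle:
        return []
    hits = []
    for concept_id, labels in thesaurus.items():
        for label in [labels["pref"], *labels["alt"]]:
            if label.lower().startswith(needle):
                hits.append((label, concept_id))
    return sorted(hits)[:limit]


suggestions = complete("un")
synonym_hit = complete("job")
```

Because alternative labels resolve to the same concept identifier, a user who
types a synonym still reaches the data sets indexed under the preferred term.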
    The research on a data retrieval interface is part of our overall research
activities on an infrastructure for scientific data in the field of economics.
For several reasons we regard Semantic Web technologies as most suitable for
this purpose, among them their strength in dealing with distributed data and
their extensibility, which is required whenever highly specific long tail data
from individual researchers needs additional vocabulary for description [9].
At the same time, the data format should support typical data types such as
floats, strings and dates, and it must provide metadata at a fine-grained
level so as to open up possibilities for retrieval and composition. As a
consequence, the retrieval system operates on statistical data in the format
of the RDF Data Cube Vocabulary¹⁵ [10].
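
To make the notion of fine-grained metadata concrete, the following sketch
represents two observations as plain subject-predicate-object tuples, using
terms from the RDF Data Cube Vocabulary and the SDMX dimension and measure
vocabularies. The `http://example.org/` namespace, the identifiers and the
values are hypothetical, and dimension values are simplified to plain strings
where real data would often use URIs. No RDF library is used here.

```python
QB = "http://purl.org/linked-data/cube#"
SDMX_DIM = "http://purl.org/linked-data/sdmx/2009/dimension#"
SDMX_MEASURE = "http://purl.org/linked-data/sdmx/2009/measure#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace for this sketch


def observation(obs_id, dataset, area, period, value):
    """Describe one observation as a list of (subject, predicate, object)."""
    subject = EX + obs_id
    return [
        (subject, RDF_TYPE, QB + "Observation"),
        (subject, QB + "dataSet", EX + dataset),
        (subject, SDMX_DIM + "refArea", area),
        (subject, SDMX_DIM + "refPeriod", period),
        (subject, SDMX_MEASURE + "obsValue", value),
    ]


def select(triples, predicate, obj):
    """Subjects that carry the given predicate with the given object."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def measure_values(triples, subjects):
    """Observed values of the given observations, in ascending order."""
    return sorted(
        o for s, p, o in triples
        if s in subjects and p == SDMX_MEASURE + "obsValue"
    )


triples = (
    observation("obs1", "ds-a", "DE", "2010", 1.0)
    + observation("obs2", "ds-a", "DE", "2011", 2.0)
)
in_2010 = select(triples, SDMX_DIM + "refPeriod", "2010")
values_2010 = measure_values(triples, in_2010)
```

Since every observation carries its own dimension values, a requirement such
as a reference period can be checked per record rather than per data bundle.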
    The prototype was implemented in Java and JavaScript using the Play
Framework¹⁶. The live system was tested on Apache Tomcat¹⁷ with a Sesame
triple store¹⁸ holding the statistical data as RDF in the RDF Data Cube
Vocabulary¹⁹ [10].


5      User Interface Design

The system implements the multi-step retrieval interface described in Section
2. In the following, numbers in parentheses refer to the screenshots given in
Figures 1 to 8. Since the expected result is a data table, the main screen
starts with an empty spreadsheet (1). For Step 1, the user successively enters
the names of the concepts that are to be compared into the empty column
headers, as shown in (2). This task is supported by autocompletion on the basis
of the concept terms contained in a thesaurus, STW²⁰ in our case. Once a
concept is selected, the system displays the number of associated data sets
beneath the concept label. A click on this number lists all of them in
alphanumerical order (3), and another click reveals a detailed description and
further information on the particular data set (7).
    At this point, however, the number of data sets may be large, and the user
may decide to formulate requirements for the data first, as per Step 2. When a
single column header is selected, the panel on the left lists the union of all
properties and property values available in the metadata of all data sets
associated with the concept of the column (4). Hovering over a property or
property value produces an info box with documentation on the vocabulary.
Selecting a particular property value specifies a requirement: it tells the
system that only those data sets that provide this property and property value
are relevant for this column, and the number of relevant data sets drops. When
two or more column headers are selected, the panel on the left shows the
intersection of the properties and values of the individual columns (5). This
feature facilitates harmonization of data, as it reveals which data
characteristics can be unified among the columns.
    To specify the contents of the rows, one must specify the Dimension
property. A click on the respective header highlights all column headers of
the entire table to indicate that the property of choice must be available in
the data sets of all columns. The user selects one or more values from the
properties listed on the left, and the Dimension column fills accordingly (6).
This again sets requirements for the data sets, as it filters out all data
sets that do not provide the respective records.
    Finally, when all requirements are set, the user examines and selects from
the remaining list of data sets for each column (7). Once all remaining
properties with multiple options are bound to a value, the table fills with
actual data content (8). As a last step, the table is offered for download.

¹⁵ http://www.w3.org/TR/vocab-data-cube/
¹⁶ http://www.playframework.org
¹⁷ http://tomcat.apache.org
¹⁸ http://www.aduna-software.com/technology/sesame
¹⁹ http://www.w3.org/TR/vocab-data-cube/
²⁰ STW Thesaurus for Economics, http://zbw.eu/stw/
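
The union shown for a single column and the intersection shown for several
selected columns can be sketched as follows. The sketch assumes that each data
set is described by a flat property-to-value map; the function names and the
sample metadata are hypothetical and only illustrate the set logic, not the
prototype's implementation.

```python
def column_facets(datasets):
    """Union of property -> values over all data sets of one column."""
    facets = {}
    for metadata in datasets:
        for prop, value in metadata.items():
            facets.setdefault(prop, set()).add(value)
    return facets


def shared_facets(columns):
    """Properties and values that every selected column can provide."""
    if not columns:
        return {}
    per_column = [column_facets(datasets) for datasets in columns]
    shared = {}
    for prop in set.intersection(*(set(f) for f in per_column)):
        values = set.intersection(*(f[prop] for f in per_column))
        if values:  # keep a property only if some value is common to all
            shared[prop] = values
    return shared


# Hypothetical metadata of the data sets behind two columns.
unemployment = [
    {"frequency": "annual", "adjustment": "seasonal"},
    {"frequency": "monthly"},
]
growth = [
    {"frequency": "annual", "unit": "percent"},
    {"frequency": "quarterly"},
]

single = column_facets(unemployment)
common = shared_facets([unemployment, growth])
```

In this example only the annual frequency survives the intersection, which is
the kind of characteristic that can be unified across both columns.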


        Figs. 1 to 8. Screenshots of the retrieval interface, referred to as (1) to (8) in Section 5.


6     Conclusions and Outlook

Following the call for a research data infrastructure, we have addressed the
issue of data retrieval for the domain of economics and the social sciences,
where many scientific results are based on statistical data. With the prospect
of a rapidly growing amount of data being filed by individual researchers and
institutes in the future, keeping an overview of all relevant data sets
becomes a problem. For this purpose, we have designed an innovative retrieval
interface that aims to support researchers in finding and composing data sets
according to their natural way of approaching a research question. The
prototype presented in this paper provides simple and intuitive means for data
harmonization to achieve consistency with respect to the statistical
population. With these features, we expect a significant decrease in the time
needed for data search and composition in comparison to current practice,
although this is yet to be evaluated.
    Future improvements of the system should include retrieval from distributed
sources, as this version operates on a single triple store endpoint only.
Moreover, the advantages of using subproperty relations should be investigated
and made available to the user. Many other valuable improvements are
conceivable with regard to user assistance, e.g. warning notifications when
selected time series data include breaks, errors or changes in acquisition
method, which can be derived from well-maintained metadata.
    Finally, this approach needs to be tested on a large archive of various
kinds of statistical data and evaluated with end users from the target group
of empirical researchers.

References
 1. Gray, J.: Jim Gray on eScience: A Transformed Scientific Method (January 2007)
 2. Treloar, A., Harboe-Ree, C.: Data management and the curation continuum: how
    the Monash experience is informing repository relationships. Proceedings of VALA
    2008 (2007)
 3. Rümpel, S.: Data Librarianship: Anforderungen an Bibliothekare im
    Forschungsdatenmanagement (2010)
 4. Vlaeminck, S., Siegert, O.: Welche Rolle spielen Forschungsdaten eigentlich für
    Fachzeitschriften? Eine Analyse mit Fokus auf die Wirtschaftswissenschaften.
    Technical report, German Council for Social and Economic Data (RatSWD) (2012)
 5. Wood, J., Andersson, T., Bachem, A., Best, C., Genova, F., Lopez, D.R., Los, W.,
    Marinucci, M., Romary, L., Van de Sompel, H., Vigen, J., Wittenburg, P., Giaretta,
    D.: Riding the wave: How Europe can gain from the rising tide of scientific data.
    European Union (2010) Final report of the High Level Expert Group on Scientific
    Data: A submission to the European Commission.
 6. Feijen, M.: What researchers want - a literature study of researchers’ requirements
    with respect to storage and access to research data (February 2011)
 7. Kämpgen, B., Harth, A.: Transforming statistical linked data for use in OLAP
    systems. In: Proceedings of the 7th International Conference on Semantic Systems,
    ACM (2011) 33–40
 8. Boland, K., Ritze, D., Eckert, K., Mathiak, B.: Identifying references to datasets in
    publications. In Zaphiris, P., Buchanan, G., Rasmussen, E., Loizides, F., eds.:
    Theory and Practice of Digital Libraries. Volume 7489 of Lecture Notes in Computer
    Science. Springer Berlin Heidelberg (2012) 150–161
 9. Bahls, D., Tochtermann, K.: Addressing the long tail in empirical research data
    management. In: Proceedings of the 12th International Conference on Knowledge
    Management and Knowledge Technologies. i-KNOW ’12, New York, NY, USA,
    ACM (2012) 19:1–19:8
10. Cyganiak, R., Field, S., Gregory, A., Halb, W., Tennison, J.: Semantic
    statistics: Bringing together SDMX and SCOVO. In Bizer, C., Heath, T., Berners-Lee,
    T., Hausenblas, M., eds.: LDOW. Volume 628 of CEUR Workshop Proceedings,
    CEUR-WS.org (2010)

