Searching Data Portals – More Complex Than We Thought?
Laura M Koesten
The Open Data Institute; Univ. of Southampton
UK
laura.koesten@theodi.org

Jaspreet Singh
L3S Research Center, Hannover
Germany
singh@l3s.de

ABSTRACT
The amount of data published openly on the web is increasing rapidly. Most people use either web search or specialised data portals, which are repositories of datasets, to search for data. Most data portals today use similar faceted search interfaces. In this paper we focus on how a large governmental data portal in the UK supports users in conducting complex search tasks involving data. Based on a previous interview study with users of the portal, we constructed a typical complex work task. We analyzed how the current system supports users during this task and identified problems with the interface. Based on this, we discuss potential research directions to improve interfaces for complex data-related search tasks.

1    INTRODUCTION
We live in the age of data-driven decision making, where we take action based on insights gathered from a collection of relevant datasets. A dataset in our scenario refers to structured information collected by an individual or organisation and distributed in a standard format, for instance CSV files containing bus timetables collected by the local administration. Today, more than a million datasets have been made available by governments worldwide [7, 10]. The Web Data Commons project extracted no less than 233 million web tables containing structured data from HTML pages in 2015 [6]; earlier studies estimated the amount of structured data on the web to be over one billion sources in February 2011 [1].

With this increase in availability, searching for data is becoming more important. One of the primary ways to search for data on the web is through data portals, which are repositories of datasets. The European Data Portal¹ indexes, to date, 629,476 datasets published by regional and national authorities in EU countries; the official US government data portal² covers 193,976 and the UK portal³ covers 37,079 published datasets to date.

Data search presents many challenges, as ideas and tools from web search cannot (yet) be directly applied [9]. Using conventional web search engines is not ideal, as these have been designed primarily for documents, not data [1]. This has led to the creation of document surrogates for datasets, which are indexed by search engines. These usually consist of a textual description and related metadata presented for human consumption.

¹ https://www.europeandataportal.eu/data/en/dataset
² http://www.data.gov
³ https://data.gov.uk/data/search

CHIIR 2017 Workshop on Supporting Complex Search Tasks, Oslo, Norway.
Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Published on CEUR-WS, Volume 1798, http://ceur-ws.org/Vol-1798/.

Searching for Data is Complex
When done in a work context, the search for data is often complex. A previous interview study with data professionals across a wide range of domains and skill sets [5] suggests that, in the majority of cases, searching for data shows characteristics of an exploratory or complex search task. That involves multiple queries, iterations and refinement of the original information need, as well as complex cognitive processing.

Data professionals, who are the primary users of such portals, often engage in tasks which involve more than one dataset and a sequence of queries to fulfill their information need. For example, tasks requiring datasets often involve trying to understand changes in data over time, or collecting different sources to make informed decisions based on relationships between them.

There are several aspects that add to the complexity of search tasks for data. In contrast to document search, users need skills to access and download data; interpret different or limited formats the data might be available in; and understand connected licences and metadata. Furthermore, data requires context to create meaning [2], that is, to make sense of the data. In contrast to other digital objects, such as physical artifacts in a digital library, datasets contain information within them which can be used to contextualize them and so support a search process. We currently rely on metadata, which varies in quality and availability. However, we argue that utilising the original data to enrich metadata can provide relevant indexable content which would make data search more effective.

Decisions about the amount of context provided with the data are made by data publishers or by those designing data portals; interface design plays a key role in representing the context [4]. For example, the UI of the UK governmental data portal⁴, as shown in Figure 1, shows the format and the publishing organization for each dataset in the result list.

⁴ https://data.gov.uk/data/search

Motivating example. From discussion with experts and users of the portal we created an exemplary task:

You work for the local council in the city of York (UK) and you have been given the task to decide on the top 3 areas in which to advertise NHS (National Health Service) health checks. These are checks recommended above a certain age by the NHS in the UK. An area that should be prioritized would be one where many people are eligible, but haven't participated before.

We know from previous studies that people experience difficulties in finding datasets [5]. In this paper, we focus on such complex
search tasks and illustrate how search user interfaces on current data portals support such tasks.

We highlight drawbacks of the current search interface concerning snippets, dataset previews, and missing links between datasets. Following that, we give possible directions for improvement and further research.

Search results are displayed similarly to web search, with a title and short snippet. Furthermore, metadata including the data publisher (e.g. Public Health England), topical category (e.g. Society) and format (e.g. CSV) is displayed. Clicking on a result takes the user to a page that contains the textual description and metadata. Some pages also include a dataset preview by displaying some portion of the raw data.

2    SEARCH USER INTERFACE
Many data portals on the web offer similar faceted-search user interfaces. The search results are displayed using the ten blue links paradigm found in web search. To highlight the drawbacks of such interfaces, we select the UK governmental data portal's search interface (Figure 1) as a typical example⁵. The interface consists of a standard query bar and a series of facets to further filter results. Clicking on a search result takes the user to a preview page (Figure 2) that contains the textual description and metadata. Some pages also include a dataset preview by displaying a portion of the raw data. In this section we use the example task described in the introduction as a means to substantiate our claims.

⁵ https://data.gov.uk/data/search

A. Search result display. Search results are displayed similarly to web search, with a title and short snippet. Further metadata is presented, including the data publisher (e.g. Public Health England), topical category (e.g. Society) and format (e.g. CSV). An individual search result's display should provide the user with sufficient information to judge the relevance, quality and usability of the results [5]. For our task we issued an initial query "NHS health check", which returned 1,233 results. The format, data publisher and the frequency of updates are displayed along with the title and first three lines of the description for each result. The system also provides a set of facets which aid the user in browsing the results. Due to the lack of a geographical facet, however, we refined our query to "NHS health check York" and got 19 results. Subsequently, when judging relevance, we found that there was no indication as to which part of the textual description of a dataset matched the query (title, description or metadata), as can be seen in the third search result in Figure 1. Such an indication would help assess the relevance of a search result, as it shows the context in which a given query term is used in the dataset. Little information about the granularity of the data is available, which makes it hard to judge whether the level of aggregation of the dataset is suitable for the task. Additionally, it was not apparent what type of data could be found in the dataset: geographical, time series, demographics, etc. This requires the user to download and open each dataset individually to get more details. Extending the existing facets with such content-oriented facets would support complex search tasks.

B. The dataset preview page can be accessed by clicking on a dataset. This page shows specific metadata, as seen in Figure 2; however, it did not give us an indication of the content of the dataset. On this page, next to a download link, the user can click on the button "details". This leads to a preview of the dataset which shows the headers and a sample of rows (for CSV files). The preview gives a further indication of relevance and quality, as it is the only item on the page which exposes the actual data. While the preview is useful, it does not offer a comprehensive overview of the content of the dataset, nor support for interpretation of the content. We argue that an overview has the potential to be more meaningful by exposing more information about the dataset to the user. An overview would, for example, give the range of values per column or the distinct locations mentioned in the data, and give information about the structural profile of the dataset.

The additional metadata displayed on the preview page is not helpful in selecting datasets for our task. For example, further categories of information are shown, such as "last updated" and "date updated"; the difference between them is not clear, and this information is only available for some dataset packages.

C. Limited support for discovering links between datasets. In complex search tasks users commonly need to link multiple datasets together. In our task we link NHS data with demographic data; for this we need to download the datasets and discover methods of making connections manually. The system provides no recommendations or links to similar or complementary datasets. Links also offer the possibility to understand how datasets are related to each other. A visualisation of links between datasets would reduce the cognitive load of the user and aid discovery. For our task, datasets that contain aggregated information about health checks in the York area could link to each other, or data referring to specific locations could be linked to geospatial boundary data.

3    FUTURE PERSPECTIVES
We envision an interface which is tailored to data search. Such an interface would better support users in (i) evaluating the relevance, quality and usability of a dataset result, (ii) getting a suitable overview of the dataset, and (iii) finding related datasets.

For search result display we believe that presenting a brief overview of the actual content of the dataset would support users in assessing whether a result is relevant for their task. We found that many datasets have descriptions that are incomplete or short; this suggests space for future research to automatically generate better descriptions of datasets. We believe that novel methods to generate query-driven snippets from both the content and the human-generated description of a dataset would be highly beneficial. For instance, headers, entities and/or summary statistics of a relevant column or field could be displayed alongside each search result. We propose to capture these as additional metadata that can also be utilised for faceted browsing and indexed to improve ranking efficiency. To judge data quality, we propose visual or textual indicators on the interface, backed up by automatically computed metrics, user-generated reviews and annotations, or reuse statistics.

In the dataset preview page interactive visualisations could be used, which allow users to choose their area of interest within a larger dataset as well as providing a comprehensive overview of the content. Filtering, sorting and exploring different views of the data on demand are recommended for such tasks. The discovery and exploration of links should be supported by interfaces by
visualising connections between different datasets or data points, and possibly by representing data within a network, to help users understand its meaning within the context of other data. This could also be used to create recommendation systems for datasets based on reuse or on datasets which were downloaded together, as well as on content or structure of the dataset. Furthermore, we intend to experiment with search interface paradigms that go beyond ten blue links. We are planning to draw on techniques used in semantic web technologies, such as Cluster Maps [3], to provide graph-based visualizations that display connections between search results and other related datasets. Morton et al. [8] recommend three high-level capabilities for data exploration tools to support sensemaking: visual and interactive data exploration, data enrichment through recommendation systems, and data cleaning functionalities. Our suggestions can be seen as a step in that direction.

Acknowledgements. This work is supported by the European Union Horizon 2020 program under the Marie Sklodowska-Curie grant agreement No. 642795.

REFERENCES
[1] Michael J Cafarella, Alon Halevy, and Jayant Madhavan. 2011. Structured data on the web. Commun. ACM 54, 2 (2011), 72–79.
[2] Brenda Dervin. 1997. Given a context by any other name: Methodological tools for taming the unruly beast. Information seeking in context 13 (1997), 38.
[3] Christiaan Fluit, Marta Sabou, and Frank Van Harmelen. 2006. Ontology-based information visualization: toward semantic web applications. In Visualizing the semantic web. Springer, 45–58.
[4] Saul Greenberg. 2001. Context as a dynamic construct. Human-Computer Interaction 16, 2 (2001), 257–268.
[5] Laura M Koesten, Emilia Kacprzak, Jenifer Tennison, and Elena Simperl. 2017. The Trials and Tribulations of Working with Structured Data - a Study on Information Seeking Behaviour. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI '17). ACM, New York, NY, USA.
[6] Oliver Lehmberg, Dominique Ritze, Robert Meusel, and Christian Bizer. 2016. A large public corpus of web tables containing time and context metadata. In Proceedings of the 25th International Conference Companion on World Wide Web. 75–76. DOI: http://dx.doi.org/10.1145/2872518.2889386
[7] McKinsey Global Institute: James Manyika, Michael Chui, Peter Groves, Diana Farrell, Steve Van Kuiken, and Elizabeth Almasi Doshi. 2013. Open data: Unlocking innovation and performance with liquid information. (2013).
[8] Kristi Morton, Magdalena Balazinska, Dan Grossman, and Jock Mackinlay. 2014. Support the data enthusiast: Challenges for next-generation data-analysis systems. Proceedings of the VLDB Endowment 7, 6 (2014), 453–456.
[9] Soo Young Rieh, Kevyn Collins-Thompson, Preben Hansen, and Hye-Jung Lee. 2016. Towards searching as a learning process: A review of current perspectives and future directions. Journal of Information Science 42, 1 (2016), 19–34.
[10] Barbara Ubaldi. 2013. Open Government Data. (2013).
CHIIR 2017 Workshop on Supporting Complex Search Tasks, March 11, 2017, Oslo, Norway.                                                               Koesten & Singh


Figure 1: Search UI: The interface consists of a standard query bar and a series of facets to further filter results. Search results are displayed similarly to web search, with a title and short snippet. Furthermore, metadata including the data publisher (e.g. Public Health England), topical category (e.g. Society) and format (e.g. CSV) is displayed. This interface is used at data.gov.uk/data/search, one of the largest European open data portals.
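
Section 2 A observes that the result list gives no indication of which part of a dataset record (title, description or metadata) matched the query. A minimal sketch of how such an indication could be derived is given below. It assumes each search result is available as a dictionary of text fields; the function names and the sample record are hypothetical illustrations and are not taken from the portal's software.

```python
import re

# Hypothetical result record; the field names and values are illustrative only.
SAMPLE_RECORD = {
    "title": "NHS Health Check quarterly data",
    "description": "Number of people offered a check, by local authority.",
    "publisher": "Public Health England",
}


def matched_fields(query, record):
    """For each query term, list the record fields that contain it.

    Matching is case-insensitive and on whole words only.
    """
    hits = {}
    for term in query.lower().split():
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        hits[term] = [name for name, text in record.items() if pattern.search(text)]
    return hits


def snippet(query, text, width=40):
    """Return a window of `text` around the first query term that occurs in it.

    Falls back to the start of the text when no term is found.
    """
    for term in query.split():
        match = re.search(re.escape(term), text, re.IGNORECASE)
        if match:
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            return text[start:end]
    return text[: 2 * width]


if __name__ == "__main__":
    for term, fields in matched_fields("NHS health check York", SAMPLE_RECORD).items():
        print(term, "->", fields or "no match")
```

A term with an empty field list, such as "york" for the sample record, is itself useful to show, since it tells the user that the result does not mention the requested location at all.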


Figure 2: The dataset preview page can be accessed by clicking on a dataset. This page shows some metadata: format, publishing organisation and date, licence, an openness rating, topic tags on the portal, and the harvest URL and date.
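
Section 2 B argues that an overview giving the range of values per column and the distinct values mentioned in the data would be more meaningful than a raw preview. One way such a structural profile could be computed for a CSV file is sketched below. The function `profile_csv` and the sample rows are assumptions made for illustration (the figures are invented); this is not the portal's implementation.

```python
import csv
import io

# Invented sample data, used only to illustrate the profile.
SAMPLE_CSV = """area,year,eligible,checked
Acomb,2015,1200,300
Heworth,2015,950,410
Acomb,2016,1250,380
"""


def profile_csv(text):
    """Summarise each column of a CSV document.

    Every column gets a count of non-empty and distinct values. Columns whose
    values are all numeric also get a min/max range; other columns get up to
    three example values.
    """
    rows = list(csv.DictReader(io.StringIO(text)))
    profile = {}
    for column in (rows[0].keys() if rows else []):
        values = [row[column] for row in rows if row[column]]
        summary = {"non_empty": len(values), "distinct": len(set(values))}
        try:
            numbers = [float(value) for value in values]
        except ValueError:
            numbers = []
        if numbers:
            summary["min"], summary["max"] = min(numbers), max(numbers)
        else:
            summary["examples"] = sorted(set(values))[:3]
        profile[column] = summary
    return profile


if __name__ == "__main__":
    for column, summary in profile_csv(SAMPLE_CSV).items():
        print(column, summary)
```

Summaries of this kind could also be stored as additional metadata, which is what Section 3 proposes for faceted browsing and indexing.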