Searching Data Portals – More Complex Than We Thought?

Laura M Koesten
The Open Data Institute; Univ. of Southampton, UK
laura.koesten@theodi.org

Jaspreet Singh
L3S Research Center, Hannover, Germany
singh@l3s.de

CHIIR 2017 Workshop on Supporting Complex Search Tasks, March 11, 2017, Oslo, Norway. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Published on CEUR-WS, Volume 1798, http://ceur-ws.org/Vol-1798/.

ABSTRACT
The amount of data published openly on the web is increasing rapidly. Most people use either web search or specialised data portals, which are repositories of datasets, to search for data. Most data portals today use similar faceted search interfaces. In this paper we focus on how a large governmental data portal in the UK supports users in conducting complex search tasks involving data. Based on a previous interview study with users of the portal, we constructed a typical complex work task. We analyzed how the current system supports users during this task and identified problems with the interface. Based on this, we discuss potential research directions to improve interfaces for complex data-related search tasks.

1 INTRODUCTION
We live in the age of data-driven decision making, where we take action based on insights gathered from a collection of relevant datasets. A dataset in our scenario refers to structured information collected by an individual or organisation and distributed in a standard format, for instance CSV files containing bus timetables collected by the local administration. Today, more than a million datasets have been made available by governments worldwide [7, 10]. The Web Data Commons project extracted no less than 233 million web tables containing structured data from HTML pages in 2015 [6]; earlier studies estimated the amount of structured data on the web to be over one billion sources in February 2011 [1].

With this increase in availability, searching for data is becoming more important. One of the primary ways to search for data on the web is through data portals, which are repositories of datasets. The European Data Portal¹ indexes, to date, 629,476 datasets published by regional and national authorities in EU countries; the official US government data portal² covers 193,976 and the UK portal³ covers 37,079 published datasets to date.

Data search presents many challenges, as ideas and tools from web search cannot (yet) be directly applied [9]. Using conventional web search engines is not ideal, as these have been designed primarily for documents, not data [1]. This has led to the creation of document surrogates for datasets which are indexed by search engines. These usually consist of a textual description and related metadata presented for human consumption.

¹ https://www.europeandataportal.eu/data/en/dataset
² http://www.data.gov
³ https://data.gov.uk/data/search

Searching for Data is Complex
When done in a work context, the search for data is often complex. A previous interview study with data professionals across a wide range of domains and skill sets [5] suggests that, in the majority of cases, searching for data shows characteristics of an exploratory or complex search task. That involves multiple queries, iterations and refinement of the original information need, as well as complex cognitive processing.

Data professionals, who are the primary users of such portals, often engage in tasks which involve more than one dataset and a sequence of queries to fulfill their information need. For example, tasks requiring datasets often involve trying to understand changes in data over time, or collecting different sources to make informed decisions based on relationships between them.

There are several aspects that add to the complexity of search tasks for data. In contrast to document search, users need skills to access and download data; interpret the different or limited formats data might be available in; and understand connected licences and metadata. Furthermore, data requires context to create meaning [2], that is, to make sense of it. In contrast to other digital objects that users search for, such as physical artifacts in a digital library, datasets contain information within them which can be used to contextualize them and so support a search process. We currently rely on metadata, which varies in quality and availability. However, we argue that utilising the original data to enrich metadata can provide relevant indexable content which would make data search more effective.

Decisions about the amount of context provided with the data are made by data publishers or by those designing data portals; interface design plays a key role in representing the context [4]. For example, the UI of the UK governmental data portal⁴, as shown in Figure 1, shows the format and the publishing organization for each dataset in the result list.

⁴ https://data.gov.uk/data/search

Motivating example. From discussions with experts and users of the portal we created an exemplary task:

You work for the local council in the city of York (UK) and you have been given the task to decide the top 3 areas in which to advertise NHS (National Health Service) health checks. These are checks recommended above a certain age by the NHS in the UK. An area that should be prioritized would be one where many people are eligible, but haven't participated before.

We know from previous studies that people experience difficulties in finding datasets [5]. In this paper, we focus on such complex
search tasks and illustrate how search user interfaces on current data portals support such tasks. We highlight drawbacks of the current search interface, such as snippets, dataset previews, and missing links between datasets. Following that, we give possible directions for improvement and further research.

2 SEARCH USER INTERFACE
Many data portals on the web offer similar faceted-search user interfaces. The search results are displayed using the "ten blue links" paradigm found in web search. To highlight the drawbacks of such interfaces, we select the UK governmental data portal's search interface (Figure 1) as a typical example⁵. The interface consists of a standard query bar and a series of facets to further filter results. Clicking on a search result takes the user to a preview page (Figure 2) that contains the textual description and metadata. Some pages also include a dataset preview by displaying a portion of the raw data. In this section we use the example task described in the introduction as a means to substantiate our claims.

⁵ https://data.gov.uk/data/search

A. Search Result Display. Search results are displayed similar to web search, with a title and short snippet. Further metadata is presented, including the data publisher (e.g. Public Health England), topical category (e.g. Society) and format (e.g. CSV). An individual search result's display should provide the user with sufficient information to judge the relevance, quality and usability of the results [5]. For our task we issued an initial query "NHS health check", which returned 1,233 results. The format, data publisher and the frequency of updates are displayed along with the title and first three lines of the description for each result. The system also provides a set of facets which aid the user in browsing the results. Due to the lack of a geographical facet, however, we refined our query to "NHS health check York" and got 19 results. When judging relevance, we found that there was no indication as to which part of the textual description of a dataset matched the query (title, description or metadata), as can be seen in the third search result in Figure 1. Such an indication would help assess the relevance of a search result, as it shows the context in which a given query term is used in the dataset. Little information about the granularity of the data is available, which makes it hard to judge whether the level of aggregation of the dataset is suitable for the task. Additionally, it was not apparent what type of data could be found in the dataset: geographical, time series, demographics, etc. This requires the user to download and open each dataset individually to get more details. Extending the existing facets with such content-oriented facets would support complex search tasks.

B. The dataset preview page can be accessed by clicking on a dataset. This page shows specific metadata, as seen in Figure 2; however, it did not give us an indication of the content of the dataset. On this page, next to a download link, the user can click on the button "details". This leads to a preview of the dataset which shows the headers and a sample of rows (for CSV files). This further gives an indication of relevance and quality, as it is the only item on the page which exposes the actual data. While the preview is useful, it offers neither a comprehensive overview of the content of the dataset nor support for interpreting the content. We argue that an overview has the potential to be more meaningful by exposing more information about the dataset to the user. An overview would, for example, give the range of values per column or the distinct locations mentioned in the data, and give information about the structural profile of the dataset.

The additional metadata displayed on the preview page is not helpful in selecting datasets for our task. For example, further categories of information are shown, such as "last updated" and "date updated"; the difference between them is not clear, and this information is only available for some dataset packages.

C. Limited support for discovering links between datasets. In complex search tasks users commonly need to link multiple datasets together. In our task we link NHS data with demographic data; for this we need to download the datasets and discover methods of making connections manually. The system provides no recommendations or links to similar or complementary datasets. Links also offer the possibility to understand how datasets are related to each other. A visualisation of links between datasets would reduce the cognitive load of the user and aid discovery. For our task, datasets that contain aggregated information about health checks in the York area could link to each other, or data referring to specific locations could be linked to geospatial boundary data.

3 FUTURE PERSPECTIVES
We envision an interface which is tailored to data search. Such an interface would better support users in (i) evaluating the relevance, quality and usability of a dataset result, (ii) getting a suitable overview of the dataset and (iii) finding related datasets.

For search result display we believe that presenting a brief overview of the actual content of the dataset would support users in assessing whether a result is relevant for their task. We found that many datasets have descriptions that are incomplete or short; this suggests space for future research to automatically generate better descriptions of datasets. We believe that novel methods to generate query-driven snippets from both the content and the human-generated description of a dataset would be highly beneficial, for instance by displaying headers, entities and/or summarising statistics of a relevant column or field alongside each search result. We propose to capture these as additional metadata that can also be utilised for faceted browsing and indexed to improve ranking efficiency. To judge data quality, we propose visual or textual indicators on the interface, backed up by automatically computed metrics, user-generated reviews and annotations, or reuse statistics.

In the dataset preview page, interactive visualisations could be used which allow users to choose their area of interest within a larger dataset and which provide a comprehensive overview of the content. Filtering, sorting and exploring different views of the data on demand are recommended for such tasks. The discovery and exploration of links should be supported by interfaces that visualise connections between different datasets or data points, and possibly represent data within a network, to help a user understand its meaning within the context of other data. This could also be used to create recommendation systems for datasets based on reuse or on datasets which were downloaded together, as well as on the content or structure of the dataset. Furthermore, we intend to experiment with search interface paradigms that go beyond ten blue links. We are planning to draw on techniques used in semantic web technologies, such as Cluster Maps [3], to provide graph-based visualizations that display connections between search results and other related datasets. Morton et al. [8] recommend three high-level capabilities for data exploration tools to support sensemaking: visual and interactive data exploration, data enrichment through recommendation systems, and data cleaning functionalities. Our suggestions can be seen as a step in that direction.

Acknowledgements. This work is supported by the European Union Horizon 2020 program under the Marie Sklodowska-Curie grant agreement No. 642795.

REFERENCES
[1] Michael J Cafarella, Alon Halevy, and Jayant Madhavan. 2011. Structured data on the web. Commun. ACM 54, 2 (2011), 72–79.
[2] Brenda Dervin. 1997. Given a context by any other name: Methodological tools for taming the unruly beast. Information seeking in context 13 (1997), 38.
[3] Christiaan Fluit, Marta Sabou, and Frank Van Harmelen. 2006. Ontology-based information visualization: toward semantic web applications. In Visualizing the semantic web. Springer, 45–58.
[4] Saul Greenberg. 2001. Context as a dynamic construct. Human-Computer Interaction 16, 2 (2001), 257–268.
[5] Laura M Koesten, Emilia Kacprzak, Jenifer Tennison, and Elena Simperl. 2017. The Trials and Tribulations of Working with Structured Data - a Study on Information Seeking Behaviour. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI '17). ACM, New York, NY, USA.
[6] Oliver Lehmberg, Dominique Ritze, Robert Meusel, and Christian Bizer. 2016. A large public corpus of web tables containing time and context metadata. In Proceedings of the 25th International Conference Companion on World Wide Web. 75–76. DOI:http://dx.doi.org/10.1145/2872518.2889386
[7] McKinsey Global Institute: James Manyika, Michael Chui, Peter Groves, Diana Farrell, Steve Van Kuiken, and Elizabeth Almasi Doshi. 2013. Open data: Unlocking innovation and performance with liquid information. (2013).
[8] Kristi Morton, Magdalena Balazinska, Dan Grossman, and Jock Mackinlay. 2014. Support the data enthusiast: Challenges for next-generation data-analysis systems. Proceedings of the VLDB Endowment 7, 6 (2014), 453–456.
[9] Soo Young Rieh, Kevyn Collins-Thompson, Preben Hansen, and Hye-Jung Lee. 2016. Towards searching as a learning process: A review of current perspectives and future directions. Journal of Information Science 42, 1 (2016), 19–34.
[10] Barbara Ubaldi. 2013. Open Government Data. (2013).

Figure 1: Search UI: The interface consists of a standard query bar and a series of facets to further filter results. Search results are displayed similar to web search, with a title and short snippet. Furthermore, metadata including the data publisher (e.g. Public Health England), topical category (e.g. Society) and format (e.g. CSV) are displayed. This interface is used at data.gov.uk/data/search, one of the largest European open data portals.

Figure 2: The dataset preview page can be accessed by clicking on a dataset. This page shows some metadata: format, publishing organisation and date, licence, an openness rating, topic tags on the portal, and the harvest URL and date.
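
Illustrative sketch. The dataset overview argued for in Sections 2 (B) and 3, giving the range of values per column and the distinct values mentioned in the data, could be computed along the following lines. This is a minimal Python sketch under our own assumptions, not a description of the portal or of any existing tool: the function name profile_csv, the choice of summary fields and the sample rows are hypothetical and serve only to make the idea concrete.

```python
import csv
import io


def profile_csv(text, max_examples=3):
    """Return a per-column overview of a CSV document given as a string.

    Hypothetical helper for illustration; not part of any portal API.
    """
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            value = (row.get(name) or "").strip()
            if value:  # ignore empty cells
                columns[name].append(value)

    overview = {}
    for name, values in columns.items():
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            numbers = None  # at least one cell is not numeric
        if numbers:
            # Numeric column: report the range of values.
            overview[name] = {"type": "numeric", "count": len(numbers),
                              "min": min(numbers), "max": max(numbers)}
        else:
            # Text (or empty) column: report distinct values and a few examples.
            distinct = sorted(set(values))
            overview[name] = {"type": "text", "count": len(values),
                              "distinct": len(distinct),
                              "examples": distinct[:max_examples]}
    return overview


# Sample rows invented purely for illustration; the figures are not real data.
SAMPLE = """area,year,eligible,checked
Acomb,2015,1200,300
Heworth,2015,950,410
Acomb,2016,1250,380
"""

if __name__ == "__main__":
    for column, summary in profile_csv(SAMPLE).items():
        print(column, summary)
```

A summary of this kind could be stored as additional metadata next to the publisher-supplied description, so that it is available both for display alongside a search result and for content-oriented facets; which statistics are actually useful to users remains an open question.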