=Paper=
{{Paper
|id=Vol-1597/PROFILES2016_paper3
|storemode=property
|title=Position Paper: Dataset profling for un-Linked Data
|pdfUrl=https://ceur-ws.org/Vol-1597/PROFILES2016_paper3.pdf
|volume=Vol-1597
|authors=Emilia Kacprzak,Laura Koesten,Tom Heath,Jeni Tennison
|dblpUrl=https://dblp.org/rec/conf/esws/KacprzakKHT16
}}
==Position Paper: Dataset profling for un-Linked Data==
Position Paper: Dataset profiling for un-Linked Data Emilia Kacprzak1,2 , Laura Koesten1,2 , Tom Heath1 , and Jeni Tennison1 1 The Open Data Institute, London, United Kingdom {emilia.kacprzak,laura.koesten,tom.heath,jeni.tennison}@theodi.org, 2 University of Southampton, Southampton, United Kingdom Abstract. The vast amount of data on the web presents a growing need to advance data search. Rich and meaningful metadata can enhance the discovery of datasets and establish connections between them. Where metadata is not comprehensive, it can be expanded through dataset pro- filing. The relative importance of different types of profiles varies de- pending on the user’s context and the objective of the task. We discuss an approach to find un-Linked datasets and increase result relevance by offering related information. We propose generating rich profiles for datasets; counting the number and strength of relations between them and showing a graph of profiles that represents connections between dif- ferent datasets. We can thereby capture correlations between datasets that can then improve the efficiency and effectiveness of data search. If developed further this would improve discoverability and reusability of datasets. Keywords: Profiles, Metadata, Data search, Discoverability, Dataset 1 Introduction The availability of data is growing rapidly and with it, big data technologies and services. The vast amount of information available presents unique challenges around data integration, data ownership or assuring data quality to users [1]. One of the key challenges is the hidden character of many published datasets - this can be ascribed to a number of factors, one of which is the lack of useful metadata. Data search techniques are strongly dependent on adequate informa- tion about the datasets [2]. Searching is based on meaningfully indexed content, hence profiles are needed that describe the data in a manner that is most useful for discoverability. Value derived from data is generated to a great extent through understand- ing relations between different datasets and between the entities they describe. Most techniques for extracting information from datasets rely on the statisti- cal characteristics of the data or do not scale well [3]. Improving the quality and comprehensiveness of profiles can enable the discovery of datasets as well as enable the development of links between them; be they Linked or un-Linked data. 2 The remainder of this paper is structured as follows. Section 2 describes related work that influenced the concept presented in section 4. Section 3 gives a rationale for the focus on un-Linked data and presents use scenarios. 2 Related work 2.1 Profiling techniques Data profiling comprises a broad range of methods to efficiently describe datasets [3]. Analysis of individual cells or columns is used to create summaries of data [4]. Relationships between values in the data are described as association rules; which have been suggested as additional profiling activities. Naumann proposes joining two datasets based on their profiles, similar to a reverse engineering process which reveals possible relations between datasets based on constraints identified in the data [3]. Cell-level and column-level anal- ysis enables effective schema matching between datasets. Thereby initially not related schemas can be mapped and semantically correct correspondences can be revealed [5]. Integration of seemingly unconnected datasets can provide ad- ditional insights on the original data. These profiling techniques are strongly focused on numerical data and creat- ing associations between datasets columns. On the other side of the spectrum, Linked Data is used to improve profiling processes. Fetahu et al. propose build- ing a graph of linked datasets including the relative importance of each node, each of which is an openly available dataset. This is done with the objective of improving profiling accuracy [6]. Therefore, they developed an algorithm which adapts PageRank, K-Step Markov, and HITS models along with topical pro- files generated with DBpedia. Profiles are enriched by the use of Linked Data (ibid.). Our approach is complementary to this, attempting to create links be- tween datasets that are not already linked, by use of their profiles; as can be seen in section 4. Users should be involved in the process of searching for data and richer pro- files can support that search process. Elefi et al. emphasise the usefulness of explicit semantic information for the dataset discovery task to support the in- terlinking of datasets [7]. Wagner et al. present an approach to improve Linked Data search by entity-based data-source contextualization. In this they propose involving the user in the dataset selection process and place emphasis on the im- portance of not only offering exactly matching sources, but offering supplemen- tary information from additional sources to provide context. Additional sources are defined as being relevant to the user’s information need, rather than to the exact query. The proposed framework in section 4. follows a similar objective, but is not limited to Linked Data. 2.2 Types of profiles The most basic type of profiles are structural profiles for which no external re- source analysis is needed. These are created by reporting the number of columns, 3 rows and their data types, along with information about uniqueness and com- pleteness of the data [4, 3]. This leads to explicit single value analysis which is then often defined as a numerical or statistical profile. To generate additional insights, topical profiles provide connections to different resources, for example DBpedia, by evaluating the datasets topic coverage [6]. Spatial profiles describe geographical data [8]. If the geographic position could be expressed as a place it can be used in the same way as topical profiles. Although these different types of profiles exist, they were created to solve particular problems and are rarely used together. If combined meaningfully, these profiles could enhance discoverability of datasets even further. Section 4 describes how the availability of richer profiles could be used to understand connections between datasets. 3 Finding connections in un-linked data 3.1 Why un-linked data? Existing research on techniques for understanding the relationships between datasets, such as [2],[9],[10] are focused on Linked Data. However, most data that can be found on data portals is not published as Linked Data. An anal- ysis of open data available on data.gov.uk shows that out of all that are pub- lished under an Open Government license, less than 2% are RDF (12,470 open datasets published of which 213 are in RDF, 1.7%). This number excludes non- machine-interpretable formats such as PDF and HTML [11]. Hence, we argue that methods contributing to the improvement of profiles for all data formats enable discoverability and ultimately linking of data. 3.2 Use scenarios We are focused on three use scenarios that address different user needs. These serve as high-level exemplification of the variety of issues users can encounter when attempting to find datasets. Apart from the type of data, we note that the context and the individual objective when searching for datasets influences the relevance of a specific type of profile for a specific task. Scenario A Posing a question that can be answered by consulting a single data set. This scenario implies limited time and there should be no, or very limited, cost associated. Accuracy and licensing can be of varying importance. An example would be looking for the closest free wifi to the users current location. The profile should give information about whether or not this dataset contains data about my location and how up to date it is. Scenario B Looking for datasets to compare for relevance and quality to be integrated into an application. Time and cost can vary in importance. Here 4 information about granularity and reusability (licensing, format) of the data can be more important. For example, if a user wants to evaluate the air quality in London in 1995 in comparison to today and the metadata mentions the years 2000 to 2005, a naive search for the year 2004 might not provide results. Scenario C Comparison of as many as possible available datasets on a topic for research purposes. Here, more time is available and cost is less of an issue. Archived data can be important; coverage and granularity are selection criteria. For example, a user may want to compare all datasets available on a specific city to evaluate which areas to support financially. What is evident from these scenarios is that considering the context of the user and the objective of the task, search process and source selection may vary. Furthermore the presentation of results and the way different users might want to interact with those results is also dependent on these factors. The proposed approach in section 4 serves as an initial step to resolve the issues described in the above scenarios, by allowing user interaction in the process of source selection. 4 Connecting un-linked data As we have discussed, structural and statistical profiles can enhance the discovery of relations between datasets. Knowing the general structure of the dataset with its data types and value ranges can be the first step in narrowing down the number of columns that can relate to each other. This can be used to create actual links between unlinked datasets, e.g. by creating foreign keys between them [4]. Other types of profiles can be used for linking datasets in a broader, less strict, sense. Topical and spatial summaries are descriptions of datasets on a semantic level. Similar topics, coverage or data patterns can indicate relationships between datasets that could inform search. We propose to build a weighted graph based on the profiles of un-Linked data. By generating rich profiles for datasets and counting the number and strength of relations between them,correlations between datasets could be captured which can then improve the efficiency and effectiveness of search within open data portals, such as data.gov.uk. In accordance with Wagner et al. [9], we propose offering a broad range of results including contextual information, as well as the possibility of user involve- ment during source selection. The search process would be based on number and strength of connections between the profiles; after an initial search, the weighting of these individual connections could be influenced by the user. In the first step, based on a query, keyword extraction can be used to generate ranked keywords which, as well as informing an initial search, are presented to the user in order of importance. The user can interact with the list of keywords, by swapping them with each other, to achieve more relevant results. 5 The relevance of the keywords to the user influences the subset of datasets that appear within this graph - not all keywords are equally important to the user. Results that match with all keywords are of higher importance; datasets that match with only a few would be included if there were not enough results. Fig. 1(a) shows an example query about the population of London in 1995 and the datasets that matched the query the most. It also illustrates the strength of the connections between keywords and the profiles of the suggested datasets. This increases the end users awareness of the dataset content before opening it. As a second step, a graph of connections between related datasets is presented, Fig. 1. (a) Graph of connections between keywords and datasets, based on topic rele- vance (b) Results presentation of graph suggesting related datasets which is based on one chosen dataset from the results in fig 1(a). The weighting of connections between datasets are based on the similarity of the their profiles and the chosen datasets that matched the keywords from the original query, taking into account the users priorities. These initially identified datasets serve as recommendations and aid exploration of related information. Fig. 1(b) presents results after the user chose the dataset D1 to be the most relevant for their information need. To improve the search process, other profiles which have strong connections to the profile of D1 are presented along with information about the type of the connections between them. The graph should present an adjustable number of results in a graphical visualisation. The argument for the graph is to deepen the user’s understanding and the mental model of the search process. The availability of relevant data depends on the topic as well as the specificity of the search. Hence the number, the availability and the similarity of the datasets presented as results can vary significantly. Data profiling is naturally a user-oriented task [3] and we argue that following on from that, the user should be involved in source selection to provide more advanced data search. 5 Conclusions and future work Profile features vary in their importance dependent on the users context and objective. When looking at the use scenarios, section 3, different factors can be 6 important for users. Concerning the interaction with the system these concern time, cost, computational power, trust. Beyond that users have varying require- ments on the type of information covered in metadata, depending on their task and context. The approach described here would improve discoverability and aids all of the mentioned use scenarios as the user can be involved in the process of weighting the results. This can decrease the importance of the accuracy of ranking algo- rithms and enhance the quality of results according to ones individual needs. The possibility to detect similarities between profiles provides the basis for ex- plorative search, engaging the user in source selection. To further develop this ap- proach and validate its applicability we aim to develop profiles that support the identification of connections between datasets and topics, and between datasets and other datasets. Following from that we aim to develop interfaces, such as user-dependent result presentations, that support explorative data search. As the proposed graph is created on an abstract level and the connections are formed based on the datasets profiles the approach is not limited to a type of data, but can be used for Linked Data and un-Linked data at the same time; as the focus is to improve user experience. Copyright held by the authors References 1. Nasser, T., Tariq, R. S.: Big Data Challenges. J. Comput. Eng. Inf. Technol. vol. 4:3, 1000135 (2015) 2. Assaf, A., Troncy, R. , Senart, A.: Roomba: An extensible framework to validate and build dataset profiles. In: The Semantic Web: ESWC 2015 Satellite Events, pp. 325-339. Springer International Publishing (2015) 3. Naumann, F.: Data profiling revisited. ACM SIGMOD Record 42, no. 4., pp. 40-49. (2013) 4. Abedjan, Z., Golab, L., Naumann, F.: Profiling relational data: a survey. The VLDB Journal, vol 24(4), 125. (2015) 5. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. The VLDB Journal 10: 334350. (2001) 6. Fetahu B., Dietze S., Nunes B.P., Casanova M. A.,Taibi D, Nejd W.l.: A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles (2014) 7. Ellefi, M. B., Bellahsene Z., Scharffe F. ,Todorov K. Towards Semantic Dataset Profiling. In: PROFILES@ ESWC (2014) 8. Shekhar, S., Celik, M., George, B., Mohan, P., Levine, N., Wilson, R. E., Mohanty, P. Spatial analysis of crime report datasets. National Science Foundation (NSF), Washington DC (2010) 9. Wagner, A., Haase P., Rettinger A., Lamm, H.. Entity-based data source contextu- alization for searching the Web of data. In: The Semantic Web: ESWC 2014 Satellite Events, pp. 25-41. Springer International Publishing (2014) 10. Boehm, C., Lorey J., Naumann F.: Creating void descriptions for web-scale data. Web Semantics: Science, Services and Agents on the World Wide Web 9, no. 3, pp. 339-345. (2011) 11. Data.gov.uk., Beta: Datasets. 15/03/16: https://data.gov.uk.