Towards a SPARQL 1.1 Feature Benchmark on Real-World Social Network Data Martin Przyjaciel-Zablocki, Alexander Schätzle, Thomas Hornung, and Io Taxidou Department of Computer Science, University of Freiburg, Germany {zablocki,schaetzle,hornungt,taxidou} @informatik.uni-freiburg.de Abstract. In recent years, social networks have fundamentally changed our perception of the web and the way we interact with it. At the same time we have witnessed the vision of the “Semantic Web” picking up pace. From a general perspective, the inherent complex intertwined struc- ture of a social network contains a flood of semantic information about users, objects and their relations. On the other hand, social graph struc- tures are hardly covered by current state-of-the-art RDF benchmarks. Moreover, synthetic graph generators do not model all properties of a social network, especially structural correlations are either neglected or underrepresented. Considering the complex structure of a social graph, the enhanced features of SPARQL 1.1 open up new valuable possibilities, but these features are also currently neglected by most of the existing benchmarks. In this paper we introduce our concept of a new RDF bench- mark based on real-word social network data gathered from Last.fm with a special focus on SPARQL 1.1. 1 Introduction The advent of the “Semantic Web” promotes the growing adoption of RDF and SPARQL as its core technologies. We believe that current initiatives like schema.org, Google’s Knowledge Graph, the Linking Open Data (LOD) cloud as well as structured data markups for search engine optimization will further drive the propagation of these technologies. Furthermore, social networks like Google+, Facebook, Twitter, Last.fm etc. dramatically change the way how people interact, collaborate and share information, turning the traditional “Web of Documents” into an highly interactive and personalized interlinked “Web of Data”. According to this perspective, one can also interpret a social network graph as structured semantic data interlinking people and objects that can also be represented in RDF. This is also underpinned by the support for RDF added to Facebook’s Graph API in 2011 [18]. On the other hand, there is a lack of real-world RDF benchmark data in general, and social network data in particular [4, 16]. Most of the existing RDF benchmarks like BSBM [2], LUBM [7] or SP2 Bench [15] use artificial data gen- erated according to observed frequency distributions of a specific domain. While this approach allows to easily scale the size, the generated datasets have little in common with real RDF data as they resemble relational database bench- marks [12] with a high level of structuredness [4]. One of the few benchmarks using real data is the DBpedia SPARQL benchmark (DBPSB) [12] based on data dumps from DBpedia. However, compared to the size and dynamics of so- cial graphs, the DBpedia data is rather small and also limited in growth. The dataset used in [12] had a total size of ∼150 million RDF triples which shouldn’t pose a challenge for state-of-the-art RDF triple stores. Structural correlations are ubiquitous in social graphs, e.g. friendship rela- tionships are correlated with the place of residence. Knowledge of these cor- relations can have an important impact on query optimization but identifying them is a non-trivial task. The S3G2 data generator for structure-correlated social graphs [14] that is used for the Social Network Intelligence Benchmark (SIB) [3] focuses on this aspect. It can be used to generate arbitrary large social graphs with a pre-defined set of structural correlations. However, the authors emphasize that the generator will not produce ”realistic” social network data as these networks are expected to have many more (yet unknown) correlations. To overcome the conceptual shortcomings of synthetic data generators we outline a benchmark based on real-world data gathered from the social music network Last.fm. We decided to use Last.fm since it exhibits the characteristics of a social network (cf. Section 2) with millions of users and provides a public API1 to access the data. In addition, our benchmark will focus on the new features of SPARQL 1.1 [8], i.e. Property Paths, Aggregates, Subqueries and Negation, as they open up new possibilities for more sophisticated graph queries (cf. Section 3) which are of special interest for social networks regarding their typically complex graph structure. Indeed, these new features are neglected or not considered at all by current popular RDF benchmarks. To the best of our knowledge, this will be the first RDF benchmark on large-scale real-world social network data with a special focus on SPARQL 1.1. Paper Structure. Section 2 discusses the social network characteristics of Last.fm as well as the fragment that we will use for our benchmark dataset. In Section 3 we introduce some exemplary queries to demonstrate the power of SPARQL 1.1 for querying social network graphs, followed by a conclusion in Section 4. 2 Last.fm Benchmark Data Last.fm is an online music service with manifold relations between people, artists, tracks, etc. that constitute a highly connected graph with a large variety of correlations. In order to justify Last.fm as an appropriate base for our benchmark dataset, we analyzed its underlying social graph and investigated common social network characteristics. First of all, we crawled about 1.7 million users with close to 13.6 million friendship relationships using a Breadth-First Search (BFS) strategy. We are aware of the biases introduced by BFS in terms of degree 1 http://www.last.fm/api 106 100 105 10−1 avg. clustering coefficient 104 10−2 no. of nodes 103 10−3 102 10−4 101 10−5 100 10−6 1 10 100 1000 10000 100000 1 10 100 1000 10000 100000 degree degree Fig. 1: (a) Degree distribution (left), (b) Avg. cc-degree distribution (right) distribution and clustering coefficient [9, 19] as it tends to visit nodes of high degree to the detriment of nodes with lower degree. As a result, average degree is overestimated while clustering coefficient is underestimated since high degree nodes are characterized by a low clustering coefficient [11, 19]. This issue will be addressed for the final benchmark dataset with more sophisticated crawling techniques, in particular a modified Metropolis-Hasting Walk as proposed in [5, 6] that corrects the bias directly during the walk. Overall, the skewed degree distribution (cf. Figure 1 (a)), average clustering coefficient per degree (cf. Figure 1 (b)) and average path length of 4.2 indicate typical scale free properties of a social network [17]. A more detailed discussion is provided in Appendix A. Titel 19. März 2013 2.1 Benchmark Dataset Our benchmark dataset considers more than only the friendship relationships (cf. Figure 2 for an overview) and will contain several billion RDF triples while retaining typical properties of the underlying social graph from Last.fm. Track User trackID userID name name User_friends artist realname User_neighbours User Artist Artist_similar url age duration country listeners gender playcount url Track album playlists trackMbID playcount User_topTags artistMbID timestamp albumMbID timestamp Tag_similar Tag Album_topTags Album Tag_topAlbums durch Fig. 2: Schema of our Last.fm benchmark dataset Zeichenblatt 1 To transform the obtained social graph into an RDF graph it is crucial to define an ontology for the schema shown in Figure 2. While some entities and relations are easy to define using existing vocabularies like the FOAF-Ontology2 , others require the introduction of new ones. An excerpt of the resulting RDF graph is shown in the following: @prefix foaf: . @prefix lb: . lb:user1 a foaf:Person, lb:User ; foaf:age "30" ; lb:lovedTrack lb:track1, lb:track2, lb:track3, lb:track4 . lb:track1 a lb:Track ; lb:artist lb:artist1 ; lb:topFan lb:user1, lb:user2, lb:user3 . An important aspect for RDF benchmarks is data structuredness as available RDF datasets have a highly varying structure in contrast to the strongly struc- tured relational data model. However, most existing benchmarks also exhibit a high level of structuredness, similar to relational data, that is practically fixed, i.e. scaling the size of the dataset has no real influence [4]. Since we agree that an RDF triple store should be tested against heterogeneously structured datasets, we envision to use the benchmark generator described in [4] to downsize the overall dataset such that it is not only possible to vary the size of the dataset but also the desired level of structuredness. In contrast, increasing the dataset artificially by means of collected statistics is not considered since it contradicts our idea of a benchmark based on real-world social network data. 3 Last.fm Benchmark Queries SPARQL is the W3C recommended query language for RDF. With the proposed recommendation for SPARQL 1.1 [8], the W3C addresses the lack of impor- tant features ranging from intuitive navigational queries of arbitrary length via some (limited) interference possibilities to complex matchings with support for subqueries, aggregation and negation. The importance of efficient and compre- hensive support for these kind of queries justifies the endeavour for an appro- priate benchmark geared towards comparing and improving the performance of SPARQL 1.1 expressions in current RDF triple stores. The following example queries are only intended to illustrate how to exploit SPARQL 1.1 features for exploring interesting graph properties in an intuitive and easy manner within our Last.fm benchmark dataset3 . A. Find all people that are connected to a user via an arbitrary FOAF distance. According to [1], the evaluation of this query might show poor performance using the SPARQL 1.1 specification from 2011. Recent changes to the property path specification adopt the idea of a (non-counting) semantics as proposed in [1] that can be evaluated more efficiently. SELECT DISTINCT ?name WHERE { ?userA foaf:name %username% . ?userA (foaf:knows)* ?userB . ?userB foaf:name ?name FILTER (?userA != ?userB) } 2 http://www.foaf-project.org/ 3 Placeholders are indicated by leading and trailing ”%”. B. What are the top-k track recommendations for a user based on the listening history of his friends? The ranking considers all kind of tracks of a friend but excludes already known tracks. This query exploits the capabilities of expressing negation, alternative property paths and aggregation in SPARQL. SELECT ?track count(*) as ?playcount WHERE { ?userA foaf:name %username% . ?userA foaf:knows ?userB . ?userB (lb:recentTrack | lb:lovedTrack | lb:topTracks) ?track . MINUS { ?userA (lb:recentTrack | lb:bannedTrack | lb:lovedTrack | lb:topTrack) ?track }} GROUP BY ?track ORDER BY DESC(?playcount) LIMIT %k% C. What are the top-k friends of a user with a common music taste and a maximum FOAF distance of 3? The ranking considers common ”loved” tracks. SELECT ?friend count(?friendLovedTrack) as ?commonTracks WHERE { ?user foaf:knows/foaf:knows?/foaf:knows? ?friend . ?user lb:lovedTrack ?userLovedTrack . ?friend lb:lovedTrack ?friendLovedTrack . FILTER (?userLovedTrack = ?friendLovedTrack) FILTER (?user != ?friend) } GROUP BY ?friend ORDER BY DESC(?commonTracks) LIMIT %k% D. Who are the most popular artists with an average fan age above 30 years? The popularity of an artist can be estimated by counting the number of users that refer to him as top artist. The query uses a subquery to first determine those artists with an average fan age above 30 years. Then, an inverse path (object to subject) is used to obtain top artists for users to compute the popularity. SELECT ?artist count(?user) as ?noOfFans WHERE { ?artist ^lb:topArtist ?user . { SELECT ?artist WHERE { ?artist lb:topFan / foaf:age ?age } GROUP by ?artist HAVING ( avg(?age) > 30 ) }} GROUP BY ?artist ORDER BY ?noOfFans 4 Conclusion Although social networks are an increasingly important source of semantic in- formation, they are hardly covered by existing RDF benchmarks. The most comprehensive approach is the Social Network Intelligence Benchmark (SIB) [3] that is built upon an artificial structure-correlated social graph [14]. However, the authors emphasize that the generated network is not “realistic” in the sense that many more (often unknown) structural correlations can be expected in real social networks. For this reason, we will build our benchmark on real data gath- ered from the social music network Last.fm. Since RDF triple stores should be tested against varying levels of data structuredness, we plan to use the bench- mark generator described in [4] such that it is not only possible to downsize the dataset but also to vary the desired level of structuredness. Extending the rather limited path query expressiveness of SPARQL 1.0, the new features of SPARQL 1.1 allow more sophisticated queries for new application fields ranging from graph analysis to recommender functionalities. Although the efficient support of these features will be a key requirement for modern RDF triple stores, they are currently neglected by most existing RDF benchmarks. Therefore, we will especially design a comprehensive set of benchmark queries as indicated in Section 3 that cover a wide range of SPARQL 1.1 features. References 1. Arenas, M., Conca, S., Pérez, J.: Counting Beyond a Yottabyte, or how SPARQL 1.1 Property Paths will Prevent Adoption of the Standard. In: WWW. pp. 629–638 (2012) 2. Bizer, C., Schultz, A.: The Berlin SPARQL Benchmark. International Journal on Semantic Web and Information Systems (IJSWIS) 5(2), 1–24 (2009) 3. Boncz, P., Pham, M.D., Erling, O., Mikhailov, I., Rankka, Y.: Social Network Intel- ligence BenchMark, http://www.w3.org/wiki/Social_Network_Intelligence_ BenchMark 4. Duan, S., Kementsietsidis, A., Srinivas, K., Udrea, O.: Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets. In: SIGMOD. pp. 145– 156 (2011) 5. Gjoka, M., Butts, C.T., Kurant, M., Markopoulou, A.: Multigraph Sampling of Online Social Networks. IEEE Journal on Selected Areas in Communications 29(9), 1893–1905 (2011) 6. Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: Walking in Facebook: A Case Study of Unbiased Sampling of OSNs. In: INFOCOM. pp. 2498–2506 (2010) 7. Guo, Y., Pan, Z., Heflin, J.: LUBM: A benchmark for OWL knowledge base sys- tems. Web Semantics: Science, Services and Agents on the World Wide Web 3, 158 – 182 (2005) 8. Harris, S., Seaborne, A., Prud’hommeaux, E.: SPARQL 1.1 Query Language. W3C Proposed Recom. (2008), http://www.w3.org/TR/sparql11-query/ 9. Kurant, M., Markopoulou, A., Thiran, P., Thiran, P.: On the bias of BFS (Breadth First Search). In: International Teletraffic Congress. pp. 1–8 (2010) 10. Milgram, S.: The small world problem. Psychology today 2(1), 60–67 (1967) 11. Mislove, A., Marcon, M., Gummadi, P.K., Druschel, P., Bhattacharjee, B., Bhat- tacharjee, B.: Measurement and analysis of online social networks. In: Internet Measurement Comference. pp. 29–42 (2007) 12. Morsey, M., Lehmann, J., Auer, S., Ngomo, A.C.N.: DBpedia SPARQL Bench- mark - Performance Assessment with Real Queries on Real Data. In: International Semantic Web Conference (1). pp. 454–469 (2011) 13. Newman, M.E., Park, J.: Why social networks are different from other types of networks. Physical Review E 68(3), 036122 (2003) 14. Pham, M.D., Boncz, P., Erling, O.: S3G2: A Scalable Structure-Correlated Social Graph Generator. In: Selected Topics in Performance Evaluation and Benchmark- ing, LNCS, vol. 7755, pp. 156–172. Springer Berlin Heidelberg (2013) 15. Schmidt, M., Hornung, T., Lausen, G., Pinkel, C.: SP2Bench: A SPARQL Perfor- mance Benchmark. In: ICDE. pp. 222–233 (2009) 16. Voigt, M., Mitschick, A., Schulz, J.: Yet Another Triple Store Benchmark? Prac- tical Experiences with Real-World Data. In: Proceedings of the 2nd International Workshop on Semantic Digital Archives. vol. 912, pp. 85–94 (2012) 17. Watts, D., Strogatz, S.: The small world problem. Collective Dynamics of Small- World Networks 393, 440–442 (1998) 18. Weaver, J., Tarjan, P.: Facebook Linked Data via the Graph API. Semantic Web Journal http://iospress.metapress.com/content/T2745678826V6422 19. Ye, S., Lang, J., Wu, S.F.: Crawling Online Social Graphs. In: APWeb. pp. 236–242 (2010) A Analysis The Last.fm network contains many entities and relationships. For the analysis we focus on reciprocated friendship relationships in order to grasp the social aspect. We crawled 1.860.215 users and 13.690.576 friendship relationships using Breadth-First Search (BFS). Degree distribution is crucial in order to characterize a network as social network. The degree of a node is defined by the number of links incident to a node. On Figure 1a the degree distribution is skewed with the majority of nodes having a low degree while very few nodes have significantly higher degree. This is a typical behaviour of social networks. Clustering coefficient is another important characteristic of social graphs and represents the tendency of nodes to form tight clusters. This metric is defined as the number of links that exist between a node’s neighbours divided by the maximum possible links that could exist among a node’s neighbours. The clustering coefficient in a social network is higher than in other types of networks [13]. Figure 1b depicts average clus- tering coefficient with regard to degree. We can observe that low degree nodes demonstrate higher clustering coefficient which means that there is a significant clustering among them. On the other hand, as the number of neighbours in- creases clustering coefficient drops. These results are consistent with previous research on social networks [11]. Lastly, short paths in the network indicate that nodes are reachable through a small number of hops. The average path length of the network is 4.2 and is even shorter than the expected famous ”six degrees of separation” of Milgram’s experiment [10, 17]. This surprisingly low average path length is probably influ- enced by the bias of BFS towards the high degree nodes which tend to reduce distances in the network. Another characteristic of social networks is the largest shortest path, the so called diameter. The network has a diameter of 8 which again is low in comparison with other social networks [11] also probably due to the bias introduced by the crawling technique. The aforementioned characteris- tics skewed degree distribution, high clustering coefficient and short path lengths are typical social network properties and indicate that the Last.fm network has small world and scale free properties [17] as mentioned in the main analysis part (cf. Section 2).