Creating and Using Sports Linked Data: Applications and Analytics

Panagiotis-Marios Philippides, OKF Greece, Thessaloniki, Greece, filippidis.okfgr@gmail.com
Charalampos Bratsas, Mathematics Department, Aristotle University of Thessaloniki, OKF Greece, charalampos.bratsas@okfn.com
Andreas Veglis, School of Journalism & Mass Communications, Aristotle University of Thessaloniki, Thessaloniki, Greece, veglis@jour.auth.gr
Evangelos Chondrokostas, Mathematics Department, Aristotle University of Thessaloniki, echondrok@gmail.com
Dimitra Tsigari, Mathematics Department, Aristotle University of Thessaloniki, dimitra.tsi@gmail.com
Ioannis Antoniou, Mathematics Department, Aristotle University of Thessaloniki, iantonio@math.auth.gr

ABSTRACT
Linked data have made significant progress over the last few years, and many kinds of datasets are being transformed into this format at a rapidly increasing rate, contributing to the openness, connectivity and re-use of web data. However, this progress has not reached a popular sport like basketball, at least as far as raw statistics are concerned. This kind of data contains valuable information that can be used by fans, teams and coaches, statisticians and other scientists. In this work, statistical data from Euroleague are transformed into linked data, thereby filling the relevant gap in the LOD Cloud, and ways of exploiting them are presented, from fascinating applications for fans, like the Euroleague Timeline, to cases of complex processing, analysis and visualization of data through software like R.

Categories and Subject Descriptors
[World Wide Web]: Web data description languages – Resource Description Framework (RDF)
[Information Retrieval]: Document representation – Document structure, Ontologies
[Probability and Statistics]: Statistical paradigms – Statistical Graphics, Exploratory Data Analysis

General Terms
Measurement, Design, Experimentation.

Keywords
Sports Open Data, Linked Data, Analytics, Data Visualizations

1. INTRODUCTION
Sports data can be valuable not only to anyone involved in sports, but also to the scientific community, because the statistical information they include can describe in detail what has happened in a sports game. Processing, modeling and visualization of these data can benefit areas such as sports analysis, either from the perspective of the players and their performance, or from the perspective of coaches and their tactics. They can also support inference about finding the best players and teams, detecting the sport's important elements, or predicting game situations, performance and results, in ways and with methods that can potentially then be used in other scientific fields.

A typical example is basketball, a sport full of statistics that can largely be represented through the boxscore, a table containing the performance in every statistical category for every player and team of a game. This amount of statistical data is large enough to allow very precise and detailed analytics about the sport. However, basketball data are not usually available in large quantities, at least in their raw form, and this leads researchers in the field to devote a very large part of their time to searching for these data, which are then not easy to share or to link with similar data. This creates the need to transform such data into linked data form, and this is the primary subject of this work, using statistics from Euroleague, the top European basketball competition for clubs.

The benefits of the semantic enrichment of these data are more than obvious, since the openness of large volumes of structured data is valuable not only to coaches, statisticians and other scientists of basketball, who could have easy and direct access to data relevant to their job, but also to a large audience such as basketball fans, who could make in-depth analyses of their favourite sport on their own. The statistical nature of these data increases the value of their openness, since they can be processed in many ways, by anyone interested, to lead to further inference about the sport. The linked data technologies themselves include means by which such information can be easily used and processed.

Besides the statistics of boxscores, additional data relating to the games of the competition, such as the court and the date on which they took place, have been transformed into linked data form too. This kind of information is essential and can link basketball data to other LOD datasets in many ways, so that further information about games, teams or players can be reached and retrieved. This connection complements the statistical information and enriches the provided knowledge, while a common way of modeling such data can benefit the comparison of similar data and increases their processing, analysis and visualization capabilities in favor of every stakeholder of the sport. Additionally, it provides a complete informational source for creating fascinating applications for basketball fans and in turn encourages initiatives to create and make use of linked data, thus benefiting the LOD Cloud itself with further data enrichment and linkage. One such application is the Euroleague Timeline, as presented below, while some examples of data analytics and visualizations of basketball data are introduced as the second subject of this work.

The entire system architecture is shown in Figure 1. The first stage is about data retrieval and processing, while the second stage involves the creation of linked data. Then, these data can be used in many ways, like the Euroleague Timeline or analytics and visualizations through software like R, while they are also linked with the LOD Cloud and especially DBpedia.

Figure 1. System Architecture

2. EUROLEAGUE LINKED DATA
The whole procedure of creating Euroleague linked data from raw statistics is presented in this section.

2.1 Creating the RDF Graphs
Initially, data were obtained from the official website of Euroleague through a handcrafted extraction process. Basic data of the games were stored directly in databases, while boxscore statistics were initially saved in text files to undergo the necessary processing. The cleaning and filtering of the extracted boxscores involved tasks such as inserting delimiter characters between the statistics, replacing the "team" value with the corresponding team name, filling empty values with zeros and splitting each shooting column (e.g. 10/11 2FG means 10 two-pointers made and 11 two-pointers attempted). After that, the statistics from the text files were stored in databases too. For each season of the competition a separate database has been created. The main tables of a database are the boxscores table, containing the statistical information, so that each row of the table is a statline of a player or a team in a Euroleague game, and the schedule table, containing the basic data of every game, such as time and date. Additional key tables have been created for the teams, the players, the phases of the competition, the groups of teams and the courts. The structure and data of all databases were then stored in a Virtuoso Server.

The first step in transforming Euroleague data into linked data was the creation of an ontology, under which the mapping of relational data to RDF would take place. The basic classes of the ontology correspond conceptually to the main database tables, like the Statline class, whose properties are similar to the columns of the boxscores table, namely statistics such as the points of the player or team in a game. The Statline class has additional properties, such as the game and the week in which the statline was recorded. The same holds for the Game class, containing properties such as the teams of a game, the final score, the date and time, the court and the week in which it took place. Other basic classes are the Phase class, containing the name of the phase and its starting and ending week, the Group class, containing teams and the phase of the competition, the Team class, with the names of teams, their players and their courts, the Player class, which contains the names of players and the teams they are part of, and the Court class, including the name and the geo-coordinates of the court. The whole ontology schema is illustrated in Figure 2.

Figure 2. Ontology Schema

Based mainly on this ontology, but also using additional ontologies such as foaf (http://xmlns.com/foaf/0.1/), skos (http://www.w3.org/2004/02/skos/core#), event (http://purl.org/NET/c4dm/event.owl#) and timeline (http://purl.org/NET/c4dm/timeline.owl#), the mapping of relational data to RDF was made, leveraging the quad map patterns of Virtuoso. A separate RDF graph has been created for each season of the competition.

Updating the data with new games and statistics requires performing almost the whole procedure again, only for the new data: namely, the data retrieval and processing tasks, the insertion into the database (although different PHP files have to be executed in order to update the database) and, finally, the re-creation of the year's RDF graph.

2.2 Datacube Integration
The next step is to integrate the Datacube vocabulary (http://purl.org/linked-data/cube#), the most appropriate ontology for statistical data. This process is a work in progress; the structure of the cube, however, has been largely defined. Most of the dimensions of the cube concern the game data (date, time, teams, court, etc.), but there is an extra dimension, the player dimension, referring to the subject of the statistics recorded in the game that is identified by the other dimensions. Since the statistics of the boxscore (points, rebounds, turnovers, etc.) have been defined as the measures of the cube, each observation is a statline of a player or a team, defined by the game in which it has been recorded. There are additional attributes in the cube that provide supplementary information, such as the player's team in a game, the number of his jersey and the unit of measure, which is defined separately for every statistical measure.

The dimensions of the cube may have as their range the classes of the ontology that has already been created and thus carry, in this way, its properties in addition to those of the cube, while some code lists have been created for specific dimensions such as the season, the week, the phase, the group and the time dimension. Meanwhile, some slices that are likely to be used refer to players, teams, weeks and games, with each slice containing the relevant information. Finally, all the above concepts have been defined as SKOS concepts, according to the Datacube specifications.

3. APPLICATIONS AND ANALYTICS
The Euroleague linked data that have been created can be exploited in many ways, such as applications and analytics, as presented below.

3.1 Euroleague Timeline
An application that makes the most of this work and all the advantages of linked data is the Euroleague Timeline, which is a timeline of the results of Euroleague games, containing information both on the basic data of the games and on their boxscores. The Euroleague Timeline involves two types of timelines: the team timeline, including all the games of a team in the competition, and the season timeline, containing all the games of a season. In either case, games are displayed as a time series and, after selecting a season or a team, the user can navigate through the games either consecutively or by selection, via the special time bar featured by the Timeline, which contains all the games of the season or the team.

The Euroleague Timeline is based on TimelineJS, which loads JSON files to get and display the information, so that was the file format in which data had to be extracted from Virtuoso. A separate JSON file has been created for every team and every season, and these files can be updated with new game data. The information stored in each JSON file and appearing in the timeline is retrieved through a series of queries, both on the SPARQL endpoint of the Virtuoso server containing the Euroleague data and on the DBpedia endpoint, to extract further information on players and teams. The final result contains all the relevant information (basic data, statistics, additional data from DBpedia) and is shown in Figure 3. The application is online at wiki.el.dbpedia.org/apps/Euroleague.

Figure 3. A game in the Euroleague Timeline

3.2 Data Visualizations
The plethora of statistical data that has been transformed into linked data is suitable for data processing and analytics, and this task was carried out through R Studio, which can make SPARQL queries to any endpoint through its packages. The retrieved data may then undergo any mathematical processing and visualization. Extra visualization capabilities are enabled through the R Shiny package, via its widgets, while the R Shiny Dashboard allows handling many interacting visualizations simultaneously, thus serving as a complete information visualization framework that can also use other applications, such as Google Vis, to display information.

Using this technology, some visualizations exploiting Euroleague linked data have been generated, providing useful information and insights for basketball fans or even coaches, such as:
• Table of rosters of teams for each season with the average statistics of the players
• Points and shots distributions of teams along with their shooting percentages
• Relations between turnovers and points and between fouls and points for the teams of a game, for all games of a season
• Graph of the results of teams in the competition
• Players comparison via diagrams, based on their average statistics
• Map containing the teams of each season and each phase of the competition, with additional information from DBpedia

3.3 Further Analytics
Besides these visualizations, Euroleague statistics are used for further mathematical processing and analysis in order to examine measures, ratings and relations that could yield useful results. Some examples already carried out on these data in this work are:
• Research on the teamwork of teams and its relation to their success, relying on categories like assists, points and turnovers
• Research on individual defensive actions and on relations between steals, blocks and fouls, along with predicting the number of steals and blocks of a player given his fouls and the court he plays at (home/away)
• Evaluating the best players per position on the basis of normalized equations of their statistics
• Creating and analyzing a network of players who have played in Euroleague
• Research on the correlation of the shooting percentages of the two teams of a game with the final result and their points difference

4. FUTURE WORK
A large volume of basketball data has been transformed into linked data; however, it could be further enriched, especially with the play-by-plays of games, which contain all the statistically recorded actions of the players in chronological order.
This would significantly increase the information processing, analysis and visualization capabilities. The examples produced so far in this work are only the beginning, and there are still countless topics in these data that can be explored, as well as many other ways of analysis. Their combination is the step forward and can lead to applications and results that will reveal and provide additional knowledge on basketball, which would be readily accessible to every fan through tools that leverage linked data.

5. CONCLUSION
Basketball linked data offer a variety of possibilities in the sports, statistical and scientific fields, because of their large volume of statistics. This work transforms Euroleague basketball data into linked data to enrich the LOD Cloud with valuable sports statistics and to use these data in various ways, such as creating fascinating applications like the Euroleague Timeline, or processing and analyzing these statistics through R to draw useful inferences and display the corresponding diagrams, thus demonstrating the enormous range of capabilities offered by linked data in a sport that is full of statistical information.

6. REFERENCES
[1] Bizer C., Heath T., and Berners-Lee T. 2009. Linked Data - the story so far. Int. J. Semantic Web Inf. Syst, 5(3):1-22
[2] Klyne G. and Carroll J. 2004. Resource Description Framework (RDF): Concepts and Abstract Syntax.
[3] Lehmann J., Bizer C., Kobilarov G., Auer S., Becker C., Cyganiak R., and Hellmann S. 2009. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3): 154-16
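
As an illustration of the boxscore cleaning tasks described in Section 2.1 (replacing the "team" value with the team name, filling empty values with zeros and splitting each shooting column into made and attempted shots), a minimal Python sketch is given here. It is not the processing code used in this work, which is not listed: the column names ("player", "PTS", "REB", "2FG", "3FG", "FT"), the function names and the representation of a row as a dictionary of strings are all assumptions made for the example.

```python
import re

# Hypothetical shooting columns holding "made/attempted" strings such as "10/11".
SHOOTING_COLUMNS = ("2FG", "3FG", "FT")


def split_shooting(value):
    """Split a 'made/attempted' string into two integers; empty or malformed -> (0, 0)."""
    match = re.fullmatch(r"(\d+)/(\d+)", value)
    if match is None:
        return 0, 0
    return int(match.group(1)), int(match.group(2))


def clean_statline(raw, team_name):
    """Clean one raw boxscore row (dict of strings); layout is assumed, not from the paper."""
    cleaned = {}
    for column, value in raw.items():
        value = (value or "").strip()
        if column == "player":
            # Team-total rows carry the placeholder "team"; rename it to the team name.
            cleaned[column] = team_name if value.lower() == "team" else value
        elif column in SHOOTING_COLUMNS:
            made, attempted = split_shooting(value)
            cleaned[column + "_made"] = made
            cleaned[column + "_attempted"] = attempted
        elif value == "":
            # Empty statistics are filled with zero.
            cleaned[column] = 0
        elif value.isdigit():
            cleaned[column] = int(value)
        else:
            # Leave any non-numeric value untouched.
            cleaned[column] = value
    return cleaned


raw_row = {"player": "team", "PTS": "78", "REB": "", "2FG": "10/11"}
print(clean_statline(raw_row, "Example Team"))
# {'player': 'Example Team', 'PTS': 78, 'REB': 0, '2FG_made': 10, '2FG_attempted': 11}
```

Splitting each shooting column into two integer fields lets made and attempted shots be stored as separate numeric columns, so that shooting percentages can be computed later.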