Not-So-Linked Solution to the Linked Data Mining Challenge 2016

Jedrzej Potoniec
Institute of Computing Science, Poznan University of Technology
ul. Piotrowo 2, 60-965 Poznan, Poland
Jedrzej.Potoniec@cs.put.poznan.pl

Abstract. We present a solution to the Linked Data Mining Challenge 2016 that achieved 92.5% accuracy according to the submission system. The solution uses a hand-crafted dataset created by scraping various websites for reviews. We use logistic regression to learn a classification model, and we publish all our results on GitHub.

1 Introduction

As indicated on the challenge website, Linked Data Mining is a novel and challenging research area, mainly due to the large amount, variety and heterogeneity of the data. Unfortunately, there are also very basic, almost technical, problems with the data: they do not comply with the standards, many mistakes are introduced during extraction and transformation from the original format to Linked Data, and the websites publishing the data are frequently down [1]. Because of that, we decided to take another path in our solution. We did the extraction ourselves, thus obtaining a dataset well suited to the challenge, as described in Section 2. We performed normalization and applied the very popular logistic regression method to train a classification model, as described in Section 3.

Throughout the rest of the paper, we use the prefix dbr: for http://dbpedia.org/resource/ and dbp: for http://dbpedia.org/property/. The web scraping scripts, the created dataset, the machine learning process and the model are available on GitHub: https://github.com/jpotoniec/LDMC2016.

2 Datasets

2.1 Training and test data

We observed some irregularities and unexpected issues in the datasets provided by the challenge. For the album In Some Way, Shape, or Form by Four Year Strong the data pointed to the resource dbr:In_Some_Way,_Shape_or_Form. Unfortunately, in DBpedia [2] there are two corresponding resources, the other one being dbr:In_Some_Way,_Shape,_or_Form (note the second comma). The DBpedia website does some sort of redirection, so when visited with a web browser both URIs point to http://dbpedia.org/page/In_Some_Way,_Shape,_or_Form. Conversely, the SPARQL endpoint (http://dbpedia.org/sparql) treats both URIs separately, the first one occurring in 18 triples and the second one in 100 triples. For the artist St. Vincent there are two albums in the datasets: Strange Mercy in the training data and St. Vincent in the test data. Unfortunately, both have the same URI dbr:Strange_Mercy. We suspect there may be a few similar issues, as there are eight URIs that occur more than once in the datasets.

2.2 Linked datasets

In the beginning, we planned to extract features from DBpedia using Fr-ONT-Qu [4] from RMonto [6], a plugin to RapidMiner [5]. Unfortunately, the most promising feature discovered this way was the binary information whether an album has a review score from Pitchfork (http://pitchfork.com/) or not. After investigating, we discovered that during the extraction from Wikipedia to DBpedia the relation between a reviewing website and its review score had been lost. Consider the triples for the Strange Mercy album (http://dbpedia.org/page/Strange_Mercy): there are 11 triples with the property dbp:rev and a few triples with properties like dbp:rev10score, but one has absolutely no way to connect the scores to the reviewing websites. The very same problem occurs with the properties dbp:title (12 values) and dbp:length (11 values): it is impossible to determine the length of a particular track.
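This disconnect can be observed directly on the public endpoint. The snippet below is only an illustrative check, not part of our pipeline; it assumes the SPARQLWrapper Python package. It lists the dbp:rev and dbp:rev10score values of dbr:Strange_Mercy, which come back as two unrelated bags of literals with nothing linking a score to the website it came from.

from SPARQLWrapper import SPARQLWrapper, JSON

# Illustrative check against the public DBpedia SPARQL endpoint.
# It lists reviewing websites (dbp:rev) and review scores (dbp:rev10score)
# for the Strange Mercy album; the results carry no information about
# which score belongs to which website.
endpoint = SPARQLWrapper("http://dbpedia.org/sparql")
endpoint.setReturnFormat(JSON)
endpoint.setQuery("""
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbp: <http://dbpedia.org/property/>
    SELECT ?property ?value WHERE {
        dbr:Strange_Mercy ?property ?value .
        FILTER (?property IN (dbp:rev, dbp:rev10score))
    }
""")
for binding in endpoint.query().convert()["results"]["bindings"]:
    print(binding["property"]["value"], binding["value"]["value"])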
Due to lack of space, we present a detailed analysis in the supplementary material available on GitHub.

We considered using Yago [7], but it seemed to lack review information. We also tried to use DBTune (http://dbtune.org/), as suggested by the challenge website, but it turned out to be a dead end. For example, the MusicBrainz data, the most interesting dataset there, has been responding with Service Temporarily Unavailable for a very long time now.

2.3 Non-linked datasets

Instead of trying to use existing Linked Data, we decided to find data to solve the challenge ourselves and then make it available to the community. As the datasets for the challenge are quite small (1592 different URIs), we did some web scraping with Python and Scrapy (http://scrapy.org/) to obtain reviews of the considered albums.

We scraped Wikipedia to obtain reviews collected from various websites. This turned out to be a tedious process, as the reviews come in various formats, frequently with additional information (like a date or a URL of the review) or spelling mistakes. We normalized the scores to the range [0, 1] by dividing by the maximum in the case of reviews on scales from 0 to 10 or from 0 to 100, and by arbitrarily assigning numeric values to descriptive reviews (like favorable); a code sketch of this mapping is given in Section 3. We also applied some heuristics to normalize the names of reviewing websites, e.g. BBC and BBC Music. We strictly avoided using the Metacritic reviews available in Wikipedia: these reviews use the MC field of the Album ratings template, while we used only the fields starting with rev [10].

We collected the number of reviewers and the average score from Amazon (http://www.amazon.com/) by scraping the website using the titles and artists provided in the challenge datasets. We also used the Discogs (https://www.discogs.com/) API with the provided titles and artists to gather how many people own an album and how many want it. Finally, we used the data dumps provided by MusicBrainz (https://musicbrainz.org/doc/MusicBrainz_Database/Download) [8] and the Wikidata [9] identifiers available in the dumps and in DBpedia to obtain the number of users owning an album and its average score. The whole dataset consists of 94 numerical attributes.

3 Machine learning process and model

To build a classification model, we used a typical machine learning setup. We performed a Z-transformation on all attributes, that is, for every attribute we computed the average value µ and the standard deviation σ, and then replaced every value v of the attribute with (v − µ)/σ. This way all attributes have an average of 0 and a standard deviation of 1. Further, we replaced missing values with 0. Finally, we used logistic regression [3] to train the model. To estimate the performance of our solution we applied 10-fold cross-validation, which estimated the accuracy at 91.7±2.17%. This value is consistent with the 92.5% on the test set reported by the challenge submission system. The whole process has been implemented using RapidMiner 5 and is available on GitHub.
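The review-score normalization mentioned in Section 2.3 can be sketched as follows. This is a minimal illustration: the handled raw formats and the values assigned to descriptive ratings are assumptions made for this sketch, while the actual rules live in the scraping scripts on GitHub.

import re

# Illustrative normalization of a scraped review score to the range [0, 1].
# The formats and the values for descriptive ratings are assumptions.
DESCRIPTIVE = {"unfavorable": 0.25, "mixed": 0.5, "favorable": 0.75}

def normalize_score(raw):
    raw = raw.strip().lower()
    if raw in DESCRIPTIVE:                        # descriptive review
        return DESCRIPTIVE[raw]
    match = re.match(r"([\d.]+)\s*/\s*([\d.]+)", raw)
    if match:                                     # e.g. "8/10" or "83/100"
        value, scale = float(match.group(1)), float(match.group(2))
        return value / scale
    return None                                   # unrecognized format -> missing

print(normalize_score("8/10"))       # prints 0.8
print(normalize_score("favorable"))  # prints 0.75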
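The learning process itself was built in RapidMiner and is published on GitHub; the following scikit-learn code is only a rough, illustrative equivalent of that process. It assumes the 94 scraped attributes and a binary label are stored in a hypothetical features.csv file with a good_album column.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Rough scikit-learn equivalent of the RapidMiner process: Z-transformation,
# missing values replaced with 0, logistic regression, 10-fold cross-validation.
# The file name and the label column are assumptions made for this sketch.
data = pd.read_csv("features.csv")
y = data.pop("good_album")                 # binary label (1 = good, 0 = not good)
X = (data - data.mean()) / data.std()      # Z-transformation: mean 0, std 1
X = X.fillna(0)                            # missing values become 0 after scaling

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print("accuracy: %.1f%% +/- %.2f%%" % (100 * scores.mean(), 100 * scores.std()))

model.fit(X, y)                            # fit on all data to inspect coefficients
ranked = sorted(zip(X.columns, model.coef_[0]), key=lambda c: -abs(c[1]))
print(ranked[:6])                          # attributes with the largest coefficients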
An important part of logistic regression is the assignment of coefficients to the attributes of the input dataset. The values of these coefficients provide insight into which attributes are most important for the model. In our case, the highest absolute value of a coefficient is 2.859 and the lowest is 0.022. As all the attributes are on the same scale, this clearly shows that some of them are more important than others. There were six attributes with coefficients above 1; we present them in Table 1.

Table 1. The attributes having coefficients above 1 in the logistic regression model. These coefficients are all positive, which means that the higher the attribute value, the higher the probability of a given album being a good one.

attribute                                                                  coefficient
review score from Pitchfork (pitchfork.com)                                      2.859
review score from AllMusic (www.allmusic.com)                                    2.437
review score from Stylus (www.stylusmagazine.com)                                1.926
number of people owning an album according to Discogs (www.discogs.com)          1.465
review score from Entertainment Weekly (www.ew.com)                              1.274
review score from The Guardian (www.theguardian.com)                             1.096

Five of these attributes are review scores scraped from Wikipedia; only the attribute from Discogs came from another source. These attributes clearly indicate that the Metacritic reviews, from which the challenge labels were derived, are quite consistent with other sources of reviews. The attribute with the highest coefficient is the review score from Pitchfork, which is consistent with the most promising attribute discovered in the Linked Data, as mentioned in Section 2.2. The attribute from Discogs indicates how many people own an album and probably reflects a tendency of people to buy, and brag about, albums that have good reviews. The attribute with the lowest weight, −0.442, is the number of reviews of an album on Amazon. As Amazon is a shop, this probably reflects a tendency of people to complain about bad products and not to praise good ones.

4 Conclusions

Apparently, we are not there yet with the Semantic Web. In theory, most of the features we used were already available in the Linked Data. In practice, they were not. The issues with the Linked Data discussed in this paper clearly suggest that even a very simple and crude solution doing web scraping can easily outperform solutions based on the Linked Data.

The presented solution consists of 641 lines of Python code and correctly classifies 296 out of 320 test albums, which we find to be quite a good result given the small amount of time invested and the irregularities in the provided datasets. It is also worth noting that the baseline solution was able to classify correctly 222 test albums (69.375%), so our solution offers quite an improvement.

References

1. Beek, W., Rietveld, L., Bazoobandi, H.R., Wielemaker, J., Schlobach, S.: LOD laundromat: a uniform way of publishing other people's dirty data. In: The Semantic Web – ISWC 2014, pp. 213–228. Springer (2014)
2. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S.: DBpedia – a crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web 7(3), 154–165 (2009)
3. Cox, D.R.: The regression analysis of binary sequences. Journal of the Royal Statistical Society. Series B (Methodological) pp. 215–242 (1958)
4. Lawrynowicz, A., Potoniec, J.: Pattern based feature construction in semantic data mining. Int. J. on Sem. Web and Information Systems (IJSWIS) 10(1), 27–65 (2014)
5. Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T.: YALE: Rapid prototyping for complex data mining tasks. In: Proc. of the 12th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining. pp. 935–940 (2006)
6. Potoniec, J., Lawrynowicz, A.: RMonto: Ontological extension to RapidMiner. In: Poster and Demo Session of the 10th Int. Semantic Web Conf. (2011)
7. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: a core of semantic knowledge. In: Proc. of the 16th Int. Conf. on World Wide Web. pp. 697–706. ACM (2007)
8. Swartz, A.: MusicBrainz: A semantic web service. IEEE Intelligent Systems 17(1), 76–77 (2002)
9. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10), 78–85 (2014)
10. Wikipedia: Template:Album ratings, https://en.wikipedia.org/w/index.php?title=Template:Album_ratings&oldid=670493671, [Online; accessed 2016-03-07]