IMGpedia: Enriching the Web of Data with Image Content Analysis

Sebastián Ferrada, Benjamin Bustos, and Aidan Hogan
Center for Semantic Web Research
Department of Computer Science, University of Chile
{sferrada,bebustos,ahogan}@dcc.uchile.cl

Abstract. Linked Data rarely takes into account multimedia content, which forms a central part of the Web. To explore the combination of Linked Data and multimedia, we are developing IMGpedia: we compute content-based descriptors for images used in Wikipedia articles and subsequently propose to link these descriptors with legacy encyclopaedic knowledge-bases such as DBpedia and Wikidata. On top of this extended knowledge-base, our goal is to provide a unified query system that accesses both the encyclopaedic data and the image data. We could also consider enhancing the encyclopaedic knowledge based on, for example, rules applied to entities co-occurring in images, or on content-based analysis. Abstracting away from IMGpedia, we explore generic methods by which the content of images on the Web can be described in a standard way and treated as a first-class citizen on the Web of Data, allowing, for example, structured queries to be combined with image similarity search. This short paper thus describes ongoing work on IMGpedia, with a focus on image descriptors.

1 Introduction

Wikipedia centres on the curation of human-readable encyclopaedia articles and, at the time of writing, contains about 38 million articles in almost 200 languages. In terms of structured content, the DBpedia project [3] systematically extracts meta-data from Wikipedia articles and presents it as RDF. Another initiative, called Wikidata [8] – organised by the Wikimedia Foundation – allows users to directly create and curate machine-readable encyclopaedic entries. In terms of multimedia content, Wikimedia Commons (https://commons.wikimedia.org) is a collection of 30 million freely-usable media files (image, audio, video) that are used within Wikipedia articles. Recently, DBpedia Commons [7] was released: a knowledge-base that takes the meta-data from each multimedia file's description page – such as the author, size and licensing of the media – and publishes it as RDF. However, none of the existing knowledge-bases in this space describes aspects of the multimedia content itself.

For this reason we are developing IMGpedia [1], a new knowledge-base that takes meta-data from DBpedia Commons and enriches it with visual descriptors of all images in the Wikimedia Commons dataset. The goal is to allow users to perform "visuo-semantic queries" over IMGpedia: e.g., retrieve a painting for every European painter from the 17th century, or, given a photo of yourself, obtain the top-k most similar portraits painted by an American artist. We could also infer new knowledge about DBpedia entities given the relations between the images: e.g., if two entities of type dbo:Person share an image (or appear in very similar images), we could infer, with some given likelihood, that they have met (a toy sketch of such a rule is given at the end of this section). Our general goal is to investigate methods by which descriptors of the content of images – not just their meta-data – can be integrated with the Web of Data in a standard way.

In previous work [1], we proposed the general goals of IMGpedia. This paper is a brief progress update: currently we are downloading the images from Wikimedia Commons, computing the visual descriptors for each image, and creating the image entities that will be linked with existing encyclopaedic knowledge-bases.
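To make the inference example above concrete, the following is a minimal sketch in Python, assuming a precomputed mapping from images to the dbo:Person entities they depict; all identifiers and data below are invented for illustration and do not reflect the actual IMGpedia pipeline.

```python
# Toy sketch: derive candidate "met" facts from co-occurrence of persons in
# images. Image and person identifiers are invented for illustration only.
from itertools import combinations

# Hypothetical mapping: image IRI -> dbo:Person entities depicted in it.
depicts = {
    "commons:Example_photo_1.jpg": {"dbr:Person_A", "dbr:Person_B"},
    "commons:Example_photo_2.jpg": {"dbr:Person_B", "dbr:Person_C"},
}

# Any two persons sharing an image become a (low-confidence) "met" candidate.
met_candidates = set()
for image, persons in depicts.items():
    for p, q in combinations(sorted(persons), 2):
        met_candidates.add((p, q, image))

for p, q, image in sorted(met_candidates):
    print(f"{p} may have met {q} (evidence: {image})")
```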
We thus propose a set of existing multimedia algorithms for extracting descriptors from images, describe their applications, and provide reference implementations in various programming languages so that they can be used as a standard way to (begin to) describe images on the Web of Data. We provide some initial results on computing descriptors for the images of Wikimedia Commons, we describe some of the technical engineering challenges faced while realising IMGpedia, and we outline future plans.

2 Reference Implementations for Visual Descriptors

Aside from simply providing the URL of an image, we wish to compute visual descriptors for each image. These descriptors are vectors produced by performing different operations over the matrix of pixels in order to capture certain features, such as color distribution, shapes and textures. The vectors are later stored as part of a metric space in which similarity search can be performed. We consider four different image descriptors, which we have implemented in Java, Python and C++; we have verified that the implementations give equivalent results across languages over a dataset of 2,800 Flickr images. The first three descriptors require a preprocessing step: the image is converted to greyscale using the intensity Y = 0.299 · R + 0.587 · G + 0.114 · B, where R, G and B are the values of the red, green and blue color channels, respectively. The four descriptors we compute are:

Gray Histogram Descriptor (GHD): the image is partitioned into a fixed number of blocks. For each block, a histogram of grey intensities (typically 8-bit values) is computed. The descriptor is the concatenation of all histograms (a minimal sketch is given at the end of this section).

Oriented Gradients Descriptor (HOG): the image gradient is calculated via convolution with Sobel kernels [6]. A threshold is then applied to the gradient, and for the pixels that exceed it, the orientation of the gradient is calculated. Finally, a histogram of these orientations is computed and used as the descriptor.

Edge Histogram Descriptor (EHD): for each 2 × 2 pixel block, the dominant edge orientation is computed (horizontal, vertical, either diagonal, or none); the descriptor is a histogram of these orientations [4].

DeCAF7: a Caffe neural network pre-trained on the ImageNet dataset [2] is used. To obtain the vector, each image is resized and ten overlapping patches are extracted; each patch is given as input to the neural network and the activations of the model's seventh layer are extracted as a descriptor, so the final vector is the average of these activations over all patches [5].

The implementations, along with their dependency and compilation documentation, can be found at https://github.com/scferrada/imgpedia under the GNU GPL license.
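As an illustration of the descriptor computations described above, the following is a minimal sketch of the Gray Histogram Descriptor in Python (using NumPy and Pillow). The block-grid size and histogram bin count are free parameters here; the reference implementations in the repository may differ in such details.

```python
# Minimal GHD sketch: greyscale conversion, fixed block grid, per-block
# histograms, concatenated into one vector. Parameters are illustrative.
import numpy as np
from PIL import Image

def gray_histogram_descriptor(path, grid=2, bins=8):
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64)
    # Greyscale intensity: Y = 0.299*R + 0.587*G + 0.114*B
    grey = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    h, w = grey.shape
    parts = []
    for i in range(grid):
        for j in range(grid):
            block = grey[i * h // grid:(i + 1) * h // grid,
                         j * w // grid:(j + 1) * w // grid]
            hist, _ = np.histogram(block, bins=bins, range=(0, 256))
            parts.append(hist / block.size)  # normalise by block size
    return np.concatenate(parts)  # descriptor length: grid * grid * bins
```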
3 Performance Evaluation of Descriptors

In order to build the knowledge-base, some preliminary steps must be completed. First of all, it is necessary to obtain a local copy of the Wikimedia Commons image dataset, so that we can keep IMGpedia updated in the future. To do so, Wikimedia provides an rsync server for downloading the ∼21 TB of images in the dataset. We have currently downloaded about ∼19 TB at an average speed of ∼550 GB per day, corresponding to ∼14 million images out of a total of ∼16 million. Each image is stored locally, along with its four descriptors.

We benchmarked the computation of the descriptors in order to estimate how long the full extraction will take. Experiments were performed on a machine running Debian 4.1.1 with a 24-core 2.2 GHz Intel Xeon processor and 120 GB of RAM. We computed the descriptors for a sub-folder of the Wikimedia Commons dataset containing 57,377 images. The process comprises three steps: read the image and load it into memory, extract the descriptor itself, and save the vector to disk. Two experiments were performed: one using a single execution thread, and one using 28 threads, where each thread handles one image at a time. The neural network for DeCAF7 was used in CPU mode (rather than GPU mode).

Table 1. Descriptor computation benchmark (average time per image, in ms).

Descriptor   Single thread   28 threads
GHD                  119.4         10.4
HOG                  477.7         25.7
EHD                  729.1         42.1
DeCAF7              3634.8       3809.7

Results are shown in Table 1, where we see that DeCAF7 is by far the most expensive descriptor to compute. While the first three descriptors improve by one order of magnitude with multithreading, DeCAF7 does not benefit: the Caffe implementation of neural networks already uses multithreading to process a single image, so assigning different images to different threads adds a slight overhead, since all cores are already being exploited. On the present hardware, we estimate that it would take ∼1.8 years to run DeCAF7 over all 16 million images; hence, we will have to select a subset of images on which to run this descriptor.

4 Next Steps

Linking IMGpedia with DBpedia, Wikidata, etc. To complete our knowledge-base and be able to infer new relations, we must combine the visual descriptors with the semantic entities of DBpedia and Wikidata. For this, we must extract the links between articles and the images they use, since these links are not present in DBpedia Commons and are present in DBpedia only for some images. This is an ongoing effort in which we are investigating the possible options for this task: we hope to exploit the Wikimedia dumps, either deriving this information directly or parsing it from the articles themselves. Subsequently, we wish to investigate descriptor-based user-defined functions in SPARQL to enable visuo-semantic queries that combine, e.g., image similarity search with knowledge-base queries.

Efficient algorithms for finding relations between similar images. To facilitate querying IMGpedia, we propose to precompute static relationships between images; e.g., if the distance between the descriptors of two images tends to 0, we can state that those images are nearCopy of each other. Other distance-based relations, such as similar, can be computed based on other fixed thresholds. The brute-force approach for finding all pairs of images within a given distance requires a quadratic number of comparisons. The challenge is thus to do this efficiently, where we propose to explore: building an index structure for similarity search over the dataset; using approximate similarity-search algorithms; using self-similarity join techniques; and so forth (a toy sketch of this pair-finding task is given below).
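The following Python sketch illustrates threshold-based pair finding with one of the indexing options mentioned above (a k-d tree, via SciPy); the data and threshold are invented for illustration. Note that tree-based indices degrade on high-dimensional descriptors, which is one motivation for also considering approximate methods.

```python
# Toy sketch: find all pairs of descriptor vectors within distance t.
# Data and threshold are invented; real descriptors are higher-dimensional.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(42)
vectors = rng.random((1000, 16))  # stand-in for image descriptor vectors
t = 0.4                           # distance threshold for, e.g., "similar"

# Index-based self-join: avoids explicitly comparing all ~n^2/2 pairs.
tree = cKDTree(vectors)
pairs = tree.query_pairs(r=t)     # index pairs with Euclidean distance <= t
print(f"{len(pairs)} candidate pairs within distance {t}")
```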
Labelling images for multimedia applications. By linking IMGpedia to existing knowledge-bases, we hope to label images with categories, types, entities, etc. IMGpedia could thus serve as a useful resource for the multimedia community: e.g., since we use the Caffe neural-network framework to compute DeCAF7 [2], we could train a network using the DBpedia Commons categories extracted for each image as labels, allowing us to automatically classify new images in the dataset or to benchmark performance against other classifiers. We could also label images with the specific DBpedia/Wikidata entities they contain, based on the article(s) in which each image appears. Here we would need to tread carefully since, e.g., an image used in an article about a person may not actually depict that person; however, we could consider heuristics such as the position of the image within the article, together with further content analysis.

5 Conclusions

In this short paper, we have given an update on our ongoing work on IMGpedia. We benchmarked four image descriptors, for which reference implementations have been made public. We discussed a number of open engineering challenges, as well as some open research questions that we would like to study once the knowledge-base is complete.

Acknowledgments. This work was supported by the Millennium Nucleus Center for Semantic Web Research, Grant NC120004, and by Fondecyt, Grant 11140900. We thank Camila Faundez for her assistance.

References

1. Bustos, B., Hogan, A.: IMGpedia: A proposal to enrich DBpedia with image meta-data. In: AMW. vol. 1378, pp. 35–39. CEUR (2015)
2. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531 (2013)
3. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C.: DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal (2014)
4. Manjunath, B.S., Ohm, J.R., Vasudevan, V.V., Yamada, A.: Color and texture descriptors. IEEE Trans. on Circuits and Systems for Video Tech. 11(6), 703–715 (2001)
5. Novak, D., Batko, M., Zezula, P.: Large-scale image retrieval using neural net descriptors. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 1039–1040. ACM (2015)
6. Sobel, I., Feldman, G.: A 3×3 isotropic gradient operator for image processing. Stanford Artificial Intelligence Project (1968)
7. Vaidya, G., Kontokostas, D., Knuth, M., Lehmann, J., Hellmann, S.: DBpedia Commons: Structured multimedia metadata from the Wikimedia Commons. In: The Semantic Web – ISWC 2015, pp. 281–289. Springer (2015)
8. Vrandečić, D., Krötzsch, M.: Wikidata: A free collaborative knowledgebase. Comm. ACM 57, 78–85 (2014)