Similarity Management of Data the DISA Experience Pavel Zezula FI DISA, Masaryk University, Brno, Czech Republic Abstract. As the current data is typically weekly structured or un- structured at all, access to objects is only possible through similarity of the object’s salient features or properties. Consequently, similarity ap- proach to searching is increasingly playing more and more important role in development of data processing applications. In the last twenty years, the technology has matured and many centralized, distributed, and even peer-to-peer architectures have been proposed. However, the use of simi- larity searching in numerous potential applications is still a challenge. In the talk, four research directions in developing similarity search applica- tions at Masaryk University DISA laboratory are to be discussed. First, we concentrate on accelerating large-scale face recognition applications and continue with generic image annotation task for retrieval purposes. In the second half, we focus on execution of similarity query streams and finish the talk with an ambition topic of content-based retrieval in human motion-capture data collections. Applications will be illustrated by online prototype implementations. 1 Introduction In the future, access to digital media stored in memories and circulating among networked computers will have to follow, or at least get much closer in its form, to the behavior of real life evolution and communication between species. There, recognition, learning, and judgment presuppose an ability to categorize stimuli and classify situations by similarity, because any event in the history of organ- ism is, in a sense, unique and the specific case of the exact/partial match is marginal. As practically any kind of fact can nowadays become a digital part of the networked media – whatever we see, say, measure, observe, test, or otherwise experience, is or at least can be in digital form – computers must provide access to required data through similarity based operations, because it is the similarity that is in the world “revealing”. Similarity in general is determined by stimuli materialized as extractions from digital objects in form of features ( descriptors, properties, etc.), and the way we Copyright c 2019 for the individual papers by the papers’ authors. Copying per- mitted for private and academic purposes. This volume is published and copyrighted by its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy. grade closeness of objects in user defined operations able to satisfy information needs. We can calibrate the stimuli and operations from two different points of view: (1) effectiveness concerns the way similarity is defined and the extent to which it reflex the human perception of the relation, and (2) efficiency regards the processing speed, costs, or an effort needed to get results independently from scale. Philosophical and psychological theorizing about similarity has been domi- nated by the geometric model for much of the twentieth century, see e.g. [22]. Though later criticized by several authors [23,7], it has become important paradigm for developing an extensible form of similarity search technology. The central as- sumption of this type of model is that the similarity can be related by a linear or monotonic decreasing function to inter-point distances in the metric space [14], that is, the larger the measure of similarity between two objects, the smaller the distance between the co-responding points in the metric space. Though the origins of the topic in computer science are older, the boom started in the 1990s with the M-tree [4] and resulted in many interesting scientific and technolog- ical achievements. The metric space paradigm extends the range of indexable similarity measures but at the same time loses the supportive advantage of co- ordinate systems to define partitioning of search spaces. The main advantage is that such approach is also able to consider data domains, which are not sortable – typical for a majority of contemporary digital data seen through their content descriptors. Since the similarity is in fact measured as a dissimilarity, specifically a distance, the applied techniques are often designated as distance searching. Several key publications summarize achievements in this area. The first sur- vey [3] includes results till the year 2000. It presents known approaches in original taxonomy with the objective to discover core properties that would allow com- bination of existing principles to form future better proposals. The second sur- vey [6] divides existing methods for handling similarity search into two classes. The first class directly indexes objects based on distances (distance-based in- dexing), while the second is based on mapping to a vector space (mapping- based approach). However, the main part of this article is dedicated to a survey of distance-based indexing methods, and the mapping-based methods are only outlined. In 2006, a book named Similarity Search: The Metric Space Approach [24] presented the state-of-the-art in developing index structures and supportive technologies for searching complex data modeled as instances of a metric space. The metric searching problems are also considered in the last edition of the en- cyclopedic book by Hanan Samet [15] called Foundations of Multidimensional and Metric Data Structures. 2 Similarity Search in Applications Though a lot of progress has been made [1,12] and several interesting similarity search demonstration prototypes, see Fig. 1 for illustration, are already available, [10,11], the fact still is that the extensibility property – one system used for many applications – of the metric space approach to similarity searching, has not yet been fully exploited [13]. Fig. 1. Image similarity search An application of similarity searching in several dimensions have been inves- tigated in the Data Intensive Systems and Applications (DISA) Laboratory of Masaryk University, Brno. The first objective is to speedup face retrieval in large collections of common photographs. They also develop image annotation systems and study mechanisms which should be applied for similarity searching in data streams. Finally, they consider a very complex data type, called motion capture data, to develop scalable similarity search, filtering and annotation/classification mechanisms. In the following, we shortly outline each of the activities. In the talk, the content-based image similarity search as well as its applications will be demonstrated by on-line demonstrations. Fig. 2. Face detection and similarity retrieval 2.1 Similarity Searching in Images of Human Faces Face recognition is a problem of verifying or identifying a face appearing in a given image. In our project, sketched in Fig. 2, we do not develop new face detection or comparison modules but rather demonstrate how a synergic col- laboration of existing systems combined with similarity searching can scale into the dimension of large image collections [19]. Specifically, by integrating three OpenCV, NeuroTech and Luxand similarity measures, we achieve high-quality and more stable results, compared to the measures evaluated independently. We also demonstrate, how effectiveness can significantly improve by employing the concept of multi-face queries along with optional relevance feedback. Finally, a metric indexing structure is applied to achieve scalability of the similarity search. 2.2 Image Annotation Database of Semantic resources annotated images (WordNet) Content-based image retrieval ConceptRank Similar annotated images Semantic graph d = 0.2 0 0 d = 0.5 d = 0.6 relatedTo person dandelion, flower, partOf nature relatedTo bloom, plant, herb, partOf isA yellow, taraxacum, 0 0.2 0.1 0.1 seasons, grow, flower meadow garden Mary colour, spring, grass, partOf Bloom Meadow, Mary’s photo, nature, petals, partOf partOf grassland, region, dandelion garden isA 0.3 0.2 flora, meadow bloom dandelion Initial candidate keywords Bloom (0.3), meadow (0.2), dandelion (0.2), Random walk for keyword score computation Mary (0.1), garden (0.1) Flower: 0 → 0.15 Mary: 0.1 → 0.001 Fig. 3. ConceptRank with search based image annotation The objective of image annotation is to associate binary images with descrip- tive metadata that would allow application of text search for content based image retrieval – metadata can also be used for image categorization. Such task has a long tradition in the machine learning field, which approaches the problem by training statistical models for prespecified set of categories. State-of-the-art clas- sification methods of this sort achieve very high accuracy, but their utilization is costly in terms of learning time and requires large amounts of reliably-labeled training data [21]. The proposed strategy, called the concept-rank [2], combines information provided by efficient and effective similarity search with semantic information obtained from linguistic resources. Specifically, we first select a set of initial candidate keywords from descriptions of similar already annotated images and give them a probability score proportional to their frequency and the simi- larity of the respective images to the annotated (query) image. Next, we search for links between these keywords using several semantic relationships defined by the WordNet lexical database, in particular we apply the hypernymy, hyponymy, meronymy, and holonymy. We also include new related keywords in considera- tion for annotation. After the identification of semantic relationships, we run a random-walk-based algorithm over the graph of candidate keywords and their relationships to determine the final probabilities of individual candidates. All the process is illustrated in Fig. 3. 2.3 Stream Processing The problem of fast executing streams of similarity queries is serious for a large number of applications. For example when annotating large collections of im- ages or when publish-subscribe systems consider incoming documents as queries and test them against user profiles to access the level of agreement between the profile and the processed document (image). The acceleration strategy is based on a pragmatic observation that in large collections the response time to pro- cess two random queries is significantly higher that the time needed to process two queries representing similar images [9]. The more similar the queries the shorter the query execution time because a larger portion of the database can be pre-cached from a previous query execution. Then by a clever reordering of queries so that the similarity of two consequent queries is as high as possible, the throughput can significantly be improved. The concept is illustrated in Fig. 4. Further performance improvements can be obtained by a parallel execution for example in cloud computer platforms [8]. Fig. 4. Similarity query stream processing 2.4 Similarity Searching in Motion Capture Data Motion capture data is a good example of complex unstructured data. This spatio-temporal data digitally represents human movements in form of 3D tra- jectories of tracked human body joints. With the recent advances and availabil- ity of motion capturing technologies, there is a strong requirement for intelligent management of such data, which has a great potential to be utilized in a number of applications. To make the content-based management of motion data possible, effective features need to be extracted from 3D skeleton sequences. We propose a 4,096- dimensional vector representation that preserves significant characteristics of original motion sequences [17]. As illustrated in Fig. 5, this representation nor- malizes and transforms an input sequence into a 2D motion image which is then processed by a convolutional neural network. The last hidden layer of the net- work is considered as the output feature. The extracted features demonstrate very convenient properties of being (1) of a fixed size, (2) efficiently compa- rable by the Euclidean distance, and (3) tolerant to a considerable degree of segmentation errors, which is particularly useful for sub-sequence matching. RGB CUBE DCNN normalize visualize POSE-BY-POSE extract <…, 0.53, 1.18, 9.64, …> MOTION NORMALIZED MOTION MOTION IMAGE 4,096D FEATURE VECTOR Fig. 5. Feature extraction from 3D human motion sequences. The extracted features can describe the content of relatively short skeleton sequences taking in order of seconds. This is mainly suitable for recognizing the class of semantic actions using k-nearest neighbor classifiers [20]. In case input sequences are long, the partitioning principle needs to be applied to identify short segments that are better describable by the features. We propose to partition the input sequence into segments of different sizes in a way that an arbitrary sub-sequence overlaps with at least one segment in the majority of frames [16]. This property enables to consider a short query as a single segment and perform efficient sub-sequence searching on a large scale [18]. A similar idea is also used for real-time annotation of pseudo-infinite skeleton sequences [5]. Acknowledgements. This research is supported by the Czech Science Foun- dation project No. GA19-02033S. References 1. Batko, M., Novak, D., Falchi, F., Zezula, P.: Scalability comparison of peer-to-peer similarity search structures. Future Generation Comp. Syst. 24(8), 834–848 (2008) 2. Budı́ková, P., Batko, M., Zezula, P.: Conceptrank for search-based image anno- tation. Multimedia Tools Appl. 77(7), 8847–8882 (2018), https://doi.org/10. 1007/s11042-017-4777-8 3. Chávez, E., Navarro, G., Baeza-Yates, R., Marroquı́n, J.: Searching in metric spaces. ACM Comput. Surv. 33(3), 273–321 (2001) 4. Ciaccia, P., Patella, M., Zezula, P.: M-tree: An efficient access method for similarity search in metric spaces. In: VLDB. pp. 426–435. Morgan Kaufmann (1997) 5. Elias, P., Sedmidubsky, J., Zezula, P.: A real-time annotation of motion data streams. In: 19th International Symposium on Multimedia. pp. 154–161. IEEE Computer Society (2017) 6. Hjaltason, G., Samet, H.: Index-driven similarity search in metric spaces. ACM Trans. Database Syst. 28(4), 517–580 (2003) 7. Krumhansl, C.: Concerning applicability of geometric models to similarity data: The interrelationship between similarity and spatial density. Psychological Review 85 pp. 445–461 (1978) 8. Nalepa, F., Batko, M., Zezula, P.: Model for performance analysis of distributed stream processing applications. In: Database and Expert Systems Applications - 26th International Conference, DEXA 2015, Valencia, Spain, September 1-4, 2015, Proceedings, Part II. pp. 520–533 (2015) 9. Nalepa, F., Batko, M., Zezula, P.: Enhancing similarity search throughput by dy- namic query reordering. In: Database and Expert Systems Applications - 27th International Conference, DEXA 2016, Porto, Portugal, September 5 – 8. p. 15 (2016) 10. Novak, D., Batko, M., Zezula, P.: Generic similarity search engine demonstrated by an image retrieval application. In: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Boston, MA, USA, July 19-23. p. 840 (2009) 11. Novak, D., Batko, M., Zezula, P.: Large-scale image retrieval using neural net descriptors. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, August 9-13, 2015. pp. 1039–1040 (2015) 12. Novak, D., Batko, M., Zezula, P.: Large-scale similarity data management with distributed metric index. Inf. Process. Manage. 48(5), 855–872 (2012) 13. Novak, D., Kyselak, M., Zezula, P.: On locality-sensitive indexing in generic metric spaces. In: Third International Workshop on Similarity Search and Applications, SISAP 2010, 18-19 September 2010, Istanbul, Turkey. pp. 59–66 (2010) 14. O’Searcoid, M.: Metric Spaces. Springer (2006) 15. Samet, H.: Foundations of Multidimensional And Metric Data Structures. Series in Data Management Systems, Morgan Kaufmann (2006) 16. Sedmidubsky, J., Elias, P., Zezula, P.: Similarity searching in long sequences of mo- tion capture data. In: 9th International Conference on Similarity Search and Ap- plications (SISAP). pp. 271–285. Springer International Publishing, Cham (2016) 17. Sedmidubsky, J., Elias, P., Zezula, P.: Effective and efficient similarity searching in motion capture data. Multimedia Tools and Applications 77(10), 12073–12094 (2018), https://doi.org/10.1007/s11042-017-4859-7 18. Sedmidubsky, J., Elias, P., Zezula, P.: Searching for variable-speed motions in long sequences of motion capture data. Information Systems 80, 148–158 (2019) 19. Sedmidubsky, J., Mic, V., Zezula, P.: 8th International Conference on Similarity Search and Applications (SISAP 2015), chap. Face Image Retrieval Revisited, pp. 204–216. Springer International Publishing (2015), http://dx.doi.org/10.1007/ 978-3-319-25087-8_19 20. Sedmidubsky, J., Zezula, P.: Probabilistic classification of skeleton sequences. In: 29th International Conference on Database and Expert Systems Applications (DEXA). pp. 50–65. Springer International Publishing, Cham (2018) 21. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015. pp. 1–9 (2015) 22. Torgerson, W.S.: Multidimensional scaling of similarity. Psychometrika 30 pp. 379– 393 (1965) 23. Tversky, A.: Features of similarity. Psychological Review 84 pp. 327–354 (1977) 24. Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity Search: The Metric Space Approach, Advances in Database Systems, vol. 32. Springer (2006)