A Multimodal Approach for Semantic Patent Image Retrieval

Kader Pustu-Iren, Gerrit Bruns, Ralph Ewerth∗
TIB – Leibniz Information Centre for Science and Technology, Hannover, Germany
kader.pustu@tib.eu, gerrit.bruns@tib.eu, ralph.ewerth@tib.eu

∗ Also with L3S Research Center, Leibniz University Hannover, Germany.

PatentSemTech, July 15th, 2021, online
© 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

ABSTRACT
Patent images such as technical drawings contain valuable information and are frequently used by experts to compare patents. However, current approaches to patent information retrieval are largely focused on textual information. Consequently, we review previous work on patent retrieval with a focus on illustrations in figures. In this paper, we report on work in progress for a novel approach for patent image retrieval that uses deep multimodal features. Scene text spotting and optical character recognition are employed to extract numerals from an image and subsequently identify references to corresponding sentences in the patent document. Furthermore, we use a neural state-of-the-art CLIP model to extract structural features from illustrations and additionally derive textual features from the related patent text using a sentence transformer model. To fuse our multimodal features for similarity search, we apply re-ranking according to averaged or maximum scores. In our experiments, we compare the impact of different modalities on the task of similarity search for patent images. The experimental results suggest that patent image retrieval can be successfully performed using the proposed feature sets, while the best results are achieved when combining the features of both modalities.

CCS CONCEPTS
• Information systems → Image search; Content analysis and feature selection; • Computing methodologies → Visual content-based indexing and retrieval; Image representations.

KEYWORDS
Patent Image Similarity Search, Deep Learning, Multimodal Feature Representations, Scene Text Spotting

1 INTRODUCTION
Patent experts and researchers often encounter language and terminology barriers when conducting searches to identify research or patent gaps, (newly) emerging technology developments, or to check the patentability of research results. Existing patent retrieval methods are primarily based on textual searches and largely exclude illustrations and the relationship between text and image. Often, however, the innovation of a patent can be identified with the help of an illustration, and patents with similar or related innovations can be quickly analysed by looking at illustrations in a comparative way. In this context, a survey with patent experts confirms the high informative value of illustrations and the demand for an image-based search [8]. Moreover, with the continuous refinement of already patented research, the terminology used changes [3], making it more difficult to find corresponding patents. This problem is exacerbated when cross-lingual searches are conducted. Therefore, illustrations provide an alternative way to identify relevant results in patents, regardless of language and terminology. The use of illustrations is also advantageous for domain- and patent-class-independent searches. In this way, intellectual property (IP) rights can be evaluated for further application domains, which is only possible to a limited extent with a purely textual search. This is especially relevant for basic and technical patents, whose scope of application is often not clear at the beginning of the creation of an exploitation strategy.

In this paper, we present a novel multimodal system for semantic patent image retrieval in a query-by-example scenario. To extract visually relevant features from images, pre-trained embeddings using deep neural networks are used. Furthermore, scene text spotting is applied in order to extract numerals from the images and map them to their mentions in the patent text. Next, we derive textual features from the relevant sentences in the text utilizing sentence transformers. Finally, textual and visual features are used to index the represented illustrations. Experimental results are presented for semantic image search investigating both unimodal and multimodal feature sets.

The rest of the paper is organized as follows. We review related work in Section 2. Section 3 introduces the proposed approach for multimodal patent image search. We provide an experimental evaluation of the proposed solution in Section 4 and conclude the paper with a short discussion of results in Section 5.

2 RELATED WORK
Previous approaches to patent information retrieval have been largely limited to textual information [19]. However, terminology in patents changes continuously due to the constant evolution of the presented content and is inconsistent for this reason [3]. Often, innovative terminology is "invented" along with the actual invention. One result of this evolution is that search results are often incomplete and do not display all relevant patents. The (additional) evaluation of non-textual information in the form of illustrations, such as technical drawings, graphs, and diagrams, can facilitate and significantly improve the search for similar or relevant patents. In addition, references to the relevant text passages are often given in numerical form in these illustrations, so that automatic recognition of these image-text references can also significantly improve the quality of the (multimodal) search results.

The more general problem of searching in image databases (image retrieval) has been intensively researched in the last decades. Simpler methods for search in image databases are usually based on so-called low-level features, which are technical descriptions of shape, color, or texture. However, results based on such features very often do not meet the search needs of users, which are mostly of a content-related or semantic nature ("semantic gap") [21]. In recent years, significant progress has been made in automatically recognizing content in images (denoted as object recognition or visual concept detection) [22], especially through deep learning approaches [5, 10, 30]. In this way, search queries of a content-related nature can be answered more accurately.

An important aspect of the presented approach is the similarity search that follows feature extraction. Current similarity search approaches learn compact codes to replace images [18, 27, 28]. The compact codes usually compress high-dimensional features of a Convolutional Neural Network (CNN) trained on specific datasets suitable for the given task. However, these methods are not optimized for the technical and schematic illustrations in patents, so there is a need for research and development in this area.

So far, there are relatively few specific approaches for searching visual information in patents [29]. Examples are the Patmedia method for similarity search [25], extensions of it [20, 23, 24], and other approaches for concept-based graphical search [11, 13]. These methods generally extract textual and visual low-level features from patent images and train detectors that identify a limited number of predefined concepts. Experiments in these works show that the combination of visual and textual features works best for the task of concept detection. More recent approaches [9, 14] establish the references between figures and related text passages using an automatic detection of the corresponding numerical referencing in the figures. Another approach [4] uses SIFT-like local histograms as features and represents the images in patents using Fisher vectors. In the experiments based on the 2011 CLEF-IP evaluation [15], the best retrieval results were achieved by late fusion of textual and non-textual results. Bhatti and Hanbury [3] provide an overview of further research regarding specific figure types (photos, flow charts, technical drawings, diagrams, graphs) that may also be relevant for patent retrieval. However, to date no integrated patent retrieval system exploiting multimodal search exists. The representation and quality of the images in patents, as well as their schematic and sketchy character, require specific approaches or the recognition of special objects that are particularly relevant in patents.

3 MULTIMODAL PATENT IMAGE SEARCH
We now discuss the proposed system that incorporates multimodal patent features to establish a similarity search based on illustrations. Figure 1 illustrates the individual steps. First, we extract visual and textual features (Sections 3.1 and 3.2) from the patent images. Then, based on each modality, an index of corresponding image feature vectors is built (Section 3.3). Finally, the most similar results to a query image can be retrieved by re-ranking results based on both indexes.

Figure 1: Proposed system for multimodal patent image retrieval.

3.1 Image Feature Extraction
Patent images are a special category of images that have sketch-like characteristics. They usually consist of technical drawings, diagrams, or graphs and are mostly black and white. While smaller details can often be of great relevance for interpretation, they often also contain redundant patterns. To represent these kinds of images, features are extracted using deep neural networks. We use the Contrastive Language-Image Pre-training (CLIP) [16] model that was trained on a multimodal dataset of 400 million image-text pairs collected from the internet. The CLIP model is aimed at learning visual concepts from natural language supervision and is primarily designed for flexible zero-shot computer vision classification on arbitrary image datasets by providing simple textual image descriptions. This powerful approach has improved the state of the art on several benchmark datasets including ImageNet Sketch [26], which contains sketch images with characteristics similar to patent images. This motivates us to utilize CLIP embeddings for the task of patent image similarity search. In particular, we use the pre-trained vision transformer (ViT-B/32) to extract visual features and embed the patent images.

3.2 Textual Feature Extraction
Patent figures usually contain image text, particularly numbers that can be used to link illustrated concepts to a description in the patent document. To use these textual descriptions, we first apply scene text spotting methods (Section 3.2.1). After relevant sentences have been identified, they are embedded using sentence transformers (Section 3.2.2).

3.2.1 Image-Text Relations using OCR. Optical Character Recognition (OCR) aims to recognize characters in images. Recently, scene text recognition methods based on neural networks have emerged. We use a two-step approach in which we first detect text blocks and then recognize the text they contain. For scene text detection, the CRAFT (Character Region Awareness For Text detection) [2] model for character-level text detection is applied. Subsequently, a four-stage deep scene text recognition (STR) framework [1] is employed to extract the text. Although these methods were trained for recognizing text in real-world scenes, they prove to be very accurate on patent images, for which text detection and recognition are generally easier than for scene text. Once the image text is extracted, we keep the numbers and prune irrelevant text. The numbers are then used to identify the relevant sentences in the XML file of the corresponding patent document. For this purpose, we tokenize the text, search for the numbers, and keep all matching sentences that provide a description for the illustrated concepts. Exemplary text mappings resulting from the scene text recognition can be seen in Figure 2.

Figure 2: Image-text relations through OCR.

3.2.2 Sentence Transformers. Sentence transformer neural networks were recently introduced and can be used to compute dense vector representations for sentences. We use a RoBERTa [12] model that was pre-trained to produce semantically meaningful sentence embeddings (following Sentence-BERT [17]) and optimized for semantic textual similarity (STS) in the English language. We embed all the sentences found in the previous image-text mapping step. Finally, an average vector over all related sentences is created to represent a patent image.

3.3 Similarity Search
Based on the extracted feature representations, indexes are built using the FAISS library [7]. An index is based on product quantization [6], allows for efficient comparisons between query vectors and stored vectors based on cosine similarity, and returns nearest neighbors. We built separate indexes for both the image and textual feature modalities based on a dataset comprised of 30,379 patent images. Subsequently, the nearest neighbors of a query image can be retrieved by similarity search based on a) the stored visual features, b) the stored textual features, or c) a combination of the ranking results of both indexes. For the last option we explore two different re-ranking approaches. The first one is based on averaging the resulting similarity scores of each modality, whereas in the second strategy the final ranking is based on reordering according to maximum scores.

4 EVALUATION AND DISCUSSION
In this section, the patent image retrieval approaches are evaluated according to the experimental setup (Section 4.2) and based on a predefined patent collection (Section 4.1). We discuss the outcomes of the experiments in Section 4.3.

4.1 Patent Dataset
We conduct our retrieval experiments on a patent collection from the European Patent Office (EPO) focusing on the exemplary fields of autonomous driving and wind power. To this end, we download patents from the time period 2007 to 2020 and ensure that each patent contains an XML file to parse the structured text and image information. After excluding formulas, our final patent collection comprises 2,858 patent documents with a total of 30,379 figures of technical drawings, diagrams, and graphs. Analogously, another 3,770 images from 300 patent documents are reserved as test data.

4.2 Experiments
The performance of our system is evaluated using the average precision (AP) score, which is the most commonly used quality measure for retrieval approaches. The AP score is calculated from a list of ranked documents as follows:

    AP = \sum_n (R_n - R_{n-1}) P_n    (1)

where R_n and P_n are the recall and precision at the n-th threshold. In general, AP is the average of the precision scores at each relevant document. To evaluate the overall performance, the mean AP (mAP) score is calculated by taking the mean value of the AP scores across different queries. To verify the performance of our system, we randomly selected 20 query images along with their descriptions (described in Section 3.2.1) from the test data and evaluated AP scores for the textual retrieval, the visual retrieval, and the combined retrieval based on re-ranking. To evaluate the relevance of retrieval results we rely on annotator assessment (done by one of the authors). Using the additional figure descriptions assists in evaluating the relevance of retrieval results to the query image. The ranked retrieval lists are evaluated for the top-50 ranks using the AP score (AP@50).

Table 1: mAP results up to rank 50 for randomly chosen queries. Re-ranking (avg) denotes the averaging of the different modalities' scores. Re-ranking (max) denotes the reordering according to maximum scores.

    Textual Features | Visual Features | Re-ranking (max) | Re-ranking (avg)
    0.683            | 0.696           | 0.703            | 0.715

4.3 Discussion
The results of our experiments are shown in Table 1. Using only visual features for image retrieval yields a slightly higher mAP score of 0.696 compared to using textual features. The combination of both modalities yields the highest mAP score of 0.715 when the scores of the textual and visual similarity search are averaged. Reordering the similarity values according to the maximum scores of both feature sets had a smaller effect on the similarity search performance. The results suggest that combining both modalities can help increase the quality of retrieval results. In general, results based on visual features were easier to annotate since the visual embeddings retrieve mostly visually similar results. It should also be noted that results based on textual features were harder to inspect and were thus annotated with the additional help of the sentences representing the retrieved image. In general, it was observed that textual features retrieved semantically relevant images. Thus, the combination of both feature representations presents a good mixture of both visually and semantically related patent images.

5 CONCLUSIONS
The discussion of related work for patent image retrieval revealed that existing work is either outdated or insufficient when it comes to exploiting the multimodal information that patents provide. In this paper, we have presented a framework that exploits multimodal features to enable semantic patent image search. Image-text relations are identified through scene text spotting and OCR, yielding a mapping of in-figure numbers to the corresponding text. This allowed us to embed relevant text passages in feature vector representations. Additionally, we successfully embedded the shape and topological information of images using powerful deep neural networks. We exploit both textual and image features in order to facilitate semantic similarity search for patent images. Experimental results demonstrated the feasibility of the approach, while suggesting that the combination of both modalities is beneficial.

In the future, we plan to exploit further information in images such as non-numeric image text. Moreover, we plan to incorporate multimodal information in an end-to-end network and have a joint framework to conduct patent search. Thereby, we intend to fuse features by exploiting multimodal machine learning architectures.

ACKNOWLEDGEMENTS
We would like to sincerely thank the reviewers for their valuable and comprehensive comments. This work is financially supported by the Federal Ministry of Education and Research (BMBF, Bundesministerium für Bildung und Forschung, project reference 01IO2004A).

REFERENCES
[1] Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han, Sangdoo Yun, Seong Joon Oh, and Hwalsuk Lee. 2019. What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis. In ICCV 2019. IEEE, 4714–4722. https://doi.org/10.1109/ICCV.2019.00481
[2] Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. 2019. Character Region Awareness for Text Detection. In CVPR 2019. Computer Vision Foundation / IEEE, 9365–9374. https://doi.org/10.1109/CVPR.2019.00959
[3] Naeem Bhatti and Allan Hanbury. 2013. Image search in patents: a review. International Journal on Document Analysis and Recognition 16, 4 (2013), 309–329. https://doi.org/10.1007/s10032-012-0197-5
[4] Gabriela Csurka, Jean-Michel Renders, and Guillaume Jacquet. 2011. XRCE's Participation at Patent Image Classification and Image-based Patent Retrieval Tasks of the CLEF-IP 2011. In CLEF 2011 Labs and Workshop, Notebook Papers (CEUR Workshop Proceedings, Vol. 1177). CEUR-WS.org. http://ceur-ws.org/Vol-1177/CLEF2011wn-CLEF-IP-CsurkaEt2011.pdf
[5] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. In CVPR 2017. IEEE Computer Society, 2261–2269. https://doi.org/10.1109/CVPR.2017.243
[6] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product Quantization for Nearest Neighbor Search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 1 (2011), 117–128. https://doi.org/10.1109/TPAMI.2010.57
[7] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. CoRR abs/1702.08734 (2017). http://arxiv.org/abs/1702.08734
[8] Hideo Joho, Leif Azzopardi, and Wim Vanderbauwhede. 2010. A survey of patent users: an analysis of tasks, behavior, search functionality and system requirements. In IIiX 2010. ACM, 13–24. https://doi.org/10.1145/1840784.1840789
[9] R. Kramer and U. Döring. 2016. CLEF-IP 2011: Tool zur Unterstützung der bildorientierten Selektion von Patentdokumenten am Beispiel des XPAT Patent Viewers. In Big Data – Chancen und Herausforderungen. 38. Kolloquium der Technischen Universität Ilmenau über Patentinformation und gewerblichen Rechtsschutz. PATINFO, 209–2019.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In NeurIPS 2012, 1106–1114. https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
[11] Dimitris Liparas, Anastasia Moumtzidou, Stefanos Vrochidis, and Ioannis Kompatsiaris. 2014. Concept-oriented labelling of patent images based on Random Forests and proximity-driven generation of synthetic data. In VL@COLING 2014. Dublin City University and ACL, 25–32. https://doi.org/10.3115/v1/W14-5404
[12] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019). http://arxiv.org/abs/1907.11692
[13] Hui Ni, Zhenhua Guo, and Biqing Huang. 2015. Binary Patent Image Retrieval Using the Hierarchical Oriented Gradient Histogram. In ICSS 2015. IEEE Computer Society, 23–27. https://doi.org/10.1109/ICSS.2015.12
[14] Jeong Beom Park, Thomas Mandl, and Do Wan Kim. 2017. Patent Document Similarity Based on Image Analysis Using the SIFT-Algorithm and OCR-Text. International Journal of Contents 13, 4 (2017), 70–79.
[15] Florina Piroi, Mihai Lupu, Allan Hanbury, and Veronika Zenz. 2011. CLEF-IP 2011: Retrieval in the Intellectual Property Domain. In CLEF 2011 Labs and Workshop, Notebook Papers (CEUR Workshop Proceedings, Vol. 1177). CEUR-WS.org. http://ceur-ws.org/Vol-1177/CLEF2011wn-CLEF-IP-PiroiEt2011.pdf
[16] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. CoRR abs/2103.00020 (2021). https://arxiv.org/abs/2103.00020
[17] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP-IJCNLP 2019. ACL, 3980–3990. https://doi.org/10.18653/v1/D19-1410
[18] Josiane Rodrigues, Marco Cristo, and Juan G. Colonna. 2020. Deep hashing for multi-label image retrieval: a survey. Artificial Intelligence Review 53, 7 (2020), 5261–5307.
[19] Walid Shalaby and Wlodek Zadrozny. 2019. Patent retrieval: a literature review. Knowledge and Information Systems 61, 2 (2019), 631–660. https://doi.org/10.1007/s10115-018-1322-7
[20] Panagiotis Sidiropoulos, Stefanos Vrochidis, and Ioannis Kompatsiaris. 2011. Content-based binary image retrieval using the adaptive hierarchical density histogram. Pattern Recognition 44, 4 (2011), 739–750. https://doi.org/10.1016/j.patcog.2010.09.014
[21] Arnold W. M. Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain. 2000. Content-based Image Retrieval at the End of the Early Years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 12 (2000), 1349–1380.
[22] Cees G. M. Snoek and Arnold W. M. Smeulders. 2010. Visual-Concept Search Solved? Computer 43, 6 (2010), 76–78. https://doi.org/10.1109/MC.2010.183
[23] Stefanos Vrochidis, Anastasia Moumtzidou, and Ioannis Kompatsiaris. 2012. Concept-based patent image retrieval. World Patent Information 34 (2012), 292–303.
[24] Stefanos Vrochidis, Anastasia Moumtzidou, and Ioannis Kompatsiaris. 2014. Enhancing Patent Search with Content-Based Image Retrieval. In Professional Search in the Modern World. Lecture Notes in Computer Science, Vol. 8830. Springer, 250–273. https://doi.org/10.1007/978-3-319-12511-4_12
[25] Stefanos Vrochidis, S. Papadopoulos, Anastasia Moumtzidou, Panagiotis Sidiropoulos, Emanuelle Pianta, and Ioannis Kompatsiaris. 2010. Towards content-based patent image retrieval: A framework perspective. World Patent Information 32 (2010), 94–106.
[26] Haohan Wang, Songwei Ge, Zachary C. Lipton, and Eric P. Xing. 2019. Learning Robust Global Representations by Penalizing Local Predictive Power. In NeurIPS 2019, 10506–10518. https://proceedings.neurips.cc/paper/2019/hash/3eefceb8087e964f89c2d59e8a249915-Abstract.html
[27] Jun Wang, Wei Liu, Sanjiv Kumar, and Shih-Fu Chang. 2015. Learning to hash for indexing big data – A survey. Proc. IEEE 104, 1 (2015), 34–57.
[28] Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et al. 2017. A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 4 (2017), 769–790.
[29] Liping Yang, Ming Gong, and Vijayan K. Asari. 2020. Diagram Image Retrieval and Analysis: Challenges and Opportunities. In CVPR Workshops 2020. IEEE, 685–698. https://doi.org/10.1109/CVPRW50498.2020.00098
[30] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. 2018. Learning Transferable Architectures for Scalable Image Recognition. In CVPR 2018. IEEE Computer Society, 8697–8710. https://doi.org/10.1109/CVPR.2018.00907
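To make the image-text mapping of Section 3.2.1 concrete, the following minimal Python sketch keeps the numeric OCR tokens of a figure and selects the patent sentences that mention them. It is an illustration, not the authors' implementation: the function name, the digit-only pruning rule, and the regex tokenization are our own assumptions.

```python
import re

def map_figure_to_sentences(ocr_tokens, sentences):
    """Keep sentences that mention any reference numeral spotted in the figure.

    ocr_tokens: raw strings recognized in the figure (numbers and other text).
    sentences:  sentences from the patent document's description.
    """
    # Prune non-numeric OCR output; only reference numerals are kept (Sec. 3.2.1).
    numerals = {t for t in ocr_tokens if t.isdigit()}
    matched = []
    for sentence in sentences:
        # Tokenize the sentence for numbers and keep it on any overlap.
        if set(re.findall(r"\d+", sentence)) & numerals:
            matched.append(sentence)
    return matched
```

For a figure whose OCR output is `["12", "Fig.", "7"]`, only sentences containing the numerals 12 or 7 would be retained for the subsequent sentence-embedding step.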
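Section 3.2.2 represents each patent image by the average of its related sentence embeddings. The averaging step itself is straightforward; a minimal sketch (the real system would average RoBERTa sentence embeddings rather than toy lists):

```python
def average_vector(embeddings):
    """Average per-sentence embedding vectors into one vector per figure (Sec. 3.2.2)."""
    dim = len(embeddings[0])
    n = len(embeddings)
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]
```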
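The index of Section 3.3 returns nearest neighbors by cosine similarity. As a sketch of that retrieval step, the brute-force stand-in below scores every stored vector against the query; it deliberately replaces the FAISS product-quantization index the paper uses, which approximates the same computation at scale.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_neighbors(query, index, k=3):
    """Return the k most similar image ids; 'index' maps image id -> feature vector."""
    scored = sorted(((cosine(query, vec), img_id) for img_id, vec in index.items()),
                    reverse=True)
    return [(img_id, round(score, 3)) for score, img_id in scored[:k]]
```

In the proposed system this lookup is performed once per modality, yielding two ranked lists that are then fused.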
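The two re-ranking strategies of Section 3.3 (averaging vs. maximum of the per-modality scores) can be sketched as follows; the dictionaries of per-image scores are an assumed interface, not the paper's actual data structures.

```python
def rerank(visual_scores, textual_scores, strategy="avg"):
    """Fuse per-image similarity scores from both modalities and re-rank (Sec. 3.3).

    visual_scores / textual_scores: dicts mapping image id -> similarity score.
    strategy: "avg" averages the two scores, "max" keeps the larger one.
    """
    fuse = (lambda a, b: (a + b) / 2) if strategy == "avg" else max
    combined = {i: fuse(visual_scores[i], textual_scores[i]) for i in visual_scores}
    # Final ranking: image ids sorted by fused score, best first.
    return sorted(combined, key=combined.get, reverse=True)
```

With scores visual = {a: 0.9, b: 0.5} and textual = {a: 0.2, b: 0.8}, averaging ranks b first while the maximum strategy ranks a first, which illustrates why the two fusions can disagree, as observed in Table 1.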
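Section 4.1 mentions parsing structured text from each patent's XML file. A minimal sketch of that step with Python's standard library is shown below; the tag names `patent-document`, `description`, and `p` are illustrative assumptions, not the actual EPO schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature patent document; real EPO XML uses its own schema.
SAMPLE = """<patent-document>
  <description>
    <p>The rotor blade 12 is attached to the hub 7.</p>
    <p>Wind speed is measured continuously.</p>
  </description>
</patent-document>"""

def extract_paragraphs(xml_text):
    """Collect the description paragraphs from a patent XML string."""
    root = ET.fromstring(xml_text)
    return [p.text for p in root.iter("p")]
```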
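For binary relevance judgements, Eq. (1) of Section 4.2 reduces to the average of the precision values at the ranks of the relevant results, which is how AP@50 and mAP can be computed. A sketch under that assumption (the helper names are ours):

```python
def average_precision(relevance, k=50):
    """AP over a ranked list of 0/1 relevance labels, truncated at rank k (Eq. 1).

    Assumes all relevant items for the query appear in the judged list.
    """
    relevance = relevance[:k]
    hits, precisions = 0, []
    for n, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / n)  # precision at each relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(runs, k=50):
    """mAP: mean of the per-query AP scores (Sec. 4.2)."""
    return sum(average_precision(r, k) for r in runs) / len(runs)
```

For a ranked list judged [relevant, irrelevant, relevant, irrelevant], AP is (1/1 + 2/3) / 2.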