A Multimodal Approach for Semantic Patent Image Retrieval

Kader Pustu-Iren, Gerrit Bruns, Ralph Ewerth∗
TIB – Leibniz Information Centre for Science and Technology, Hannover, Germany
kader.pustu@tib.eu, gerrit.bruns@tib.eu, ralph.ewerth@tib.eu

ABSTRACT
Patent images such as technical drawings contain valuable information and are frequently used by experts to compare patents. However, current approaches to patent information retrieval are largely focused on textual information. Consequently, we review previous work on patent retrieval with a focus on illustrations in figures. In this paper, we report on work in progress for a novel approach for patent image retrieval that uses deep multimodal features. Scene text spotting and optical character recognition are employed to extract numerals from an image and subsequently identify references to corresponding sentences in the patent document. Furthermore, we use a neural state-of-the-art CLIP model to extract structural features from illustrations and additionally derive textual features from the related patent text using a sentence transformer model. To fuse our multimodal features for similarity search, we apply re-ranking according to averaged or maximum scores. In our experiments, we compare the impact of different modalities on the task of similarity search for patent images. The experimental results suggest that patent image retrieval can be successfully performed using the proposed feature sets, while the best results are achieved when combining the features of both modalities.

CCS CONCEPTS
• Information systems → Image search; Content analysis and feature selection; • Computing methodologies → Visual content-based indexing and retrieval; Image representations.

KEYWORDS
Patent Image Similarity Search, Deep Learning, Multimodal Feature Representations, Scene Text Spotting

1 INTRODUCTION
Patent experts and researchers often encounter language and terminology barriers when conducting searches to identify research or patent gaps, (newly) emerging technology developments, or to check the patentability of research results. Existing patent retrieval methods are primarily based on textual searches and largely exclude illustrations and the relationship between text and image. Often, however, the innovation of a patent can be identified with the help of an illustration, and patents with similar or related innovations can be quickly analysed by looking at illustrations in a comparative way. In this context, a survey with patent experts confirms the high informative value of illustrations and the demand for an image-based search [8]. Moreover, with the continuous refinement of already patented research, the terminology used changes [3], making it more difficult to find corresponding patents. This problem is exacerbated when cross-linguistic searches are conducted. Therefore, illustrations provide an alternative way to identify relevant results in patents, regardless of language and terminology. The use of illustrations is also advantageous for domain- and patent-class-independent searches. In this way, intellectual property (IP) rights can be evaluated for further application domains, which is only possible to a limited extent with a purely textual search. This is especially relevant for basic and technical patents, whose scope of application is often not clear at the beginning of the creation of an exploitation strategy.
   In this paper, we present a novel multimodal system for semantic patent image retrieval in a query-by-example scenario. To extract visually relevant features from images, pre-trained embeddings using deep neural networks are used. Furthermore, scene text spotting is applied in order to extract numerals from the images and map them to their mentions in the patent text. Next, we derive textual features from the relevant sentences in the text utilizing sentence transformers. Finally, textual and visual features are used to index the represented illustrations. Experimental results are presented for semantic image search investigating both unimodal and multimodal feature sets.
   The rest of the paper is organized as follows. We review related work in Section 2. Section 3 introduces the proposed approach for multimodal patent image search. We provide an experimental evaluation of the proposed solution in Section 4 and conclude the paper with a short discussion of results in Section 5.

∗ Also with L3S Research Center, Leibniz University Hannover, Germany.

PatentSemTech, July 15th, 2021, online
© 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

2 RELATED WORK
Previous approaches to patent information retrieval have been largely limited to textual information [19]. However, terminology in patents changes continuously due to the constant evolution of the presented content and is inconsistent for this reason [3]. Often, innovative terminology is "invented" along with the actual invention. One result of this evolution is that search results are often incomplete and do not display all relevant patents. The (additional) evaluation of non-textual information in the form of illustrations, such as technical drawings, graphs and diagrams, can facilitate and significantly improve the search for similar or relevant patents. In addition, references to the relevant text passages are often given in numerical form in these illustrations, so that automatic recognition of these image-text references can also significantly improve the quality of the (multimodal) search results.
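Such numeric image-text references can be illustrated with a minimal, hypothetical sketch: numerals spotted in a figure are matched against sentences of the patent text that mention them. The sentence splitting, token filtering, and toy data below are simplifications for illustration only, not a description of any existing system.

```python
import re

def map_numerals_to_sentences(ocr_tokens, patent_text):
    """Map numerals spotted in a figure to sentences mentioning them."""
    # Keep only numeric tokens from the OCR output, prune the rest.
    numerals = {t for t in ocr_tokens if t.isdigit()}
    # Naive sentence split; a real system would parse structured patent text.
    sentences = re.split(r"(?<=[.!?])\s+", patent_text)
    mapping = {}
    for n in numerals:
        # A sentence matches if it contains the reference number as a token.
        hits = [s for s in sentences if n in re.findall(r"\d+", s)]
        if hits:
            mapping[n] = hits
    return mapping

ocr = ["12", "Fig.", "3", "valve"]
text = ("The housing 12 encloses the rotor. "
        "A sensor 3 is mounted on the housing 12. "
        "The controller is not shown.")
print(map_numerals_to_sentences(ocr, text))
```

Here the reference number 12 is linked to two sentences and number 3 to one, while non-numeric OCR tokens are discarded.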
                                                                                     45
PatentSemTech, July 15th, 2021, online                                                                    Kader Pustu-Iren, Gerrit Bruns, and Ralph Ewerth




                                         Figure 1: Proposed system for multimodal patent image retrieval.
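The indexing stage of the pipeline in Figure 1 stores feature vectors and retrieves nearest neighbors by cosine similarity. The following minimal sketch uses a brute-force scan over plain Python lists in place of a FAISS product-quantization index; the function names and toy data are illustrative assumptions, not the paper's implementation.

```python
from math import sqrt

def normalize(vec):
    # Scale a vector to unit length so dot products equal cosine similarity.
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def build_index(features):
    # Store L2-normalized copies of all feature vectors.
    return [normalize(v) for v in features]

def search(index, query, k=3):
    # Exhaustive scan; FAISS replaces this with quantized approximate search.
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(v, q)) for v in index]
    ranked = sorted(range(len(index)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]

# Toy "embeddings" of four stored patent figures.
index = build_index([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
print(search(index, [2.0, 0.0], k=2))  # item 0 is an exact directional match
```

Because stored vectors are normalized, the query's scale does not matter; only the direction of the embedding determines the ranking.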


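The final re-ranking step of the pipeline fuses per-modality similarity scores either by averaging or by taking the maximum. A sketch of both strategies follows; how the system normalizes scores or handles items present in only one index is not specified in the paper, so the choices below are assumptions.

```python
def rerank(visual_scores, textual_scores, strategy="avg"):
    # Fuse per-image cosine similarities from the visual and textual indexes.
    fused = {}
    for img in set(visual_scores) | set(textual_scores):
        v = visual_scores.get(img, 0.0)  # assumption: missing items score 0.0
        t = textual_scores.get(img, 0.0)
        fused[img] = (v + t) / 2.0 if strategy == "avg" else max(v, t)
    # Sort image ids by fused score, best first.
    return sorted(fused, key=fused.get, reverse=True)

visual = {"figA": 0.9, "figB": 0.6, "figC": 0.3}
textual = {"figA": 0.4, "figB": 0.8, "figC": 0.7}
print(rerank(visual, textual, "avg"))  # → ['figB', 'figA', 'figC']
print(rerank(visual, textual, "max"))  # → ['figA', 'figB', 'figC']
```

Note that the two strategies can produce different orderings: averaging rewards images that score well in both modalities, while the maximum lets a single strong modality dominate.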
   The more general problem of searching in image databases (image retrieval) has been intensively researched in the last decades. Simpler methods for search in image databases are usually based on so-called low-level features, i.e., technical descriptions of shape, color, or texture. However, results based on such features very often do not meet the search needs of users, which are mostly of a content-related or semantic nature ("semantic gap") [21]. In recent years, significant progress has been made in automatically recognizing content in images (denoted as object recognition or visual concept detection) [22], especially through deep learning approaches [5, 10, 30]. In this way, search queries of a content-related nature can be answered more accurately.
   An important aspect of the presented approach is the similarity search that follows feature extraction. Current similarity search approaches learn compact codes to replace images [18, 27, 28]. The compact codes usually compress high-dimensional features of a Convolutional Neural Network (CNN) trained on specific datasets suitable for the given task. However, these methods are not optimized for the technical and schematic illustrations in patents, so there is a need for research and development in this area.
   So far, there are relatively few specific approaches for searching visual information in patents [29]. An example is the Patmedia method for similarity search [25], extensions of it [20, 23, 24], or other approaches for concept-based graphical search [11, 13]. These methods generally extract textual and visual low-level features from patent images and train detectors that identify a limited number of predefined concepts. Experiments in these works show that the combination of visual and textual features works best for the task of concept detection. More recent approaches [9, 14] establish the references between figures and related text passages using automatic detection of the corresponding numerical referencing in the figures. Another approach [4] uses SIFT-like local histograms as features and represents the images in patents using Fisher vectors. In the experiments based on the 2011 CLEF-IP evaluation [15], the best retrieval results were achieved by late fusion of textual and non-textual results. Bhatti and Hanbury [3] provide an overview of further research regarding specific figure types (photos, flow charts, technical drawings, diagrams, graphs) that may also be relevant for patent retrieval. However, to date no integrated patent retrieval system exploiting multimodal search exists. The representation and quality of the images in patents as well as their schematic and sketchy character require specific approaches or the recognition of special objects that are particularly relevant in patents.

3 MULTIMODAL PATENT IMAGE SEARCH
We now discuss the proposed system that incorporates multimodal patent features to establish a similarity search based on illustrations. Figure 1 illustrates the individual steps. First, we extract visual and textual features (Sections 3.1 and 3.2) from the patent images. Then, based on each modality, an index of corresponding image feature vectors is built (Section 3.3). Finally, the most similar results to a query image can be retrieved by re-ranking results based on both indexes.

3.1 Image Feature Extraction
Patent images are a special category of images that have sketch-like characteristics. They usually consist of technical drawings, diagrams, or graphs and are mostly black and white. While smaller details can often be of great relevance for interpretation, they often also contain redundant patterns. To represent these kinds of images, features are extracted using deep neural networks. We use the Contrastive Language-Image Pre-training (CLIP) [16] model that was trained on a multimodal dataset of 400 million image-text pairs collected from the internet. The CLIP model is aimed at learning visual concepts from natural language supervision and is primarily designed for flexible zero-shot computer vision classification on arbitrary image datasets by providing simple textual image descriptions. This powerful approach has improved the state of the art on several benchmark datasets including ImageNet Sketch [26],
which contains sketch images with characteristics similar to patent images. This motivates us to utilize CLIP embeddings for the task of patent image similarity search. In particular, we use the pre-trained vision transformer (ViT-B/32) to extract visual features and embed the patent images.

3.2 Textual Feature Extraction
Patent figures usually contain image text, particularly numbers that can be used to link illustrated concepts to a description in the patent document. To use these textual descriptions, we first apply scene text spotting methods (Section 3.2.1). After relevant sentences have been identified, they are embedded using sentence transformers (Section 3.2.2).

3.2.1 Image-Text Relations using OCR. Optical Character Recognition (OCR) aims to recognize characters in images. Recently, scene text recognition methods based on neural networks have emerged. We use a two-step approach in which we first detect text blocks and then recognize the text they contain. For scene text detection, the CRAFT (Character Region Awareness For Text detection) [2] model for character-level text detection is applied. Subsequently, a four-stage deep scene text recognition (STR) framework [1] is employed to extract the text. Although these methods were trained for recognizing text in real-world scenes, they prove to be very accurate on patent images, for which text detection and recognition is generally easier than for scene text. Once the image text is extracted, we keep the numbers and prune irrelevant text. The numbers are then used to identify the relevant sentences in the XML file of the corresponding patent document. For this purpose, we tokenize the text, search for the numbers, and keep all matching sentences that provide a description for the illustrated concepts. Exemplary text mappings resulting from the scene text recognition can be seen in Figure 2.

Figure 2: Image-text relations through OCR.

3.2.2 Sentence Transformers. Sentence transformer neural networks were recently introduced and can be used to compute dense vector representations for sentences. We use a RoBERTa [12] model that was pre-trained to produce semantically meaningful sentence embeddings (according to Sentence-BERT [17]) and optimized for semantic textual similarity (STS) in the English language. We embed all the sentences found in the previous image-text mapping step. Finally, an average vector over all related sentences is created to represent a patent image.

3.3 Similarity Search
Based on the extracted feature representations, indexes are built using the FAISS library [7]. An index is based on product quantization [6] and allows for efficient comparisons between query vectors and stored vectors based on cosine similarity, returning nearest neighbors. We built separate indexes for both the image and textual feature modalities based on a dataset comprised of 30,379 patent images. Subsequently, the nearest neighbors of a query image can be retrieved by similarity search based a) on the stored visual features, b) on the stored textual features, or c) on a combination of the ranking results of both indexes. For the last option, we explore two different re-ranking approaches. The first one is based on averaging the resulting similarity scores of each modality, whereas in the second strategy the final ranking is based on reordering according to maximum scores.

4 EVALUATION AND DISCUSSION
In this section, the patent image retrieval approaches are evaluated according to the experimental setup in Section 4.2 and based on a predefined patent collection (Section 4.1). We discuss outcomes of the experiments in Section 4.3.

4.1 Patent Dataset
We conduct our retrieval experiments on a patent collection from the European Patent Office (EPO) focusing on the exemplary fields of autonomous driving and wind power. To this end, we download patents from the time period 2007 to 2020 and ensure that each patent contains an XML file to parse the structured text and image information. After excluding formulas, our final patent collection comprises 2,858 patent documents with a total of 30,379 figures of technical drawings, diagrams and graphs. Analogously, another 3,770 images from 300 patent documents are reserved as test data.

4.2 Experiments
The performance of our system is evaluated using the average precision (AP) score, which is the most commonly used quality measure for retrieval approaches. The AP score is calculated from a list of ranked documents as follows:

    AP = \sum_n (R_n - R_{n-1}) P_n    (1)

where P_n and R_n are the precision and recall at the n-th threshold. In general, AP is the average of the precision scores at each relevant document. To evaluate the overall performance, the mean AP (mAP) score is calculated by taking the mean value of the AP scores across different queries. To verify the performance of our system, we randomly selected 20 query images along with their descriptions (described in Section 3.2.1) from the test data and evaluated
Table 1: mAP results up to rank 50 for randomly chosen                        framework to conduct patent search. Thereby, we intent to fuse
queries. Re-ranking (avg) denotes the averaging of the dif-                   features by exploiting multimodal machine learning architectures.
ferent modalities’ scores. Re-ranking (max) denotes the re-
ordering according to maximum scores.                                         ACKNOWLEDGEMENTS
                                                                              We would like to sincerely thank the reviewers for their valuable and
                                                   Re-ranking
         Textual Features        Visual Features                              comprehensive comments. This work is financially supported by
                                                   max      avg               the Federal Ministry of Education and Research (BMBF, Bundesmin-
                0.683                    0.696     0.703   0.715              isterium für Bildung und Forschung, project reference 01IO2004A).

                                                                              REFERENCES
AP scores for the textual retrieval, visual retrieval and combined             [1] Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han,
                                                                                   Sangdoo Yun, Seong Joon Oh, and Hwalsuk Lee. 2019. What Is Wrong With
retrieval based on re-ranking. To evaluate the relevance of an re-                 Scene Text Recognition Model Comparisons? Dataset and Model Analysis. In
trieval results we rely on the annotator assessment (done by one                   2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul,
                                                                                   Korea (South), October 27 - November 2, 2019. IEEE, 4714–4722. https://doi.org/
of the authors). Using the additional figure descriptions assists in               10.1109/ICCV.2019.00481
evaluating the relevance of retrieval results to the query image. The          [2] Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. 2019.
ranked retrieval lists are evaluated for the top-50 ranks using the                Character Region Awareness for Text Detection. In IEEE Conference on Computer
                                                                                   Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019.
AP score (AP@50).                                                                  Computer Vision Foundation / IEEE, 9365–9374. https://doi.org/10.1109/CVPR.
                                                                                   2019.00959
                                                                               [3] Naeem Bhatti and Allan Hanbury. 2013. Image search in patents: a review.
4.3     Discussion                                                                 International Journal on Document Analysis and Recognition 16, 4 (2013), 309–329.
The results of our experiments are shown in Table 1. Using only vi-                https://doi.org/10.1007/s10032-012-0197-5
                                                                               [4] Gabriela Csurka, Jean-Michel Renders, and Guillaume Jacquet. 2011. XRCE’s
sual features for image retrieval yields a slightly higher mAP score               Participation at Patent Image Classification and Image-based Patent Retrieval
of 0.696 compared to using textual features. The combination of                    Tasks of the Clef-IP 2011. In CLEF 2011 Labs and Workshop, Notebook Papers,
both modalities yields the highest mAP score of 0.715 when scores                  19-22 September 2011, Amsterdam, The Netherlands (CEUR Workshop Proceedings,
                                                                                   Vol. 1177), Vivien Petras, Pamela Forner, and Paul D. Clough (Eds.). CEUR-WS.org.
of the textual and visual similarity search are averaged. Reordering               http://ceur-ws.org/Vol-1177/CLEF2011wn-CLEF-IP-CsurkaEt2011.pdf
the similarity values according to the maximum scores for both fea-            [5] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger.
                                                                                   2017. Densely Connected Convolutional Networks. In 2017 IEEE Conference on
ture sets had a smaller effect on the similarity search performance.               Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26,
The results suggest that combining both modalities can help in-                    2017. IEEE Computer Society, 2261–2269. https://doi.org/10.1109/CVPR.2017.243
crease the quality of retrieval results. In general, results based on          [6] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product Quantization
                                                                                   for Nearest Neighbor Search. IEEE Transactions on Pattern Analysis and Machine
visual features were easier to annotate since the visual embeddings                Intelligence 33, 1 (2011), 117–128. https://doi.org/10.1109/TPAMI.2010.57
retrieve mostly visually similar results. It should also be noted that         [7] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity
results based on textual features were harder to inspect and thus                  search with GPUs. CoRR abs/1702.08734 (2017). arXiv:1702.08734 http://arxiv.
                                                                                   org/abs/1702.08734
annotated with the additional help of the sentences representing               [8] Hideo Joho, Leif Azzopardi, and Wim Vanderbauwhede. 2010. A survey of
the retrieved image. In general, it was observed that textual fea-                 patent users: an analysis of tasks, behavior, search functionality and system
                                                                                   requirements. In Information Interaction in Context Symposium, IIiX 2010, New
tures retrieved semantically relevant images. Thus, the combination                Brunswick, NJ, USA, August 18-21, 2010, Nicholas J. Belkin and Diane Kelly (Eds.).
of both feature representations presents a good mixture of both                    ACM, 13–24. https://doi.org/10.1145/1840784.1840789
visually and semantically related patent images.                               [9] R. Kramer and U. Döring. 2016. CLEF-IP 2011: Tool zur Unterstützung der bil-
                                                                                   dorientierten Selektion von Patentdokumenten am Beispiel des XPAT Patent
                                                                                   Viewers. In Big Data - Chancen und Herausforderungen. 38. Kolloquium der Tech-
5     CONCLUSIONS                                                                  nischen Universität Ilmenau über Patentinformation und gewerblichen Rechtsschutz.
                                                                                   Proceedings. PATINFO. 209–2019.
The discussion of related work for patent image retrieval revealed            [10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet
that existing work is either outdated or insufficient when it comes                Classification with Deep Convolutional Neural Networks. In Advances in
                                                                                   Neural Information Processing Systems 25: 26th Annual Conference on Neural
to exploiting the multimodal information that patents provide. In                  Information Processing Systems 2012. Proceedings of a meeting held Decem-
this paper, we have presented a framework that exploits multimodal                 ber 3-6, 2012, Lake Tahoe, Nevada, United States, Peter L. Bartlett, Fernando
                                                                                   C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Wein-
features to enable semantic patent image search. Image-text rela-                  berger (Eds.). 1106–1114.       https://proceedings.neurips.cc/paper/2012/hash/
tions are identified through scene text spotting and OCR, yielding a mapping of in-figure numbers to the corresponding text. This allowed us to embed relevant text passages in feature vector representations. Additionally, we successfully embedded the shape and topological information of images using powerful deep neural networks. We exploited both textual and image features to facilitate semantic similarity search for patent images. Experimental results demonstrated the feasibility of the approach, while suggesting that the combination of both modalities is beneficial.
    In the future, we plan to exploit further information in images such as non-numeric image text. Moreover, we plan to incorporate multimodal information in an end-to-end network and have a joint
     c399862d3b9d6b76c8436e924a68c45b-Abstract.html
[11] Dimitris Liparas, Anastasia Moumtzidou, Stefanos Vrochidis, and Ioannis Kompatsiaris. 2014. Concept-oriented labelling of patent images based on Random Forests
     and proximity-driven generation of synthetic data. In Proceedings of the Third Workshop on Vision and Language, VL@COLING 2014, Dublin, Ireland, August 23,
     2014, Anja Belz, Darren Cosker, Frank Keller, William Smith, Kalina Bontcheva, Sien Moens, and Alan F. Smeaton (Eds.). Dublin City University and the Association
     for Computational Linguistics, 25–32. https://doi.org/10.3115/v1/W14-5404
[12] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A
     Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019). arXiv:1907.11692 http://arxiv.org/abs/1907.11692
[13] Hui Ni, Zhenhua Guo, and Biqing Huang. 2015. Binary Patent Image Retrieval Using the Hierarchical Oriented Gradient Histogram. In International Conference
     on Service Science, ICSS 2015, Weihai, Shandong, China, May 8-9, 2015. IEEE Computer Society, 23–27. https://doi.org/10.1109/ICSS.2015.12
A Multimodal Approach for Semantic Patent Image Retrieval                                                                                    PatentSemTech, July 15th, 2021, online


[14] Jeong Beom Park, Thomas Mandl, and Do Wan Kim. 2017. Patent Document Similarity Based on Image Analysis Using the SIFT-Algorithm and OCR-Text.
     International Journal of Contents 13, 4 (2017), 70–79.
[15] Florina Piroi, Mihai Lupu, Allan Hanbury, and Veronika Zenz. 2011. CLEF-IP 2011: Retrieval in the Intellectual Property Domain. In CLEF 2011 Labs and
     Workshop, Notebook Papers, 19-22 September 2011, Amsterdam, The Netherlands (CEUR Workshop Proceedings, Vol. 1177), Vivien Petras, Pamela Forner, and Paul D.
     Clough (Eds.). CEUR-WS.org. http://ceur-ws.org/Vol-1177/CLEF2011wn-CLEF-IP-PiroiEt2011.pdf
[16] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark,
     Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. CoRR abs/2103.00020 (2021).
     arXiv:2103.00020 https://arxiv.org/abs/2103.00020
[17] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference
     on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong
     Kong, China, November 3-7, 2019, Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for Computational Linguistics, 3980–3990.
     https://doi.org/10.18653/v1/D19-1410
[18] Josiane Rodrigues, Marco Cristo, and Juan G. Colonna. 2020. Deep hashing for multi-label image retrieval: a survey. Artificial Intelligence Review 53, 7 (2020),
     5261–5307.
[19] Walid Shalaby and Wlodek Zadrozny. 2019. Patent retrieval: a literature review. Knowledge and Information Systems 61, 2 (2019), 631–660.
     https://doi.org/10.1007/s10115-018-1322-7
[20] Panagiotis Sidiropoulos, Stefanos Vrochidis, and Ioannis Kompatsiaris. 2011. Content-based binary image retrieval using the adaptive hierarchical density
     histogram. Pattern Recognition 44, 4 (2011), 739–750. https://doi.org/10.1016/j.patcog.2010.09.014
[21] Arnold W. M. Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain. 2000. Content-based Image Retrieval at the End of the Early Years.
     IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 12 (2000), 1349–1380.
[22] Cees G. M. Snoek and Arnold W. M. Smeulders. 2010. Visual-Concept Search Solved? Computer 43, 6 (2010), 76–78. https://doi.org/10.1109/MC.2010.183
[23] Stefanos Vrochidis, Anastasia Moumtzidou, and Ioannis Kompatsiaris. 2012. Concept-based patent image retrieval. World Patent Information 34 (2012), 292–303.
[24] Stefanos Vrochidis, Anastasia Moumtzidou, and Ioannis Kompatsiaris. 2014. Enhancing Patent Search with Content-Based Image Retrieval. In Professional Search
     in the Modern World - COST Action IC1002 on Multilingual and Multifaceted Interactive Information Access, Georgios Paltoglou, Fernando Loizides, and Preben
     Hansen (Eds.). Lecture Notes in Computer Science, Vol. 8830. Springer, 250–273. https://doi.org/10.1007/978-3-319-12511-4_12
[25] Stefanos Vrochidis, S. Papadopoulos, Anastasia Moumtzidou, Panagiotis Sidiropoulos, Emanuele Pianta, and Ioannis Kompatsiaris. 2010. Towards content-
     based patent image retrieval: A framework perspective. World Patent Information 32 (2010), 94–106.
[26] Haohan Wang, Songwei Ge, Zachary C. Lipton, and Eric P. Xing. 2019. Learning Robust Global Representations by Penalizing Local Predictive Power.
     In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December
     8-14, 2019, Vancouver, BC, Canada, Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (Eds.).
     10506–10518. https://proceedings.neurips.cc/paper/2019/hash/3eefceb8087e964f89c2d59e8a249915-Abstract.html
[27] Jun Wang, Wei Liu, Sanjiv Kumar, and Shih-Fu Chang. 2015. Learning to hash for indexing big data - A survey. Proceedings of the IEEE 104, 1 (2015), 34–57.
[28] Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et al. 2017. A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence
     40, 4 (2017), 769–790.
[29] Liping Yang, Ming Gong, and Vijayan K. Asari. 2020. Diagram Image Retrieval and Analysis: Challenges and Opportunities. In 2020 IEEE/CVF Conference on
     Computer Vision and Pattern Recognition, CVPR Workshops 2020, Seattle, WA, USA, June 14-19, 2020. IEEE, 685–698. https://doi.org/10.1109/CVPRW50498.2020.00098
[30] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. 2018. Learning Transferable Architectures for Scalable Image Recognition. In 2018 IEEE
     Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. IEEE Computer Society, 8697–8710.
     https://doi.org/10.1109/CVPR.2018.00907


