<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PatentExplorer: Refining Patent Search with Domain-specific Topic Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mark Buckley</string-name>
          <email>mark.buckley@siemens.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sophia Althammer∗</string-name>
          <email>sophia.althammer@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arber Qoku∗†</string-name>
          <email>arber.qoku@dkfz-heidelberg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>∗Work done while at Siemens AG.</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>German Cancer Consortium (DKTK)</institution>
          ,
          <addr-line>Heidelberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Siemens AG</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>TU Vienna</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Also with German Cancer Research Center (DKFZ)</institution>
          ,
          <addr-line>Heidelberg</addr-line>
          ,
          <country country="DE">Germany.</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>50</fpage>
      <lpage>54</lpage>
      <abstract>
        <p>Practitioners in the patent domain require high-recall search solutions that nevertheless return precise results from a large search space. Traditional search solutions focus on retrieving semantically similar documents; however, we argue that the different topics in a patent document should also be taken into account during search. In this paper we present PatentExplorer, an in-use system for patent search which empowers users to explore the different topics of semantically similar patents and to refine the search by filtering on these topics. PatentExplorer first uses similarity search to retrieve patents for a list of patent IDs or a given patent text, and then offers the ability to refine the search results by their different topics, using topic models trained on the domains in which our users are active.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The ever-increasing volume [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and linguistic complexity of
published patent documents mean that searching for both high
precision and high recall results for a given information need is a
challenging problem. Practitioners in the patent domain require
search results of high quality [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], as they provide the input to
processes such as infringement litigation or freedom-to-operate
clearing [
        <xref ref-type="bibr" rid="ref15 ref23">15, 23</xref>
        ]. The use of machine learning and deep learning
methods for patent analysis is a vibrant research area [
        <xref ref-type="bibr" rid="ref12 ref5">5, 12</xref>
        ] with
application in technology forecasting, patent retrieval [
        <xref ref-type="bibr" rid="ref19 ref4">4, 19</xref>
        ], patent
text generation [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] or litigation analysis. There has been much
research on patent domain language which shows that the
sections of a patent constitute different genres depending on their legal
or technical purpose [
        <xref ref-type="bibr" rid="ref20 ref23">20, 23</xref>
        ]. We reason that patents consist of
different topics contained in the different sections of the document.
The example in Figure 1 shows how a patent in the field of
database systems can include topics such as physical storage of data
or search interfaces—for a given patent search goal one of these
could be relevant while the other is not. In industrial settings it is
additionally important that search tools are particularly sensitive
to individual companies’ domains of interest, thereby improving
the quality of search results.
      </p>
      <p>[Figure 1: Excerpt of a patent on a real-time database system,
containing both a topic on the physical storage of partition replicas
and a topic on user search interfaces for document types including
video and audio.]
To provide an effective patent search tool under these
conditions we present PatentExplorer, an in-use system for patent search
which empowers users to explore the different topics in search
results and to refine the results by those topics. PatentExplorer uses
similarity search for first-stage retrieval and domain-specific topic
modelling for refinement of the search results. We propose topic
modelling for search refinement because a patent
document typically deals with multiple related but orthogonal subjects.
For a particular information need, some but not all of these will be
relevant. We therefore combine a document-level analysis
(similarity) with a sub-document-level analysis (topic models) for patent
search. The intention is that the user can retrieve a large set of
semantically related patents and inspect the topic distributions of
the most similar ones. To refine the results, the user can
apply filters on specific topics, thereby increasing the task-specific
relevance of the most highly ranked results.</p>
      <p>This paper presents the design and user interface of the in-use
web application which implements this idea, as well as its technical
description. The system has been designed with a particular user
persona in mind. The intended user is a patent search professional,
and is therefore familiar with patent search tools, has deep
knowledge of existing patent search methodologies, such as boolean
retrieval and category filtering, and has broad technical
knowledge of the relevant industrial domains.</p>
    </sec>
    <sec id="sec-2">
      <title>BACKGROUND</title>
      <p>In this section we give some background on related work on
patent search tools and introduce the methods for
similarity search and topic modelling which we employ in PatentExplorer.</p>
    </sec>
    <sec id="sec-3">
      <title>Related work</title>
      <p>
        Patent search holds several domain-specific challenges for
information retrieval [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Furthermore, serving the specific use-case setting
of practitioners in a company requires company-specific
adaptation of the search solution. Different techniques and approaches
have been explored to improve and refine the search results in the
patent domain, ranging from query expansion [
        <xref ref-type="bibr" rid="ref16 ref2 ref25">2, 16, 25</xref>
        ] to term
selection [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. For prior art retrieval in the CLEF-IP workshop [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ],
Verma and Varma [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] demonstrate high retrieval performance by
representing a patent document by its IPC classes and computing
similarity between patents based on those classes. Patent search
tools have mainly addressed the challenge of high coverage of all published patents,
either with a federated approach [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] or a single access point
via text editor [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Methods
2.2.1 Similarity search. Similarity search is a retrieval method
where for a given query document, a ranked list of semantically
relevant documents is computed, as shown in Figure 2. The
general approach is to first embed the query document into a vector
representation which encodes its semantics. This representation
is then compared to the equivalent representations for each of the
known documents in the search index. The results are then sorted
by similarity score and the highest ranking results are presented to
the user. The similarity function is usually cosine similarity.</p>
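      <p>As an illustration, the first-stage ranking described above can be sketched as follows. This is an illustrative example using scikit-learn and toy documents, not the production implementation:</p>

```python
# Similarity search sketch: embed documents and query as tf-idf vectors,
# rank the indexed documents by cosine similarity to the query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

index_docs = [                       # stand-ins for indexed patent abstracts
    "a database system storing replicated partitions on storage units",
    "a user interface for searching video and audio documents",
    "a combustion engine with improved fuel injection",
]
query = "an interface that lets users search document collections"

vectorizer = TfidfVectorizer()
index_matrix = vectorizer.fit_transform(index_docs)   # one row per document
query_vec = vectorizer.transform([query])             # embed the query the same way

scores = cosine_similarity(query_vec, index_matrix).ravel()
ranking = scores.argsort()[::-1]                      # highest-scoring documents first
```

Here the interface-related document is ranked first because it shares vocabulary with the query, while the unrelated documents receive a score of zero.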
      <p>
        The crucial step is to find an embedding which computes a
suitable document representation. Different representations have been
used in previous research, for instance tf-idf weighted sparse
representations, latent semantic indexing, or contextualised document
embeddings, for instance computed by a BERT model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Despite the semantic richness of contextualised document
embeddings, sparse representations have been found to be competitive
in large scale retrieval scenarios [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. We employ tf-idf weighted
sparse representations in PatentExplorer for retrieving similar patents
in the first stage. Large-scale retrieval needs efficient indexing,
such as algorithms for approximate nearest neighbour search [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
to avoid computing the cosine similarity scores for every document
in the search space. Therefore we employ approximate nearest
neighbor search on the sparse representations in PatentExplorer.
2.2.2 Topic models. Topic models help to understand the internal
structure of large text data sets by summarising the themes which
occur in the documents [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Topic modelling is an unsupervised
approach (i.e., no labelled data is required) and can be applied to any
domain. The only assumptions are the distributional hypothesis,
namely that the frequency of occurrence of words and phrases is a good
reflection of the strength and prevalence of themes, and the
assumption that documents are in general a mixture of several topics. The
topic modelling process begins by converting a set of documents
into a sparse document-term matrix A containing weighted
feature frequencies for each document. The topic modelling algorithm
transforms this matrix into a pair of matrices W and H such that
A ≈ W × H.
H, the topic-term matrix, encodes the weight of each feature with
respect to the topics, and W, the document-topic matrix, contains a
latent representation for each document showing which topics it
belongs to.
      </p>
      <p>
        We consider two algorithms for topic modelling in this work,
latent Dirichlet allocation (LDA) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and non-negative matrix
factorisation (NMF) [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. LDA is a generative model which treats
documents as a distribution over topics and topics as a distribution
over words. NMF is a method for decomposing large matrices of
non-negative values into the product of smaller matrices, in this
case the matrices W and H. In both cases the topic distributions
(the rows of the matrix W) can be interpreted as a document
representation and can thus be compared and analysed. The index of
the largest value of each row of W is interpreted as the most likely
topic for that document. Topic modelling has previously been used
in the patent domain, for instance for technology forecasting [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>PATENTEXPLORER</title>
      <p>In this section we first present the user interface of PatentExplorer
and then give some implementation details about the architecture, the
data, and the similarity and topic models employed in
PatentExplorer.</p>
    </sec>
    <sec id="sec-5">
      <title>User interface</title>
      <p>The user interaction begins with the submission of a list of patent
IDs (accession numbers) or the text of a patent, as shown in Figure 3.
The system retrieves the text of the patents given in the list of
patent IDs and creates a local copy of the text content of each of the
patents. The number of patents from the list found in the index
is indicated with "Dataset contains - documents". The user can then
submit the "Dataset" to the system to retrieve similar documents
based on the similarity search.</p>
      <p>For each of the similar documents, the system also computes
their topic distribution. The distribution is displayed along with the
accession number and similarity score between the query patent
and each similar patent, as shown in Figure 4. The most highly
weighted words for each topic, drawn from the matrix H, are
displayed by hovering over the bars. The figure also shows the filter
function which the system provides to re-rank the search results
according to their topics. Both positive and negative filters can be
applied. Positive filters lift matching documents to
the top of the ranked list; negative filters discard matching
documents from the results set. For both filter types, a
list of topics can be specified in the text field on the left-hand side,
as well as two slider values. The two slider values restrict when the
filter will match: a document matches if at least one of the chosen
topics has a weight of at least "min probability" in the topic
distribution of that document. The default value is 0.1. With a max rank
of k, the filter will also only match if the chosen topic is among
the k most highly weighted topics in the distribution for that
document. So if the query document is the example in Figure 1, the
user could inspect the topic distribution to find, for instance, the
topic concerning physical storage, and apply a negative filter to
remove it, leaving those results which have more to do with user
search. Finally, when the user has finished applying filters to the
search results, the results set can be downloaded as tabular data,
preserving the filtered order and including similarity scores.</p>
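      <p>The matching rule and the two filter types described above can be sketched as follows. The helper names are hypothetical, not the system's actual code; the default min probability of 0.1 follows the text:</p>

```python
# Sketch of the topic filters: a document "matches" if at least one chosen
# topic has weight >= min_prob in its topic distribution AND that topic is
# among the document's max_rank most highly weighted topics.
def matches(topic_dist, topics, min_prob=0.1, max_rank=5):
    order = sorted(range(len(topic_dist)), key=lambda t: -topic_dist[t])
    top = set(order[:max_rank])
    return any(topic_dist[t] >= min_prob and t in top for t in topics)

def apply_filters(results, positive=(), negative=(), min_prob=0.1, max_rank=5):
    """results: ranked list of (doc_id, similarity, topic_dist) tuples."""
    # Negative filter: discard matching documents from the results set.
    kept = [r for r in results
            if not (negative and matches(r[2], negative, min_prob, max_rank))]
    # Positive filter: lift matching documents to the top, preserving order.
    if positive:
        hits = [r for r in kept if matches(r[2], positive, min_prob, max_rank)]
        rest = [r for r in kept if not matches(r[2], positive, min_prob, max_rank)]
        kept = hits + rest
    return kept
```

For example, a positive filter on a "user search" topic lifts documents dominated by that topic above otherwise more similar ones, while a negative filter on a "physical storage" topic removes storage-heavy documents entirely.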
      <p>Technical implementation
3.2.1 Data. To prepare the components of our system we collected
two overlapping data sets. The source is a commercially provided
database of patent abstracts in which patents from patent offices
worldwide have been translated into a consistent, English-language
form. We chose this data source in order to achieve maximum
uniformity of the input data; however, PatentExplorer makes no
strong assumptions about the content of the documents, and would
also work on publicly available patent data. The Our-Portfolio data
set contains the patents whose assignee is our company or one of its
subsidiaries. It contains 73k documents. We filtered this data set
to only contain patents filed since 2010, resulting in a set of 36k
documents. The All-Patents data set is the collection of all patents
published between 2014 and 2020, which contains approximately
15 million documents. For both data sets we extract the title and
abstract of each patent.
3.2.2 Architecture. The architecture of the system is shown in
Figure 5. The two main components are the similarity search and
the topic model. Each component offers an API with one function:
"get-similar-ids" and "get-topic-distribution", respectively. The
"get-similar-ids" function receives one or more patent IDs and retrieves
the most similar documents from the search index, where similarity is
defined as the cosine similarity between document representations.
This is equivalent to finding the nearest neighbours of the query
document in the representation space. The "get-topic-distribution"
function receives a single patent ID and computes the topic
distribution for that document from the previously trained topic
model. The search index and the topic model are static resources
which are not changed at run time. Both components retrieve the
patent document content directly from the "Patent documents"
database as required, so that the user need only supply document IDs.
3.2.3 Training the topic model. The topic model is trained on the
Our-Portfolio data set. The documents were preprocessed to remove
approximately 50 patent-specific stop words, such as "invention" or
"apparatus", as well as the usual English stop words. We performed
stemming and then extracted all n-grams for n = 1, 2, 3, 4 to construct
the document-term matrix. We discarded words which occurred in
fewer than 10 documents or in more than 40% of the documents.</p>
      <p>
        In preliminary experiments we used a coherence metric to
investigate the optimal parameters for the topic model. In recent
years, several approaches to measure coherence have been
developed based on distributional properties of word pairs over a set of
words [
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ], which mostly differ in the pairwise scoring metric
being used. A typical choice is pointwise mutual information (PMI),
which measures the strength of association between words in a
data set within windows of a given size.
      </p>
      <p>
        We use the distributional-semantics coherence score proposed by Aletras and
Stevenson [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For each of the top N words of a topic, an N-dimensional
context vector is created whose elements are the normalised PMI
values of that word with each of the other top words of the topic.
Each word is then assigned the cosine similarity of its context
vector and the sum of the other context vectors. The coherence
score of the topic is the average of all of these cosine similarities.
      </p>
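      <p>The coherence computation described above can be sketched as follows, given a precomputed matrix of normalised PMI values between a topic's top words. The matrix values used in the illustration are hypothetical:</p>

```python
# Coherence sketch: row i of the N x N `npmi` matrix is the context vector of
# top word i. Each word is scored by the cosine similarity of its context
# vector with the sum of the other context vectors; the topic score is the
# average over all top words.
import numpy as np

def topic_coherence(npmi):
    n = npmi.shape[0]
    sims = []
    for i in range(n):
        v = npmi[i]
        rest = npmi.sum(axis=0) - v        # sum of the other context vectors
        denom = np.linalg.norm(v) * np.linalg.norm(rest)
        sims.append(float(v @ rest / denom) if denom > 0 else 0.0)
    return float(np.mean(sims))
```

A topic whose top words all co-occur strongly (uniformly high NPMI values) scores close to 1, while words that never co-occur score near 0.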
      <p>
        To investigate which parametrisation of topic modelling works
best for patent text we took a sample of 513k English-language
patents from those published in 2010. We removed duplicates and
documents which were either very long or very short, leaving a set
of approximately 255k documents. As we show in Table 1, both LDA
and NMF exhibit similar performance on this data set, as measured
by this coherence score, with NMF discovering marginally better topics. We find
upon manual inspection that NMF is more robust across a wide
range of number of topics. We therefore choose NMF to implement
the system. We finally use NMF with 75 topics to train the topic
model for the system on the Our-Portfolio data set.
3.2.4 Compiling the search index. To compile the search index we
must first compute an embedding for each document in the search
space. We use latent semantic indexing (LSI) to compute the
document vectors, which is the result of tf-idf vectorisation followed
by SVD compression [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Rather than computing the tf-idf weights
from the entire All-Patents data set, we instead compute the tf-idf
weights from the Our-Portfolio data set, so that each document
embedding in the search space will encode information which is
relevant to our industrial domains. We then apply an SVD
compression into 200 dimensions in order to reduce the size of each
document vector and therefore the size of the overall search index.
We use the resulting LSI projection function to compute a document
embedding for each of the 15m documents in the All-Patents data
set.
      </p>
      <p>To implement the lookup of documents given a query
document we use Annoy, a library which provides approximate nearest
neighbour search. Each document embedding is normalised before
insertion so that the cosine similarity can be computed with the dot
product. The similarity component of the system provides
an endpoint which returns the IDs of the k most similar documents
for a given query document and a given k.</p>
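      <p>The normalisation trick can be sketched as follows. The production system uses Annoy; the sketch shows the exact computation that the approximate index stands in for, with hypothetical helper names:</p>

```python
# Exact nearest-neighbour sketch (what Annoy approximates): normalise every
# embedding so that a dot product equals cosine similarity, then return the
# top-k documents for a query.
import numpy as np

def build_index(embeddings):
    """Normalise each row so that dot products equal cosine similarities."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return embeddings / norms

def get_similar_ids(index, query_vec, k):
    """Hypothetical counterpart of the "get-similar-ids" endpoint."""
    q = query_vec / (np.linalg.norm(query_vec) or 1.0)
    scores = index @ q                       # cosine similarity to every document
    top = scores.argsort()[::-1][:k]
    return top.tolist(), scores[top].tolist()
```

Annoy trades the exhaustive dot product over all 15m vectors for an approximate tree-based lookup, which is what makes the latency acceptable at this scale.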
    </sec>
    <sec id="sec-6">
      <title>4 CONCLUSION AND FUTURE WORK</title>
      <p>In this paper we present PatentExplorer, an in-use system for patent
search. PatentExplorer gives users the ability to retrieve similar
patents given a list of patent IDs or a patent text, and to refine their
search results according to the different topics of the patents. The
topic models are tailored to the domain-specific topics of a company
operating in the technical domain.</p>
      <p>Tailoring the search representation and topic models to our
domains turned out, in initial user testing, to offer mixed results.
Feedback from patent search experts indicates that while the
system can deliver relevant results within our domains, outside of
these domains it can return few or no relevant
documents among the ten highest-ranked results. While building and
testing our system we have found that the requirements of patent
search use cases place high demands on the accuracy of dedicated
search tools. In order to reduce the latency of the similarity search
to an acceptable level we were forced to simplify the similarity
computation, using a compressed tf-idf representation where a
contextualised document embedding may well have produced better
results. It is also crucial to provide full coverage: the dataset of
patents which the system contains goes back only to 2014, whereas
for prior art searches all previously published patents should be
discoverable. Finally, the need to update the search index
continuously leads to considerable recurring computational load and data
management tasks; this is not yet provided for.</p>
      <p>Our future work to improve the system will include expanding
the system architecture to efficiently handle a larger number of
documents in the search space. In the longer term we intend to
investigate introducing more appropriate document
representations to be used in the search index, for instance by using a large
language model such as BERT, or by learning the representations
via a supervised auxiliary task.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] [n.d.]. U.S. Patent Statistics Chart. https://www.uspto.gov/web/offices/ac/ido/oeip/taf/us_stat.htm. Accessed: 2021-06-04.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Bashar</given-names>
            <surname>Al-Shboul</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sung-Hyon</given-names>
            <surname>Myaeng</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Query Phrase Expansion Using Wikipedia in Patent Class Search</article-title>
          . In Information Retrieval Technology, Mohamed Vall Mohamed Salem, Khaled Shaalan, Farhad Oroumchian, Azadeh Shakery, and Halim Khelalfa (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg,
          <fpage>115</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Nikolaos</given-names>
            <surname>Aletras</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Stevenson</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Evaluating Topic Coherence Using Distributional Semantics</article-title>
          .
          <source>In Proceedings of the 10th International Conference on Computational Semantics (IWCS</source>
          <year>2013</year>
          )
          <article-title>- Long Papers</article-title>
          .
          <article-title>Association for Computational Linguistics</article-title>
          , Potsdam, Germany,
          <fpage>13</fpage>
          -
          <lpage>22</lpage>
          . https://www.aclweb.org/anthology/W13-0102
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Sophia</given-names>
            <surname>Althammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Hofstätter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Allan</given-names>
            <surname>Hanbury</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Cross-domain Retrieval in the Legal and Patent Domains: a Reproducibility Study</article-title>
          .
          <source>In Advances in Information Retrieval, 43rd European Conference on IR Research</source>
          , ECIR
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Leonidas</given-names>
            <surname>Aristodemou</surname>
          </string-name>
          and
          <string-name>
            <given-names>Frank</given-names>
            <surname>Tietze</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>The state-of-the-art on Intellectual Property Analytics (IPA): A literature review on artificial intelligence, machine learning and deep learning methods for analysing intellectual property (IP) data</article-title>
          .
          <source>World Patent Information</source>
          <volume>55</volume>
          (12
          <year>2018</year>
          ),
          <fpage>37</fpage>
          -
          <lpage>51</lpage>
          . https://doi.org/10.1016/j.wpi.2018.07.002
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>David M.</given-names>
            <surname>Blei</surname>
          </string-name>
          , Andrew Y Ng, and
          <string-name>
            <given-names>Michael I</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>Journal of Machine Learning Research 3</source>
          , Jan (
          <year>2003</year>
          ),
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Manajit</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          , David Zimmermann,
          and
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Crestani</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>PatentQuest: A User-Oriented Tool for Integrated Patent Search</article-title>
          .
          <source>In Proceedings of the 11th International Workshop on Bibliometric-enhanced Information Retrieval co-located with 43rd European Conference on Information Retrieval (ECIR</source>
          <year>2021</year>
          ), Lucca,
          <source>Italy (online only)</source>
          ,
          <source>April 1st</source>
          ,
          <source>2021 (CEUR Workshop Proceedings</source>
          , Vol.
          <volume>2847</volume>
          ), Ingo Frommholz, Philipp Mayr, Guillaume Cabanac, and Suzan Verberne (Eds.).
          <source>CEUR-WS.org</source>
          ,
          <volume>89</volume>
          -
          <fpage>101</fpage>
          . http://ceur-ws.org/Vol-2847/paper-09.pdf
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Scott</given-names>
            <surname>Deerwester</surname>
          </string-name>
          , Susan T. Dumais, George W. Furnas, Thomas K. Landauer,
          and Richard Harshman.
          <year>1990</year>
          .
          <article-title>Indexing by latent semantic analysis</article-title>
          .
          <source>Journal of the American society for information science 41</source>
          ,
          <issue>6</issue>
          (
          <year>1990</year>
          ),
          <fpage>391</fpage>
          -
          <lpage>407</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers).
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . https://doi.org/10.18653/v1/N19-1423
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Mona</given-names>
            <surname>Golestan Far</surname>
          </string-name>
          , Scott Sanner, Mohamed Reda Bouadjenek, Gabriela Ferraro, and
          <string-name>
            <given-names>David</given-names>
            <surname>Hawking</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>On Term Selection Techniques for Patent Prior Art Search</article-title>
          .
          <source>In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (Santiago, Chile) (SIGIR '15)</source>
          .
          <article-title>Association for Computing Machinery</article-title>
          , New York, NY, USA,
          <fpage>803</fpage>
          -
          <lpage>806</lpage>
          . https://doi.org/10.1145/2766462.2767801
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Johnson</surname>
          </string-name>
          , Matthijs Douze, and
          <string-name>
            <given-names>Hervé</given-names>
            <surname>Jégou</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Billion-scale similarity search with GPUs</article-title>
          .
          <source>IEEE Transactions on Big Data</source>
          (
          <year>2019</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>1</lpage>
          . https://doi.org/10.1109/TBDATA.2019.2921572
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Ralf</given-names>
            <surname>Krestel</surname>
          </string-name>
          , Renukswamy Chikkamath, Christoph Hewel, and
          <string-name>
            <given-names>Julian</given-names>
            <surname>Risch</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>A survey on deep learning for patent analysis</article-title>
          .
          <source>World Patent Information</source>
          <volume>65</volume>
          (
          <year>2021</year>
          ). https://doi.org/10.1016/j.wpi.2021.102035
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Jieh-Sheng</given-names>
            <surname>Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jieh</given-names>
            <surname>Hsiang</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>PatentTransformer-2: Controlling Patent Text Generation by Structural Metadata</article-title>
          . arXiv:2001.03708 [cs.CL]
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Yi</given-names>
            <surname>Luan</surname>
          </string-name>
          , Jacob Eisenstein, Kristina Toutanova, and Michael Collins.
          <year>2020</year>
          .
          <article-title>Sparse, Dense, and Attentional Representations for Text Retrieval</article-title>
          . CoRR abs/2005.00181 (
          <year>2020</year>
          ). arXiv:2005.00181 https://arxiv.org/abs/2005.00181
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Mihai</given-names>
            <surname>Lupu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Allan</given-names>
            <surname>Hanbury</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Patent Retrieval</article-title>
          .
          <source>Foundations and Trends® in Information Retrieval</source>
          <volume>7</volume>
          ,
          <issue>1</issue>
          (
          <year>2013</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>97</lpage>
          . https://doi.org/10.1561/1500000027
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Walid</given-names>
            <surname>Magdy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gareth J.F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>A Study on Query Expansion Methods for Patent Retrieval</article-title>
          .
          <source>In Proceedings of the 4th Workshop on Patent Information Retrieval (Glasgow, Scotland, UK) (PaIR '11)</source>
          .
          Association for Computing Machinery
          , New York, NY, USA,
          <fpage>19</fpage>
          -
          <lpage>24</lpage>
          . https://doi.org/10.1145/2064975.2064982
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>David</given-names>
            <surname>Mimno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hanna M.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Edmund</given-names>
            <surname>Talley</surname>
          </string-name>
          , Miriam Leenders, and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Optimizing semantic coherence in topic models</article-title>
          .
          <source>In Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          . Association for Computational Linguistics,
          <fpage>262</fpage>
          -
          <lpage>272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>David</given-names>
            <surname>Newman</surname>
          </string-name>
          , Jey Han Lau,
          <string-name>
            <given-names>Karl</given-names>
            <surname>Grieser</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Timothy</given-names>
            <surname>Baldwin</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Automatic evaluation of topic coherence</article-title>
          .
          <source>In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics</source>
          .
          Association for Computational Linguistics
          ,
          <fpage>100</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Florina</given-names>
            <surname>Piroi</surname>
          </string-name>
          , Mihai Lupu, and
          <string-name>
            <given-names>Allan</given-names>
            <surname>Hanbury</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Overview of CLEF-IP 2013 Lab</article-title>
          .
          <source>In Information Access Evaluation. Multilinguality, Multimodality, and Visualization</source>
          , Pamela Forner, Henning Müller, Roberto Paredes, Paolo Rosso, and Benno Stein (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg,
          <fpage>232</fpage>
          -
          <lpage>249</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Julian</given-names>
            <surname>Risch</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ralf</given-names>
            <surname>Krestel</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Domain-specific word embeddings for patent classification</article-title>
          .
          <source>Data Technol. Appl</source>
          .
          <volume>53</volume>
          (
          <year>2019</year>
          ),
          <fpage>108</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Tony</given-names>
            <surname>Russell-Rose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jon</given-names>
            <surname>Chamberlain</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Leif</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Information retrieval in the workplace: A comparison of professional search practices</article-title>
          .
          <source>Information Processing and Management</source>
          <volume>54</volume>
          ,
          <issue>6</issue>
          (
          <year>2018</year>
          ),
          <fpage>1042</fpage>
          -
          <lpage>1057</lpage>
          . https://doi.org/10.1016/j.ipm.2018.07.003
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Mike</given-names>
            <surname>Salampasis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Allan</given-names>
            <surname>Hanbury</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>PerFedPat: An integrated federated system for patent search</article-title>
          .
          <source>World Patent Information</source>
          <volume>38</volume>
          (
          <year>2014</year>
          ). https://doi.org/10.1016/j.wpi.2014.08.001
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Walid</given-names>
            <surname>Shalaby</surname>
          </string-name>
          and
          <string-name>
            <given-names>Wlodek</given-names>
            <surname>Zadrozny</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Patent retrieval: a literature review</article-title>
          .
          <source>Knowledge and Information Systems</source>
          (
          <year>2019</year>
          ). https://doi.org/10.1007/s10115-018-1322-7
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Rashish</given-names>
            <surname>Tandon</surname>
          </string-name>
          and
          <string-name>
            <given-names>Suvrit</given-names>
            <surname>Sra</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Sparse nonnegative matrix approximation: new formulations and algorithms</article-title>
          . (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Wolfgang</given-names>
            <surname>Tannebaum</surname>
          </string-name>
          , Parvaz Mahdabi, and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Rauber</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Effect of Log-Based Query Term Expansion on Retrieval Effectiveness in Patent Searching</article-title>
          , Vol.
          <volume>9283</volume>
          .
          <fpage>300</fpage>
          -
          <lpage>305</lpage>
          . https://doi.org/10.1007/978-3-319-24027-5_32
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Manisha</given-names>
            <surname>Verma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vasudeva</given-names>
            <surname>Varma</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Exploring Keyphrase Extraction and IPC Classification Vectors for Prior Art Search</article-title>
          , Vol.
          <volume>1177</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>