Use of Pseudo Relevance Feedback for Patent Clustering
                 with Fuzzy C-means

                               Noushin Fadaei                          Thomas Mandl
                            Hildesheim University                   Hildesheim University
                            Hildesheim, Germany                     Hildesheim, Germany
                          fadaei@uni-hildesheim.de                 mandl@uni-hildesheim.de


                                                                 are growing very fast and required to be more or-
                                                                 ganized for such purposes. According to European
                         Abstract                                Patent Office (EPO), there is a steady increase in this
                                                                 huge information resource in terms of filed patents
    Patent databases are meaningful resources for                since 2010b [Off18]. In 2016, EPO recorded the high-
    technology trend detection as they collect in-               est number of granted patents which went up to 10.1%
    formation on the recent key innovations; how-                in 2017 [Off18]. To manage the detection of subtopics
    ever the importance of wordings in patents                   and their potentiality of being a trend in such big data,
    and use of complex content are remarkable                    further grouping of patents is inevitable.
    challenges in key word extraction in the text                   Patent analysis approaches are either qualitative or
    mining phase. Moreover patents share infor-                  quantitative [Hon09]. The focus of this work lies on
    mation by nature and depending on the crite-                 qualitative patent analysis, i.e., it uses text-mining
    rion of classification such as materials or uses,            techniques regarding the content of patents unlike
    one may belong to multiple classes. For clus-                quantitative approaches which employ metadata. The
    tering patents, this work proposes an updat-                 process typically aims to transform documents to vec-
    ing fuzzy c-means clustering which employs                   tors using weighting systems (e.g., TFIDF) for key
    pseudo relevance feedback originating from in-               terms; Clustering methods then exploit similarity mea-
    formation retrieval in order to improve fea-                 sures to examine related vectors (e.g., Euclidean dis-
    tures extracted from the patent collection fol-              tance) and group them into clusters. The efficiency
    lowing feedbacks. The results show a notice-                 of clustering is highly dependent on the selected key
    able improvement in clustering after applying                terms [TJC04]. While the key words extraction re-
    pseudo relevance feedback clustering patents                 quires the contribution of experts in various domains,
    under topic contact lenses.                                  automatic methods restrict the number of selected
                                                                 terms by applying thresholds for term frequency in
1    Introduction                                                documents (TF) or document frequency (DF) [TLL07]
                                                                 or consider building key phrases for example by creat-
The increasing competition in research conducted at              ing a co-relation matrix of high frequent terms in docu-
universities and other research institutes as well as in         ments where the co-occurrence of terms in the dataset
industry, further intensified by the increasing global-          is reserved [THTL06]. This work adopts pseudo rele-
ization, reinforces the importance of identifying new            vance feedback [MRS08b] to obtain the key terms that
trends at an early stage. According to a study by                tend to form the core concept of clusters.
Thomson Reuters 70% to 90% of the information cov-
                                                                    One of the most commonly used clustering algo-
ered in patents depending on the research area is
                                                                 rithms in text-mining is k-means; however k-means
not published anywhere else [Cor07]. This quality of
                                                                 is not the best clustering method for patent analy-
patents makes such databases vital resources for find-
                                                                 sis. Patents are very rich documents in terms of pro-
ing new trends in the industry. Yet patent databases
                                                                 fessional information and they may cover a range of
Copyright © CIKM 2018 for the individual papers by the papers'   technologies, applications or use of various materials.
authors. Copyright © CIKM 2018 for the volume as a collection
                                                                 Therefore exhaustive clustering confines patents that
                                                                 potentially belong to different classes of a certain crite-
by its editors. This volume and its papers are published under
the Creative Commons License Attribution 4.0 International (CC
BY 4.0).
rion such as technology or material [FMS+ 15]. More-              to be represented as vectors (X = (x1 , x2 , . . . , xn )2 ≤
over k-means requires the number of clusters and can-             c ≤ n). Each vector xk ∈ Rs is built up by s features
not cope with outliers [CHPT05].                                  where the features are the selected key terms from the
   To avoid low accuracy, fuzzy non-exhaustive clus-              dataset. The goal is to organize X into groups that
tering methods are exploited for patent clustering, e.g.,         may share vectors for further patent analysis purposes
by [THTL06] and [DD09]. The fuzziness of the clus-                like identifying trends in the given topic [HXW+ 12].
tering method provides a likelihood of possession (or
membership to cluster) instead of a rigid distinction             2.1     Citations In Text
and the overlapping characteristic of clustering allows
one patent reflecting a number of claims contributes              Citations within the text should indicate the author’s
in multiple clusters. Setting a threshold for member-             last name and year[?]. Reference style[?] should follow
ship controls to what extent of similarity patents may            the style that you are used to using, as long as the
show up in a cluster. Fuzzy c-means (FCM) [Bez81] is              citation style is consistent.
the most popular fuzzy clustering algorithms and here
we use it to softly partition our patent collection on            3     Our Approach
certain topics. Also, the membership matrix enables
                                                                  3.1     Fuzzy C-Means Algorithm (FCM) [Bez81]
FCM to cope with the issue of the outliers as all mem-
bership values of one document to all clusters ought to           Fuzzy c-means algorithm (FCM) [Bez81] is based on
add up to 1; thus, an unrelated document to all clus-             the fuzzy membership matrix. The membership de-
ters receives insignificant membership values for each            scribes the likelihood of each vector (document) xk
and every cluster and can be ignored.                             being a part of cluster (subtopic) ci where 1 ≤ k ≤ n
   The ultimate goal of this study is to investigate the          and 1 ≤ i ≤ c with c being the number of clusters. The
main idea of partitioning a patent dataset by means of            overall membership of each vector is normalized to 1.
a fuzzy clustering which allows for updating through              The algorithm starts off with initializing c vectors as
relevance feedback. Practically this study will run               centroids of the clusters. Then the membership (wik )
with help of patent expert users however for experi-              is calculated through the Euclidean distances (dik ) of
mental purposes and analysing the validity of the ap-             each vector (xk ) to the centroid (Pi ) of the cluster it
proach, we organize it with FCM and pseudo rele-                  belongs to and to the centroids of other clusters.
vance feedback. For the evaluation, we developed a                                      Pn         m
benchmark based on World Intellectual Property Orga-                                     k=1 (wik ) xk
                                                                                    Pi = P n         m
                                                                                                                          (1)
nization’s patent reports on recent technology trends                                      k=1 (wik )
which were edited by experts in the domains. The
                                                                                           c
reports describe technologies and trends within a do-                               (b)
                                                                                           X            1
main, e.g. contact lenses or robotic arms. These re-                              Wik =                        2
                                                                                                  (dik )( b) m−1
                                                                                                                          (2)
                                                                                           j=1 [( (djk )( b) )   ]
ports are published at the website of the World Intel-
lecutal Property Organization (WIPO) 1 . The selected
reports are provided by Gridlogics Technologies Pvt.                The FCM membership function is calculated as
Ltd and were partially generated by the use of Patent             [HXW+ 12]:
iNSIGHT Pro. The developed benchmark can be used                                          c                        −1
for experiments in classification, clustering, trend anal-                              X      ||xj − vi ||A m−1
                                                                                                               2
                                                                              µ(i,j) = [     (               )   ]        (3)
ysis or other intelligent patent processing systems.                                     t=1
                                                                                               ||xj − vt ||A

2    Problem Statement                                               µ(i,j) represents the membership value of jth patent
                                                                  of the dataset and ith topic whose centroid is vi . ||||A
Given the query of the expert user which determines
                                                                  stands for norm function.
the scope of the topic or more generally given the In-
ternational Patent Classification (IPC) 2 of the topic of
                                                                     The memberships updates centroids vectors until
interest, we retrieve the patents of the domain from the
                                                                  the overall distance of the updated centroids is less
patent dataset. The set of n patents is then supposed
                                                                  than ε compared to the last set of centroids. The
    1 http://www.wipo.int/patentscope/en/programs/patent          steps of FCM algorithm are as follows:
landscapes/plrdb.html
    2 According to World Intellectual Property Organiza-

tion(WIPO), The International Patent Classification (IPC), es-
tablished by the Strasbourg Agreement 1971, provides for a hi-        1. Set the number of clusters to be found (c)
erarchical system of language independent symbols for the clas-
sification of patents.                                                2. Set an Euclidean normalization and fuzziness (m)
 3. Initialize of cluster prototype P 0 , set the iterative                                similarity score to the query. Then the top k docu-
    counter (b)                                                                            ments are considered as the feedback for best results.
                                                                                           Pseudo relevance feedback considers these documents
 4. Obtain membership matrix using Equation 2                                              the source of relevant key terms to the query.
    above                                                                                     We make use of this method in FCM so that we
 5. Update the centroids using Equation 1 above                                            find the patents that are more similar to the core con-
                                                                                           cept of the cluster or centroids. The hypothesis is the
 6. Repeat starting from step 2 until ||P (b) −                                            key terms provided by these patents reflect the main
    P (b−1) || <                                                                          idea of the cluster and their corresponding vectors can
                                                                                           represent a more manageable dataset for FCM. The
   The algorithm description and formulations are in-                                      procedure is shown in figure 1. It generates an over-
spired by [NNB15].                                                                         lapping clustering based on the membership matrix of
                                                                                           FCM. As long as a relevance feedback is requested the
3.2   Updating Fuzzy C-Means using Pseudo                                                  following procedure runs:
      Relevance Feedback
                                                                                               1. Rank patents based on their membership to the
                                                                                                  cluster
               Parameter initialization (number                                                2. Obtain the top relevant patents for building the
               of clusters (c), fuzziness (m) and
                         iterations (b))                                                          features for vectors (k% of the size of the corre-
                                                                                                  sponding cluster)

              Initialize centroids 𝑃0 (update 𝑝𝑖 if                                            3. Update X = (x1 , x2 , . . . , xn )2 ≤ c ≤ n (dataset of
                    called/already initialized)
                                                                                                  vectors) with new feature

                                                                                               4. Pass X to FCM so it starts updating the mem-
              Calculate the membership matrix               X: set of vectors of dataset          bership function
               by updating membership values
                                                            X with reduced dimension:
                             𝑊𝑖𝑘
      NO                                                     update features after RF          5. Stop if after clustering result of FCM no relevance
                                                            with the key terms of only
                                                            top 10% similar vectors to
                                                                                                  feedback (RF) is requested
                                                            the final clusters centroids
                     Minimum Ɛ Or
                   Maximum iteration?                                                      4      Experiment
                      YES                                                                  4.1     Datasets and Experiments Settings
                        Clustering Result                                                  We used freely available queries that are provided
                                                                                           by the World Intellectual Property Organization
                                                      YES                                  (WIPO) to form a gold standard. The result sets
                         Relevance
                                                                                           were gained through the European Patent Fulltext
                         feedback?                                                         (EPFULL) repository. For this work, we used a set,
                   NO
                                                                                           namely contact lenses to be clustered and another
                                                                                           set, robotic arms to show an insight of the built gold
                             STOP
                                                                                           standard. The code was implemented in Python using
                                                                                           Natural Language Toolkit (NLTK) for tokenization
                                                                                           and stemming but for removing the stop words,
Figure 1: Updating fuzzy c-means using pseudo rele-                                        we used our list of stop words for patent analysis.
vance feedback                                                                             For Fuzzy c-means clustering we adopted the code
   Relevance feedback is a common concept in infor-                                        provided in Github 3 . The overlapping is controlled
mation retrieval that involves the user knowledge to                                       by allowing the patents that are at least 99% similar
identify and fetch more similar results to the query.                                      to a member of the cluster inside the cluster as a
Pseudo relevance feedback resembles the same con-                                          member. The fuzzifier is set on 1.2, the error on 0.001
cept but the entire process is carried out automati-                                       and the number of iterations on 200. We run the
cally. The goal is to identify the relevant key terms                                      method for c=3, c=13 and c=23 and for visualization
regarding the query and to expand them in order to                                         in 2D, we have used Principal Component analysis
update the query and make it in line with more rele-                                       (PCA).
vant documents. The procedure starts with retrieving
relevant documents and sorting them based on their                                             3 https://github.com/holtwashere/PossibilisticCMeans
4.2   Evaluation Metrics                                     4.3    Gold Standard based on World Intellec-
                                                                    tual Property Organization (WIPO) Re-
Evaluation measures used for granulated clusters (e.                ports
g. k-means clusters) such as Dunn, Mutual Informa-
                                                             There are a number of technological reports available
tion (MI), F-measure, Rand Index and Jaccard are not
                                                             by WIPO under the universal title of public health/life
quite useful for measuring overlapping clusters; for in-
                                                             science. These reports usually cover two types of
stance Dunn gives a higher score to the clustering sys-
                                                             queries; one results in the main class (usually shares
tems that assign the data points to more distant clus-
                                                             the title with the topic that is reported) and one is
ters while the contents of each cluster are pretty close.
                                                             breaking the main query into pieces; thus produces
Considering a criterion like uses, we know that one
                                                             subclasses. Some reports categorize the foregoing sub-
patent might have several usages and such measures
                                                             classes and provide fewer yet more general subclasses.
are not revealing any required information. Purity ex-
                                                             The following example depicts search strings provided
poses the very nature of a cluster: the degree of consis-
                                                             by WIPO report which results to a main class, namely
tency. The higher is the purity the less is random clus-
                                                             robotic arms followed by a further query that along
tering. This is one important characteristic of clusters
                                                             with the main query leads into a subclass of robotic
in the patent clustering task, nevertheless we cannot
                                                             arms:
neglect the drawbacks of Purity: it is highly dependent
on the number of clusters. Like Purity, MI is also in-
fluenced by the number of clusters, while Normalized
                                                              1. Query to the main class robotic arm:
Mutual Information (NMI) enables us to compare the
clusters with each other (it ranges between 0 and 1)               (FT=(robot* or (artificial w/2 intelligence) or
and it is not affected greatly by the inaccurate number            android or cyborg or humanoid*)) or (TAC=
of clusters. For assessing the quality of this clustering,         (manipulator* or manipulater* or actuator*
we have used Normalized Mutual Information (NMI)                   or actuater* or drives or joint or joints or
by Fred and Jain [FJ03] and Purity [MRS08a].                       actuation or (”end effector” or ”end effecter”)
                                                                   or ((pneumatic* or air) w/2 muscle*))) and
                                                                   ((IC= B25J9/02 or B25J9/04 or B25J9/06 or
                                                                   B25J13/02 or B25J13/08 or B25J17 or B25J18)
                                                                   or (UC=901/2 or 901/14 or 901/19 or 901/27
                                                                   or 901/31 or 901/39 or 700/245 or 700/248 or
                                                                   700/261))


                                                              2. Query to its subclass Anthropomorphic
                                                                 Robot:
                                                                   (TAC) contains (humanoid or android or anthro-
                                                                   pomorphic* or anthropomorfic*)
Figure 2: FCM results on 13 clusters under topic Con-
tact Lenses
                                                                   Query guide: FT-Full Text, TAC- Title Abstract
                                                                   Claim, IC- International Class, UC- US Class.
                                                                   w/2 shows the maximum number of intervening
                                                                   unmatched positions doesn’t exceed 2.


                                                                Depending on the authors of the reports, queries are
                                                             described in different languages and formats. There-
                                                             fore the queries had to be adjusted to match Json
                                                             data type. Using Elasticsearch’s API, we collected
                                                             data through the EPFULL database and obtained
                                                             the main class and subsequently the corresponding
                                                             subclasses. For instance under the topic robotic arms,
Figure 3: FCM and RF results on 13 clusters under            519 patents were retrieved, out of which 511 patents
topic Contact Lenses                                         were covered by subclasses from which 293 hits belong
                                                             to the subclasses of the types criterion, 474 hits fit
                                                             in the subclasses of applications criterion and the
                                                           [CHPT05] Bernard Chen, R. Harrison, Yi Pan, and
Table 1: Evaluation results of clustering of the topic
                                                                    Phang C. Tai. Novel hybrid hierarchical-
Contact Lenses
                                                                    k-means clustering method (h-k-means)
                    NMI                  Purity                     for microarray analysis. In Proceedings
 #Clusters
              FCM FCM&RF FCM FCM&RF                                 of the 2005 IEEE Computational Systems
 3 clusters       0.18       0.22      0.76      0.77               Bioinformatics Conference - Workshops,
 13 clusters      0.05       0.16      0.78      0.77               CSBW ’05, pages 105–108, Washington,
 23 clusters      0.09       0.13      0.76      0.70               DC, USA, 2005. IEEE Computer Society.

subclasses of parts criterion cover 701 hits. Patents      [Cor07]    The Thomson Corporation. Global patent
may share different subclasses and some available                     sources: An overview of international
patents in EPFULL may not be covered by any sub-                      patents. 2007.
class. The results of the proposed clustering method       [DD09]     Türkay Dereli and Alptekin Durmuşoğlu.
have been examined against these retrieved subclasses.                Classifying technology patents to identify
                                                                      trends: Applying a fuzzy-based clustering
                                                                      approach in the turkish textile industry.
5   Results                                                           Technology in Society, 31(3):263 – 272,
                                                                      2009.
The results of clustering under the topic contact lenses
shows using pseudo relevance feedback can definitely       [FJ03]     Ana L N Fred and Anil K. Jain. Ro-
help patents picking the better cluster. According to                 bust data clustering. In Proceedings of
table 1, while the Purity results remain more or less                 the IEEE Computer Society Conference
the same the normalized mutual information (NMI)                      on Computer Vision and Pattern Recog-
improved 4% to 11% for all three number of clusters.                  nition, volume 2, 2003.
Purity is significantly dependent on the number of
clusters as when the data themes are noticeably fewer      [FMS+ 15] Noushin Fadaei, Thomas Mandl, Michael
than the number of clusters the chance of having more                Schwantner, Mustafa Sofean, Julia M.
consistent clusters would raise. However The results of              Struß, Katrin Werner, and Christa
Purity under the topic Contact Lenses do not change                  Womser-Hacker.     Patent analysis and
diversely for different number of clusters and it remain             patent clustering for technology trend
relatively high. This reflects the acceptable ability of             mining. In HIER workshop, Hildesheim,
FCM in this clustering. Moreover, visualizing cluster-               Germany, July 2015.
ing at 13 clusters with FCM (figure 2) and FCM with        [Hon09]    Soonwoo Hong. The magic of patent infor-
influence of RF (figure 3) using PCA, we observe the                  mation. World Intellectual Property Or-
data points are less scattered after using RF.                        ganization (WIPO). Available via DIA-
                                                                      LOG., December 2009.
6   Future work
                                                           [HXW+ 12] Ming Huang, Zhixun Xia, Hongbo Wang,
The main purpose of this study is to test whether the                Qinghua Zeng, and Qian Wang. The range
use of relevance feedback in clustering can play an im-              of the value for the fuzzifier of the fuzzy
portant role in grouping patents. In future we would                 c-means algorithm. Pattern Recogn. Lett.,
use explicit relevance feedback of the expert users to               33(16):2280–2284, December 2012.
modify the dimension of the vectors with more useful
features. We would also make use of lexical resources      [MRS08a] Christopher D. Manning, Prabhakar
on top of the feedbacks. In future we would also survey             Raghavan, and Hinrich Schütze. Intro-
the position of feedback vectors as they are selected               duction to Information Retrieval. Cam-
by users and may not be necessary appear around the                 bridge University Press, New York, NY,
centroid.                                                           USA, 2008.

                                                           [MRS08b] Christopher D. Manning, Prabhakar
References                                                          Raghavan, and Hinrich Schütze. Introduc-
                                                                    tion to Information Retrieval. Cambridge
[Bez81]        James C. Bezdek. Pattern Recognition                 University Press, Cambridge, UK, 2008.
               with Fuzzy Objective Function Algorithms.
               Kluwer Academic Publishers, Norwell,        [NNB15]    Janmenjoy Nayak, Bighnaraj Naik, and
               MA, USA, 1981.                                         HS Behera. Computational intelligence in
          data mining. Springer, New Delhi, India,
          2015.

[Off18]   European Patent Office. Epo quality re-
          port 2017. 2018.
[THTL06] Amy J. C. Trappey, Fu-Chiang Hsu,
         Charles V. Trappey, and Chia-I Lin. De-
         velopment of a patent document classifi-
         cation and search platform using a back-
         propagation network. Expert Syst. Appl.,
         31:755–765, 2006.
[TJC04]   Y. H. Tseng, D. W. Juang, and S. H.
          Chen. Global and local term expansion for
          text retrieval. In Proceedings of the fourth
          NTCIR workshop on evaluation of infor-
          mation retrieval, automatic text summa-
          rization and question answering, Tokyo,
          Japan, June 2004.
[TLL07]   Yuen-Hsien Tseng, Chi-Jen Lin, and
          Yu-I Lin. Text mining techniques for
          patent analysis. Inf. Process. Manage.,
          43(5):1216–1247, September 2007.