=Paper= {{Paper |id=Vol-2621/CIRCLE20_01 |storemode=property |title=Forecasting Patent Growth by Combining Time-Series Signals Using Covariance Patterns |pdfUrl=https://ceur-ws.org/Vol-2621/CIRCLE20_01.pdf |volume=Vol-2621 |authors=Manajit Chakraborty,Seyed Ali Bahrainian,Fabio Crestani |dblpUrl=https://dblp.org/rec/conf/circle/ChakrabortyBC20 }} ==Forecasting Patent Growth by Combining Time-Series Signals Using Covariance Patterns== https://ceur-ws.org/Vol-2621/CIRCLE20_01.pdf
    Forecasting Patent Growth By Combining Time-Series Signals
                     Using Covariance Patterns
            Manajit Chakraborty                                     Seyed Ali Bahrainian                                    Fabio Crestani
    Università della Svizzera italiana (USI)               Università della Svizzera italiana (USI)           Università della Svizzera italiana (USI)
                 Switzerland                                            Switzerland                                        Switzerland
        manajit.chakraborty@usi.ch                             seyed.ali.bahreinian@usi.ch                            fabio.crestani@usi.ch

ABSTRACT                                                                             involved and complex technologies, such as those used in renew-
Bibliometrics has been employed previously with patents for tech-                    able energy, patents are usually built upon existing technologies.
nological forecasting. The primary challenge that technological                      In such cases, it can sometimes be difficult to identify a clear-cut
forecasting faces is early-stage identification of technologies with                 key patent.
the potential to have a significant impact on the socio-economic                     Technological forecasting has already been endorsed as an integral
landscape. Bibliographic measures, such as citations, are good                       element to stay ahead of the curve for corporations and
indicators of technological growth. With this intuition, we carry out                governments [7]. Previous studies, like the one by Acs et al. [1], suggested
an exploratory study using various time-series models and topic                      that patents provide a fairly reliable measure of innovative activity.
modeling over patent content to predict the growth or decline of                     Citation analysis and especially bibliometrics [3] has been used on
various bibliographic measures for topics in the near future. Intu-                  citation graphs to identify similar works or to calculate the impact
itively, in order to effectively uncover these citation trends shortly               factor of journals, researchers etc. Predicting citation counts for
after the patents are issued, we need to look beyond raw citation                    patents is non-trivial and also less useful because citation counts in
counts and take into account both the geographical and temporal                      patents do not change as rapidly as in scholarly papers or other web
information. We posit that, instead of using only citation counts for                articles. On the other hand, the change in citation counts, which
time-series prediction, judicious use of signals from topics gener-                  refers to the rise or decline of the patents of certain categories
ated from documents belonging to various geographical locations                      could provide us a quantitative as well as a qualitative overview
can help improve the performance. We carry out experiments on a                      of patent landscape. It will also indicate which topics, and in turn
large collection of patents and present some insightful results and                  which technological classes, are supposed to get traction in the up-
observations.                                                                        coming years. Discovering topics from patents and analyzing their
                                                                                     evolution over time is beneficial for making important decisions by
CCS CONCEPTS                                                                         research institutes, corporations, funding agencies, governments
                                                                                     and any other organization involved in production or promotion of
• Information systems → Data analytics.
                                                                                     intellectual properties. For example, research funding organizations
                                                                                     can adjust their granting policies based on insights produced by
KEYWORDS
                                                                                     predictive models in order to favor topics that are trending and
Correlation Analysis, Patent Citations, Prediction, Time series,                     gaining increasing attention rather than those that are losing mo-
Topic Modeling, LDA, Citation Growth.                                                mentum and interest.
                                                                                     There are several factors which determine how innovation evolves
1    INTRODUCTION                                                                    in a particular geographical location over a period of time which
A patent is a contract between the inventor or assignee and the                      includes political, social, environmental and judicial policies among
state, granting a limited period of time to the inventor to exploit                  others. While it is nearly impossible to chart all the factors and
the invention. The reasons for patenting could be myriad, ranging                    measure its impact on innovation, investigating how innovation
from the elementary need for exclusive rights to a technology or                     grows irrespective of such influences is still important. While topic
invention to building a positive image of an enterprise. Patents are                 models have been used to forecast emerging technologies from the
pivotal for technological innovation in the context where they ap-                   vantage point of technological classes [28], they have not been used
ply. They can be used to generate revenues, encourage synergistic                    in tandem with citations or to chart the citation growth of tech-
partnerships, or to create a market advantage and be the basis for                   nology classes. In this paper, we assume that technology classes
technological development.                                                           are represented by a group of patents which can further be de-
Patent citations, namely references to prior patent documents and                    lineated by topics drawn from them. We hypothesize that, if we
the state-of-the-art included therein, and their frequency are also                  can intelligently leverage the information from both citations and
often used as indicators for the technological and commercial value                  full-text of patents through topic modeling with additional inputs
of a patent [20]. Citations are also used to identify “key” patents,                 from the geographical regions from where certain topics emerge, it
which often vary depending on the nature of the technology.                          should aid us in predicting the growth of citations in the next time
In pharmaceutical technologies, e.g., a patent on one important                      slice with increased accuracy. In light of this, our contributions are
substance can be determined as a key patent. However, for more                       two-fold:

"Copyright © 2020 for this paper by its authors. Use permitted under Creative Com-   https://hbr.org/1967/03/technological-forecasting
mons License Attribution 4.0 International (CC BY 4.0)."                             https://ec.europa.eu/invest-in-research/pdf/download_en/final_report_hcp.pdf


      • To provide a time-series representation that exploits and            topics that produce words with certain probabilities. Unlike latent
        infuses the signals arising from patent (text) content, geo-         semantic analysis [8], the topics coming from LDA are easier to
        graphical region and accompanying bibliographic measures.            interpret, because they are represented by combinations of words
      • To improve citation growth prediction using the infused              with contribution probabilities for each topic [29]. Besides, LDA is
        signal with various time-series models.                              known as one of the best topic models when dealing with a large
In order to achieve our first goal, we propose a correlation analysis        corpus and to interpret the identified latent topics [5].
scheme, which we term as CORrelation Analysis using COvariance
(CORACO), to harness the best of both worlds from topics derived                 LDA is known to outperform other dimension-reduction tech-
using LDA [5] from patents and bibliographic measure namely                  niques when dealing with a large corpus and to interpret the iden-
citation counts. The second goal is realized by employing regression         tified latent dimensions [5]. Regarding patent-based analysis, stud-
based time-series models on these infused correlated components.             ies have applied LDA to the technological trend identification of
Our experiments are performed on a large open-source patent                  greenhouse gas reduction technology [18], knowledge organization
collection, MAREC. The efficiency of our approach is corroborated            system development [12], and firms’ technological concentration
by the significantly improved prediction performance over three              trends on patent subjects [28]. LDA can identify sub-topics for a
different baseline models.                                                   technology area composed of many patents, and represent each of
                                                                             the patents in an array of topic distributions. Kim et al. [19] use
2     RELATED WORKS                                                          LDA for visualizing development paths among patents through
                                                                             sensitivity analyses based on semantic patent similarities and cita-
It has already been established that statistical analysis of interna-
                                                                             tions. Here, the authors use LDA to identify sub-topics of a given
tional patent records is a valuable tool for corporate technology
                                                                             technology. Topic models have also been employed for patent classi-
analysis and planning. Patents provide a wealth of detailed infor-
                                                                             fication [27] among other problems. Time-Series analysis has been
mation, comprehensive coverage of technologies and countries, a
                                                                             previously used by Holger Ernst [9] to examine the relationship
relatively standardized level of invention, and long time-series of
                                                                             between patent applications and subsequent changes of company
data [21]. So, it essentially provides us with an indicator to measure
                                                                             performance. We aim to harness the power of both the bibliographic
technological growth, which in turn could be extrapolated to get
                                                                             measures and topic modeling to provide us with a more accurate
a better understanding of the relation and mutual dependence of
                                                                             prediction of citation growth using several time-series models.
innovation and economics [16, 22]. One such study to analyze how
quantitative R&D and technology indicators may be used to fore-
cast company stock price performance was carried out by Patrick              3 METHODOLOGY
Thomas [26]. On the other hand, full-text analysis of patents using          3.1 Topic Extraction
topic modeling has also yielded interesting insights for technologi-         Our first step involved extraction of topics from the collection of
cal classification, clustering and prediction. With this in mind, the        patent documents that suitably indicate the latent themes of the col-
existing literature can be grouped into two broad themes in the              lection. For this, we employed LDA with default parameters. Since
context of our research problem:                                             the document collection is large and we needed to find the best
                                                                             representation of the data using LDA, we performed an optimal topic
2.1     Use of Citations for Technological                                   number estimation. To this end, similar to the method proposed by
        Forecasting                                                          Griffiths and Steyvers [10], we performed a model selection pro-
One of the early studies to measure the technological impact based           cess. This consists of keeping the LDA Dirichlet hyperparameters
on patent citations was done by Karki [17]. He proposed a host of            (commonly known as 𝛼 and 𝛽) fixed and assigning several values to
technological indicators based on citations among patents. Some              𝐾 (parameter for controlling the number of topics). We computed
studies, like the one by Albert et al. [2], have considered only             an LDA model for each assignment, and subsequently, we picked
citation counts as indicators of industrially important patents. Zhang       the model that satisfies:
et al. [30] proposed to weight 11 indicators of patent value using           \[ \arg\min_{K} \log P(W \mid K) \]
Shannon entropy, and selected forward citations as one of the most
important indicators for technological value. The basic motivation           where 𝑊 indicates all the words in the corpus. We repeated this
for using citations received as an indicator of quality is that cita-        process for 𝐾 from 100 to 600 in steps of 50 to find the optimal
tions indicate some form of knowledge spillovers. As argued by               number of topics for all the time slices. We found the optimum
Jaffe et al. [14], citations reflect the fact that either a new technology   value at 𝐾 = 500 topics.
builds on an existing one, or that they serve a similar purpose.             The next step involved measuring the topic strength T𝑗 of each topic
                                                                             𝑡𝑗 per year 𝑦, which can be defined as in Equation 1, where |𝐷𝑦|
2.2     Topic Models for Patent Analysis                                     denotes the total number of documents in year 𝑦.
Latent Dirichlet Allocation (LDA) is a generative topic model which
finds latent topics in a text corpus, based on the assumption that           \[ \mathcal{T}_{j,y} = \frac{1}{|D_y|} \sum_{i=1}^{|D_y|} p(t_j \mid d_i) \qquad (1) \]
authors generally write documents with respect to specific topics [5].
Using the LDA process, a document is represented as a mixture of
                                                                             The topic probabilities, 𝑝(𝑡𝑗|𝑑𝑖), are produced by the LDA as scores
http://www.ifs.tuwien.ac.at/imp/marec.shtml                                  for each document 𝑑𝑖 along with the topics corresponding to 𝑑𝑖.
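As a minimal sketch of Equation 1, the following Python snippet averages the per-document topic probabilities within each year. The function name `topic_strength` and the input layout (a list of `(year, probabilities)` pairs) are assumptions made for illustration; the paper does not describe its implementation.

```python
from collections import defaultdict


def topic_strength(docs, num_topics):
    """Return strength[j][y]: the mean of p(t_j | d_i) over documents of year y.

    `docs` is a list of (year, probs) pairs, where probs[j] = p(t_j | d_i)
    as produced by a fitted topic model (hypothetical input layout).
    """
    sums = defaultdict(lambda: [0.0] * num_topics)  # per-year running sums
    counts = defaultdict(int)                       # |D_y| for each year
    for year, probs in docs:
        counts[year] += 1
        for j in range(num_topics):
            sums[year][j] += probs[j]
    # Divide each per-year sum by |D_y|, as in Equation 1.
    return {
        j: {y: sums[y][j] / counts[y] for y in counts}
        for j in range(num_topics)
    }


# Toy input: three documents, two topics.
docs = [
    (1980, [0.9, 0.1]),
    (1980, [0.5, 0.5]),
    (1981, [0.2, 0.8]),
]
strength = topic_strength(docs, num_topics=2)
```

Restricting `docs` to the patents of a single continent or country would presumably yield the region-specific strengths in the same way.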




                Figure 1: Topic Strength Distribution                                             Figure 2: Citation Count Distribution
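The normalized citation count plotted in Figure 2 is defined by Equation 2 in Section 3.2. The sketch below is one possible reading of that equation, assuming that a document counts towards topic 𝑡𝑗 whenever 𝑡𝑗 is among the topics assigned to it. The dictionary keys and the function name are hypothetical and are not taken from the paper.

```python
def normalized_citation_count(docs, topic, year):
    """Citations received in `year` by that year's documents assigned to
    `topic`, divided by the number of documents in that year (|D_y|).

    `docs` is a list of dicts with hypothetical keys:
      "year": year of the document,
      "topics": set of topic ids assigned to the document,
      "citations": {year: citations received in that year}.
    """
    in_year = [d for d in docs if d["year"] == year]   # D_y
    if not in_year:
        return 0.0
    total = sum(
        d["citations"].get(year, 0)
        for d in in_year
        if topic in d["topics"]                        # t_j -> d_i
    )
    return total / len(in_year)


# Toy collection: three documents in 1990 and one in 1991.
docs = [
    {"year": 1990, "topics": {2, 7}, "citations": {1990: 3}},
    {"year": 1990, "topics": {7}, "citations": {1990: 1}},
    {"year": 1990, "topics": {2}, "citations": {}},
    {"year": 1991, "topics": {2}, "citations": {1991: 5}},
]
c_topic2_1990 = normalized_citation_count(docs, topic=2, year=1990)
```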


   Additionally, we also compute topic strength for each of the six                    certain topics. Hence, our next step was to observe the changes in
continents listed in Section 4.2 (Antarctica has no patents),                          citation patterns for topics over the 27-year period. The normalized
\(\mathcal{T}_j^{continent}\), and for each of the 127 countries,                      citation count C𝑗,𝑦 for topic 𝑡𝑗 for year 𝑦 is determined according
\(\mathcal{T}_j^{country}\), in the dataset for each of the years                      to Equation 2:
1980 through 2006. The topic strength distribution helps us gauge
the change in the importance of a topic over a given period. This                      \[ \mathcal{C}_{j,y} = \frac{1}{|D_y|} \sum_{t_j \to d_i \in D_y} c_{d_i,y} \qquad (2) \]
measure is better than topic frequency since it
not only considers topic count but also accounts for the potential
contribution of a certain topic 𝑡 𝑗 to some document 𝑑𝑖 . For instance,                where 𝑐𝑑𝑖 ,𝑦 denotes the number of citations received by document
the topic:                                                                             𝑑𝑖 in year 𝑦. For the same topic example, “mobile radio station” in
                                                                                       the previous section, the corresponding citation count distribution
      • Topic: 2 (Mobile Radio Station)                                                is presented in Figure 2.
                                                                                       Now that we have two different signals, one from the text of
      • Words: 0.236*“station” + 0.158*“mobile” + 0.088*“radio” +                      patents (topic strength) and one bibliometric measure (citations),
        0.052*“stations” + 0.014*“uplink” + 0.013*“downlink” +                         we would like to judiciously combine them in such a way that
        0.011*“access” + 0.010*“quality” + 0.010*“cellular” +                          the combination maximizes the accuracy of predicting citation growth (or
        0.008*“traffic”                                                                decline). Our objective thus translates to quantitatively capturing
extracted from our collection, which corresponds to “mobile radio                      the correlation between these two signals.
stations”, exhibits the topic strength distribution shown in
                                                                                       CORACO: In this paper, we propose a method for finding corre-
Figure 1. From this figure, we can observe how the topic distribu-
                                                                                       lation components between two such independent distributions
tion changes with time depending on the geographical region (in
                                                                                       that maximize the commonality of two signals. We call this method
this case, countries). When observed across topics, it also gives a
                                                                                       CORrelation Analysis with COvariance (CORACO). Thus, our
qualitative overview of which countries are more invested in which
                                                                                       problem can be redefined as a problem of finding two sets of basis
topics.
                                                                                       vectors, one for a and the other for b, such that the projections of
                                                                                       the variables onto the covariance matrix of the two signals would
3.2     Correlated Time-Series Analysis                                                be maximized. Here, for simplicity let us assume, a and b are place-
While topic strength analysis may provide us with some clues                           holders for T𝑘 and C𝑘 for any given topic 𝑡𝑘 , respectively. Let us
regarding how the topics and the corresponding patents related                         assume the linear combinations 𝑎 = a𝑇 ŵ𝑎 and 𝑏 = b𝑇 ŵ𝑏 of the
to the topics are changing over a certain period of time; citations                    two variables a and b respectively, where ŵ𝑎 and ŵ𝑏 are canoni-
provide us with a more tangible resource which helps us measure                        cal weights. We want to consider the case where only one pair of
the change in interest towards certain patents and by association,                     basis vectors are required corresponding to the largest correlation


component. This indicates that the function to be maximized is:              T𝑘 and C𝑘, is 27 since we are observing the values for 27 years
\[ \begin{aligned}                                                           (1980-2006). Now, while C𝑘 is fixed, T𝑘 can vary depending on
\rho &= \frac{E[ab]}{\sqrt{E[a^2]\,E[b^2]}} \\                               the geographical region. North America, Asia and Europe have the
     &= \frac{E[\hat{\mathbf{w}}_a^T \mathbf{a}\mathbf{b}^T                  largest shares of patents, as depicted in Table 1. Also, from Table
                \hat{\mathbf{w}}_b]}                                         2 we observe the distribution of patents by country. We chose to
             {\sqrt{E[\hat{\mathbf{w}}_a^T \mathbf{a}\mathbf{a}^T            focus on the top-3 countries for our analysis, since they contribute
                    \hat{\mathbf{w}}_a]\,                                    a large share (58.5%) of the total patents produced in the world.
                    E[\hat{\mathbf{w}}_b^T \mathbf{b}\mathbf{b}^T            Hence, we need to compute the following four sets of CORACO
                    \hat{\mathbf{w}}_b]}} \\                                 components:
     &= \frac{\mathbf{w}_a^T \mathbf{C}_{ab} \mathbf{w}_b}
             {\sqrt{\mathbf{w}_a^T \mathbf{C}_{aa} \mathbf{w}_a\,
                    \mathbf{w}_b^T \mathbf{C}_{bb} \mathbf{w}_b}}
\end{aligned} \qquad (3) \]
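The maximum correlation component can thus be defined as the                 \[ \begin{aligned}
maximum value that 𝜌 can assume with respect to w𝑎 and w𝑏. The               X_{world}, Y_{world} &= \mathrm{CORACO}(\mathcal{T}_k^{world}, \mathcal{C}_k) \\
subsequent correlation components are uncorrelated for different             X_{JP}, Y_{JP} &= \mathrm{CORACO}(\mathcal{T}_k^{JP}, \mathcal{C}_k) \\
solutions, i.e.:                                                             X_{US}, Y_{US} &= \mathrm{CORACO}(\mathcal{T}_k^{US}, \mathcal{C}_k) \\
\[ \begin{cases}                                                             X_{GB}, Y_{GB} &= \mathrm{CORACO}(\mathcal{T}_k^{GB}, \mathcal{C}_k)
E[a_i a_j] = E[\mathbf{w}_{a_i}^T \mathbf{a}\mathbf{a}^T \mathbf{w}_{a_j}]   \end{aligned} \qquad (7) \]
  = \mathbf{w}_{a_i}^T \mathbf{C}_{aa} \mathbf{w}_{a_j} = 0 \\
E[b_i b_j] = E[\mathbf{w}_{b_i}^T \mathbf{b}\mathbf{b}^T \mathbf{w}_{b_j}]   In Figure 3, we present the infused signals as provided by CORACO
  = \mathbf{w}_{b_i}^T \mathbf{C}_{bb} \mathbf{w}_{b_j} = 0 \\               for the World for the same topic as in Section 3.1. By World we
E[a_i b_j] = E[\mathbf{w}_{a_i}^T \mathbf{a}\mathbf{b}^T \mathbf{w}_{b_j}]
  = \mathbf{w}_{a_i}^T \mathbf{C}_{ab} \mathbf{w}_{b_j} = 0
\end{cases} \quad \text{for } i \neq j. \qquad (4) \]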
The projections onto w𝑎 and w𝑏 , i.e. a and b, describe the under-           mean the complete set of patent documents in the collection. The
lying “latent” variables. Now, we know that for any two random               CORACO signals are new projected signals onto the covariance
variables m and n with zero mean, the total covariance matrix can            of the original signals. We can observe that CORACO successfully
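be represented as:                                                           combines the distribution of topic strength of the World with its
\[ \mathbf{C} = \begin{bmatrix} \mathbf{C}_{mm} & \mathbf{C}_{mn} \\         corresponding Citation Count distribution such that it minimizes
   \mathbf{C}_{nm} & \mathbf{C}_{nn} \end{bmatrix}                           points in the distribution where the covariance is large (e.g. between
 = E\left[ \begin{pmatrix} \mathbf{m} \\ \mathbf{n} \end{pmatrix}            1990-1995 in Figure 3). Essentially, it tries to bring both the signals
   \begin{pmatrix} \mathbf{m} \\ \mathbf{n} \end{pmatrix}^T \right]
   \qquad (5) \]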
is a square matrix where C𝑚𝑚 and C𝑛𝑛 are the intra-set covariance            closer on a singular scale to achieve maximum points of similarity.
matrices of 𝑚 and 𝑛 respectively and C𝑚𝑛 = C𝑇𝑛𝑚 is the inter-set             These reinforced signals are then given to time-series models as
covariance matrix.                                                           input for prediction of citation growth. As results in Section 5 will
Thus, the correlation components between a and b can be found                show, using CORACO components instead of raw citation counts
by solving the eigenvalue equations:                                         indeed improves the performance thus validating our hypothesis.
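\[ \begin{cases}
\mathbf{C}_{aa}^{-1}\mathbf{C}_{ab}\mathbf{C}_{bb}^{-1}\mathbf{C}_{ba}\,
  \hat{\mathbf{w}}_a = \rho^2 \hat{\mathbf{w}}_a \\
\mathbf{C}_{bb}^{-1}\mathbf{C}_{ba}\mathbf{C}_{aa}^{-1}\mathbf{C}_{ab}\,
  \hat{\mathbf{w}}_b = \rho^2 \hat{\mathbf{w}}_b                             3.3     Citation Growth Prediction
\end{cases} \qquad (6) \]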
                                                                             The final step is concerned with prediction based on the CORACO
where the eigenvalues 𝜌² are the squared correlations and the                components. We argue that using CORACO components as input to the
eigenvectors w𝑎 and w𝑏 are the normalized correlation basis vectors.         time-series models instead of using the raw distribution of citation
The number of non-zero solutions to these equations is limited to            counts for topics could significantly enhance the performance. This
the smallest dimensionality of a and b.                                      stems from the fact that CORACO components are representations
                                                                             for the correlation between two distributions which should be able
                      Country           No. of docs.                         to model the commonalities better. With this in mind, we choose
                      JP                  159,433                            to employ three different time-series models:
                      US                  148,434                                  1. Linear Regression or Autoregression (AR)
                      GB                   23,869                                  2. Moving Average (MA)
                      NL                   21,767                                  3. Simple Exponential Smoothing (SES)
                      IT                   15,795                            Due to lack of apparent trends or seasonality attributes in the
                      SE                    6,208                            observed variables, we could not use other models such as the
                      DE                    4,799                            Autoregressive Moving Average (ARMA), the Autoregressive
                      CH                    3,990                            Integrated Moving Average (ARIMA) or the Seasonal Autoregressive
                      CA                    3,740                            Integrated Moving-Average (SARIMA). It is imperative to mention
                      KR                    3,634                            that our objective is to observe and predict the direction of change
                      ..                      ..                             in number of citations (increase or decrease) i.e. polarity (ΔC𝑘 ) of
                                                                             the citations for the next time window. As stated earlier, we consider
                   World         567,547
                                                                             one year as a time window. Thus, we are not concerned with predict-
    Table 2: Distribution of patents for top-10 countries                    ing the actual number of citations that a topic is supposed to gain
                                                                             in the next time window. This is because the number of citations
                                                                             a patent receives can vary with several exogenous factors such as
   In our case, the random variables a and b correspond to vectors
T𝑘 and C𝑘 for any given topic 𝑡𝑘 . The length of both these vectors,         https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm




                    Continent                 No. of docs.
                    North America (NA)             220,153
                    Europe (EU)                    162,845
                    Australia (OC)                   4,129
                    South America (SA)                 664
                    Africa (AF)                        890
                    Asia (AS)                      178,288
                    Antarctica (AN)                      0
                    World                          567,547
                    Table 1: Distribution of patents by continents
       Figure 3: CORACO components for T 𝑤𝑜𝑟𝑙𝑑 , C 𝑤𝑜𝑟𝑙𝑑


niche popularity of the technological area, a continuation of similar research or product by a
company or group of companies, a country or region’s sudden interest in a particular technological
class etc. The trend of numbers of citations for any patent and thereby any topic is
non-decreasing, since citations are cumulative. Therefore, our goal is to predict whether the
number of citations for a particular topic is going to increase or decrease in the next time window
compared to the last time slice. This gives us an indication of how popular a topic (and by
extrapolation a technological class) is going to be in the near future and allows funding agencies,
corporations and governments to adjust their funding strategies accordingly towards areas which
show a strong upward growth in the next few years. The results and further analysis of the
prediction performance are discussed in Section 5.

4 EXPERIMENTAL SETUP

4.1 Dataset
For this study, we used the European Patent (EP) collection from the MAtrixware REsearch Collection
(MAREC). MAREC is a static collection of patent applications and granted patents in a unified file
format normalized from EP, WO, US, and JP sources, spanning a range from July 1976 to June 2008.
The collection contains documents in several languages, the majority being English, German and
French, and about half of the documents include full text. In MAREC, the documents from different
countries and sources are normalized to a common XML format with a uniform patent numbering scheme
and citation format. The standardized fields include dates, countries, languages, references,
person names, and companies as well as rich subject classifications. It is a comparable corpus,
where many documents are available in similar versions in other languages. In particular, we
considered only English language patents of the EP sub-collection. We had to discard a few
documents with one or more missing fields such as classification codes, patent citations,
applicant country etc. The final dataset amounted to 567,547 documents. The citation network built
out of this reduced dataset consisted of 646,537 citations. Admittedly, the citation network is
very sparse, which conforms to the norm that patents are not as frequently cited as academic
publications [6].

4.2 Preprocessing
The patent collection has mainly two types of documents: (1) Type A (A1, A2 ...): European patent
application files. (2) Type B (B1, B2 ...): European patent specification files. Of these, we used
A1 and B1 documents since they are the most informative ones and contain the textual content of the
patent. It is important to mention that, in the collection, not all A1 documents have a
corresponding B1 document. A1 documents contain the abstract along with other bibliographic
information, including citations, while B1 documents contain bibliographic information with the
description and claims of the patent. Given this, we had to confine our collection to only patents
that had both A1 and B1 counterparts. If either one was missing, we did not include the patent in
our collection. This step reduced our initial collection of English-language patents from 837,715
to 567,547. After this step, we extracted all relevant bibliographic information (such as
application date, grant date, classification codes, applicant name, applicant country etc.)
including citations in a separate file.
   We combined the title, abstract, description and claims for each patent into a single document
which we refer to as ‘full-text’ of the patent in our paper. It should be noted that in this paper,
we have
used the terms full-text of patents and documents interchangeably. Additional preprocessing steps
include stopword removal (using an extended stopword list of over 800 words combining NLTK
stopwords (https://www.nltk.org/) and other open source libraries), expunging unintelligible and
non-alphabetic terms, and tokenization. The cleaned documents were then segregated based on the
sectors of technology (A-H) they belong to and their countries and continents of origin; the
continent-wise distribution of the patent documents is presented in Table 1. The documents are also
arranged by the year of their patent registration date, so each time slice corresponds to one year.
Based on this distribution, we chose to discard the documents from the years 1978, 1979, 2007 and
2008, since there are very few documents in these years. Retaining these documents tends to skew
the topic distribution negatively. The total number of these discarded documents amounts to less
than 1% of the whole collection. So, essentially our patent dataset consists of patents from the
year 1980 through 2006.

4.3 Tools
For LDA, we used the open source tool Gensim [24], with default settings. The time-series models
were employed from the statsmodels package [25] built in Python. Other Python packages used include
matplotlib [13], SciPy [15], scikit-learn [23], pycountry-convert
(https://pypi.org/project/pycountry-convert/) etc. For experiments, we used a Linux based server
with Intel(R) Xeon(R) CPU @ 2.70GHz with 32 cores and 256 GB memory.

5 RESULTS AND ANALYSIS

   Baseline. For the baseline, we consider the actual citation count distribution, C𝑘 , as input to
the three time-series models. This method tries to predict the change in citation counts from the
historical data alone, giving us three baseline models. We feed the citation counts for each of the
500 topics as a vector for years 1980-2005 as training data. The output from the time-series models
is then compared with the true labels of change of citation counts for the last time slice (2006).
The true labels indicate whether the citation count has actually risen or fallen from the previous
time step. The labels are predetermined and can be one of three types:
     • UP: indicating that the next time slice will receive more citations than the current time
       slice.
     • DOWN: indicating that the next time slice will receive fewer citations than the current time
       slice.
     • UC: indicating that the next time slice will receive as many citations as the current time
       slice.
The label ‘UC’ (unchanged) is an atypical case and occurs only for 38 topics.

   Metric. The metric for calculating the performance of CORACO based prediction is the ratio of
correctly predicted labels against all true labels for 500 topics.

\[
\mathrm{Accuracy}_{model} = \frac{\left|\text{Correctly predicted labels for } \Delta C_k^{n+1}\right|_{model}}{\left|\text{True labels for } \Delta C_k^{n+1}\right|}
\]

The citation counts are integers, while the topic strengths are described by decimal numbers, and
their scales are also different. So, we had to perform min-max normalization on citation counts
before computing the CORACO components for each case. The comparative performance of the models for
four different geographical regions (World, Japan, USA and Great Britain) is presented in Tables 3
and 4. In Table 3, all the 500 topics are considered, while in Table 4, only topics which do not
have ‘UC’ (unchanged) as their labels are considered. This is because, for any prediction
algorithm, it is difficult to accurately predict the last element in the series for the next time
step. Even if the predicted and actual values are close, by our accuracy metric, the prediction
would be considered as either one of ‘UP’ or ‘DOWN’. We can clearly observe that by eliminating
‘UC’ labeled topics, the performance of all models including the baselines improves by a small
margin. Also, the number of such topics, 38, is quite low (7.6% of the total number of topics).
From this table we can observe that in the best case, with Simple Exponential Smoothing, we achieve
78.3% better performance in prediction when compared to the baseline models. Among all the
CORACOs, CORACOGB is the worst performer with the Autoregression model and still it records a 39.9%
improvement. In terms of CORACO components, the topic strength distribution of the world is shown
to provide synergistic improvement to the prediction of polarity of the citations. Among the three
countries, the United States of America has the biggest influence in improving the predictions even
though it is not the largest country in terms of patent production in our dataset. Comparing among
time-series models, Simple Exponential Smoothing seems to provide the biggest gain when CORACO
signals are provided as input but performs poorly for the baseline input. The improvements achieved
by the CORACO models are statistically significant and hence the corresponding results have been
marked with an asterisk in the tables.

Prediction Error Comparison
While our proposed models perform better than the baselines in terms of citation growth
performance, it is interesting to also compare the error in prediction of citation count values. In
Table 5, we present the Mean Absolute Error (MAE) of the baseline models as well as our proposed
models. It must be noted that since the CORACO components have a different scale compared to the
baseline models, we applied min-max normalization on all prediction error vectors (for 500 topics).
From the table, we can clearly observe that the errors produced by our CORACO approaches are lower
than those of the baselines, which operate on actual citation counts. In general, CORACOworld gives
the best performance, similar to citation growth prediction. So, we can conclude that not only do
our proposed models perform better with respect to citation growth prediction accuracy, but their
citation count prediction errors are also lower than those of all three baseline models.

6 CONCLUSIONS AND FUTURE WORK
Patent analysis delivers comprehensive competitive intelligence about innovators and relevant
technologies, and helps estimate the value of patents owned by competitor companies and
governments. Patent citations and topic models have been employed separately in existing literature
towards forecasting of technological growth. In this paper, we proposed a novel approach that
leveraged both patent citation counts and topic importance with geographical relevance to improve
the prediction of patent citation growth in
            Model \ Input    Baselines    CORACOworld    CORACOUS    CORACOJP    CORACOGB
            AR               0.108        0.452*         0.446*      0.454*      0.428
            MA               0.086        0.396*         0.374*      0.390*      0.386*
            SES              0.048        0.470*         0.416*      0.438*      0.434*
Table 3: Accuracy comparison of models with all labels. Best performances are marked in bold. Statistically significant results
are marked with an asterisk (*)
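
As a rough illustration of how a one-step forecast can be turned into the UP/DOWN/UC labels scored in the accuracy tables, the following sketch uses a hand-written simple exponential smoothing recurrence. The smoothing factor alpha, the function names and the toy series are assumptions made for illustration only; the experiments relied on the statsmodels implementations, whose settings are not reproduced here.

```python
def ses_forecast(series, alpha=0.5):
    """One-step-ahead forecast with simple exponential smoothing.

    alpha is a hypothetical smoothing factor, not a value from the paper.
    """
    if not series:
        raise ValueError("series must contain at least one observation")
    level = series[0]  # initialise the level with the first observation
    for value in series[1:]:
        # Blend the new observation with the running level.
        level = alpha * value + (1 - alpha) * level
    return level  # the SES forecast for the next step is the final level


def change_label(previous, nxt):
    """Map the change between two time slices to UP, DOWN or UC."""
    if nxt > previous:
        return "UP"
    if nxt < previous:
        return "DOWN"
    return "UC"


# Toy signal for a single topic (illustrative numbers, not paper data).
history = [4.0, 2.0, 6.0, 3.0]
forecast = ses_forecast(history, alpha=0.5)
predicted = change_label(history[-1], forecast)
```

Because a real-valued forecast almost never equals the last observed value exactly, a rule of this kind will rarely emit UC, which is consistent with the separate evaluation that leaves out UC topics.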


            Model \ Input    Baselines    CORACOworld    CORACOJP    CORACOUS    CORACOGB
            AR               0.139        0.492*         0.476*      0.499*      0.463*
            MA               0.193        0.428*         0.414*      0.402*      0.417
            SES              0.148        0.605*         0.550*      0.574*      0.570*
Table 4: Accuracy comparison of models without UC labels. Best performances are marked in bold. Statistically significant
results are marked with an asterisk (*)
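
The accuracy metric behind Tables 3 and 4 is the fraction of topics whose predicted label matches the true label, computed once over all topics and once after leaving out topics whose true label is UC. A minimal sketch follows; the function name, the drop_unchanged flag and the toy labels are hypothetical and only illustrate the computation.

```python
def accuracy(predicted, actual, drop_unchanged=False):
    """Ratio of correctly predicted labels to all scored true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    if drop_unchanged:
        # Leave out topics whose true label is UC.
        pairs = [(p, a) for p, a in pairs if a != "UC"]
    if not pairs:
        raise ValueError("no topics left to score")
    correct = sum(1 for p, a in pairs if p == a)
    return correct / len(pairs)


# Toy labels for five topics (illustrative only).
true_labels = ["UP", "DOWN", "UC", "UP", "DOWN"]
model_labels = ["UP", "UP", "DOWN", "UP", "DOWN"]

acc_all = accuracy(model_labels, true_labels)
acc_no_uc = accuracy(model_labels, true_labels, drop_unchanged=True)
```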


            Model \ Input    Baselines    CORACOworld    CORACOJP    CORACOUS    CORACOGB
            AR               0.0371       0.0315         0.0321      0.0329      0.0324
            MA               0.0416       0.0363         0.0358      0.0359      0.0357
            SES              0.0383       0.0320         0.0330      0.0321      0.0334
                   Table 5: Prediction Error (MAE) comparison of models. Best performances are marked in bold.
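
The error comparison in Table 5 relies on min-max normalization of the prediction error vectors so that signals on different scales can be compared. The sketch below shows one possible reading of that procedure: per-topic absolute errors are rescaled to [0, 1] and then averaged. The function names and the toy numbers are assumptions, and the exact order of operations used in the experiments is not confirmed by the text.

```python
def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant vector carries no spread; map every entry to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalized_mae(predicted, actual):
    """Mean of min-max normalized absolute errors (assumed procedure)."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    scaled = min_max_normalize(errors)
    return sum(scaled) / len(scaled)


# Toy predictions and true values for four topics (illustrative only).
predicted_counts = [10, 12, 20, 8]
actual_counts = [12, 12, 16, 9]
mae = normalized_mae(predicted_counts, actual_counts)
```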



the next year. To this end, we proposed a covariance-based correlated time-series method that
maximizes the similarity of two distributions. For prediction, we employed three time-series models
and compared our approach against three baseline models, while also providing a comparative
overview of the geographical region’s influence on the prediction. Our results substantiate our
hypothesis that the correlated time-series model modifies the signal in a way that is superior to
all baseline models using the original time-series vectors.
   As part of our future work, we would like to study the impact of our proposed approach on other
complex time-series models such as LSTM networks [11]. We will investigate ways to extend our model
to higher dimensions such that we could find representations of multiple signals. We would also
like to employ dynamic topic models for topic elicitation, such as the one proposed by Bahrainian
et al. [4], to account for topic evolution over time. Lastly, we will apply our model to
time-series problems other than patent analysis.

ACKNOWLEDGEMENTS
We thank the anonymous reviewers for their valuable comments. This work was partially supported by
The Global Structure for Knowledge Networks project grant under the SNSF National Research
Programme 75 “Big Data” (NRP 75).

REFERENCES
 [1] Zoltan J Acs, Luc Anselin, and Attila Varga. 2002. Patents and innovation counts as measures
     of regional production of new knowledge. Research Policy 31, 7 (2002), 1069–1085.
 [2] M.B. Albert, D. Avery, F. Narin, and P. McAllister. 1991. Direct validation of citation counts
     as indicators of industrially important patents. Research Policy 20, 3 (1991), 251–259.
 [3] Leonidas Aristodemou and Frank Tietze. 2018. Citations as a measure of technological impact: A
     review of forward citation-based measures. World Patent Information 53 (2018), 39–44.
 [4] Seyed Ali Bahrainian, Ida Mele, and Fabio Crestani. 2018. Predicting Topics in Scholarly
     Papers. In Advances in Information Retrieval - 40th European Conference on IR Research, ECIR
     2018, Grenoble, France, March 26-29, 2018, Proceedings. 16–28.
 [5] D.M. Blei, A.Y. Ng, and M.I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine
     Learning Research 3, 4-5 (2003), 993–1022.
 [6] Stefano Breschi, G Tarasconi, C Catalini, L Novella, P Guatta, and H Johnson. 2006. Highly
     Cited Patents, Highly Cited Publications, and Research Networks. CESPRI-Bocconi University,
     http://ec.europa.eu/invest-in-research/pdf/download_en/final_report_hcp.pdf (Accessed:
     01/07/2019) (2006).
 [7] Richard S. Campbell. 1983. Patent trends as a technological forecasting tool. World Patent
     Information 5, 3 (1983), 137–143.
 [8] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman.
     1990. Indexing by latent semantic analysis. Journal of the American Society for Information
     Science 41, 6 (1990), 391–407.
 [9] Holger Ernst. 2001. Patent applications and subsequent changes of performance: evidence from
     time-series cross-section analyses on the firm level. Research Policy 30, 1 (2001), 143–157.
[10] Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the
     National Academy of Sciences 101, suppl 1 (2004), 5228–5235.
[11] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8
     (1997), 1735–1780.
[12] Zheng Hu and Xin-Yan Deng. 2014. Aerodynamic interaction between forewing and hindwing of a
     hovering dragonfly. Acta Mechanica Sinica 30, 6 (01 Dec 2014), 787–799.
[13] J. D. Hunter. 2007. Matplotlib: A 2D graphics environment. Computing in Science & Engineering
     9, 3 (2007), 90–95.
[14] Adam B Jaffe, Manuel Trajtenberg, and Michael S Fogarty. 2000. The meaning of patent
     citations: Report on the NBER/Case-Western Reserve survey of patentees. Technical Report.
     National Bureau of Economic Research.
[15] Eric Jones, Travis Oliphant, Pearu Peterson, et al. 2001–. SciPy: Open source scientific tools
     for Python. http://www.scipy.org/ [Online; accessed 01/07/2019].
[16] Junegak Joung and Kwangsoo Kim. 2017. Monitoring emerging technologies for technology planning
     using technical keyword based analysis from patent data.
     Technological Forecasting and Social Change 114 (2017), 281–292.
[17] M.M.S. Karki. 1997. Patent citation analysis: A policy analysis tool. World Patent Information
     19, 4 (1997), 269–272.
[18] Gabjo Kim, Sangsung Park, and Dongsik Jang. 2014. Technology Analysis from Patent Data Using
     Latent Dirichlet Allocation. In Soft Computing in Big Data Processing, Keon Myung Lee,
     Seung-Jong Park, and Jee-Hyong Lee (Eds.). Springer International Publishing, Cham, 71–80.
[19] Mujin Kim, Youngjin Park, and Janghyeok Yoon. 2016. Generating patent development maps for
     technology monitoring using semantic patent-topic analysis. Computers & Industrial Engineering
     98 (2016), 289–299.
[20] Doug Lichtman and Mark A Lemley. 2007. Rethinking Patent Law’s Presumption of Validity. Stan.
     L. Rev. 60 (2007), 45.
[21] Mary Ellen Mogee. 1991. Using Patent Data for Technology Analysis and Planning.
     Research-Technology Management 34, 4 (1991), 43–49.
[22] Eleonora Pantano, Constantinos-Vasilios Priporas, Stefano Sorace, and Gianpaolo Iazzolino.
     2017. Does innovation-orientation lead to retail industry growth? Empirical evidence from
     patent analysis. Journal of Retailing and Consumer Services 34 (2017), 88–94.
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P.
     Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M.
     Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine
     Learning Research 12 (2011), 2825–2830.
[24] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora.
     In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta,
     Malta, 45–50.
[25] Skipper Seabold and Josef Perktold. 2010. Statsmodels: Econometric and statistical modeling
     with python. In 9th Python in Science Conference.
[26] Patrick Thomas. 2001. A relationship between technology indicators and stock market
     performance. Scientometrics 51, 1 (2001), 319–333.
[27] Subhashini Venugopalan and Varun Rai. 2015. Topic based classification and pattern
     identification in patents. Technological Forecasting and Social Change 94 (2015), 236–250.
[28] Bo Wang, Shengbo Liu, Kun Ding, Zeyuan Liu, and Jing Xu. 2014. Identifying technological
     topics and institution-topic distribution probability for patent competitive intelligence
     analysis: a case study in LTE technology. Scientometrics 101, 1 (01 Oct 2014), 685–704.
[29] Chong Wang and David M. Blei. 2011. Collaborative Topic Modeling for Recommending Scientific
     Articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge
     Discovery and Data Mining (San Diego, California, USA) (KDD ’11). ACM, New York, NY, USA,
     448–456.
[30] Yi Zhang, Yue Qian, Ying Huang, Ying Guo, Guangquan Zhang, and Jie Lu. 2017. An entropy-based
     indicator system for measuring the potential of patents in technological innovation: rejecting
     moderation. Scientometrics 111, 3 (01 Jun 2017), 1925–1946.