=Paper= {{Paper |id=Vol-2658/paper7 |storemode=property |title=Document Clustering and Labeling for Research Trend Extraction and Evolution Mapping |pdfUrl=https://ceur-ws.org/Vol-2658/paper7.pdf |volume=Vol-2658 |authors=Sahand Vahidnia,Alireza Abbasi,Hussein A. Abbass |dblpUrl=https://dblp.org/rec/conf/jcdl/VahidniaAA20 }} ==Document Clustering and Labeling for Research Trend Extraction and Evolution Mapping== https://ceur-ws.org/Vol-2658/paper7.pdf
                    EEKE 2020 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents




           Document Clustering and Labeling for Research Trend
                   Extraction and Evolution Mapping
             Sahand Vahidnia                                  Alireza Abbasi                                 Hussein A. Abbass
        s.vahidnia@unsw.edu.au                            a.abbasi@unsw.edu.au                             h.abbass@unsw.edu.au
   School of Engineering and IT, UNSW              School of Engineering and IT, UNSW               School of Engineering and IT, UNSW
           Canberra, Australia                              Canberra, Australia                              Canberra, Australia
ABSTRACT                                                                      and predict scientific research dynamics and the drivers of that
In this study, a method is being proposed to extract research trends          dynamics in different forms such as the birth and death of scientific
and their temporal evolution, throughout discrete time periods. For           fields and their sub-fields [1] that can be identified by tracking
this purpose, a document embedding method is developed, adapting              the changes of research trends. A field / sub-field may go through
contextualized word embedding techniques. The method utilizes                 different stages, which consist of the birth, growth, and decline of
published academic documents as knowledge units, then clusters                scientific trends. The initial stage or the birth of a sub-field may
them into groups, each representing a series of related fields of             come from splitting and merging of other fields. Later, a field may
research. Various labeling techniques are also explored, including            attract more researchers and observe growth or can decline, as
source title popularity, author keyword popularity, term popular-             sociologists believe scientists either take a risky approach to make
ity, term importance, and Wikipedia-based automated labeling to               novel research or they prefer to stay on the safe side and stick to
evaluate the quality of clusters and explore their explainability. A          tradition [2][3]. There have been varying methods proposed and
case study is conducted on Artificial Intelligence (AI) related publi-        explored in the literature to analyze and understand the dynam-
cations, putting the method to test and observe the evolution of AI           ics of science considering the change of scientific fields and their
within the studied periods. In this study, we show that utilization           sub-fields. Topic modeling techniques such as Latent Dirichlet Allo-
of neural embeddings in conjunction with paragraph-term weights               cation (LDA) [4] and Latent Semantic Analysis (LSA) are amongst
would provide simple, yet reliable paragraph embeddings, that can             the most popular methods in the field that are used to understand
be used for clustering of the textual data. Additionally, we show             relationships among data and text documents [5], and network
that cluster centroids can be used for cluster tagging, labeling, and         analysis techniques such as co-occurrences of words, citation net-
inter-connecting for topic evolution study.                                   works are one of the most explored methods in the literature for
                                                                              revealing relationships in data. However, after recent developments
CCS CONCEPTS                                                                  in machine learning and natural language processing (NLP), new
                                                                              methods in text mining such as word and document embeddings
• Information systems → Data mining; Document topic mod-
                                                                              have facilitated analyzing the metadata or contents of publications
els; • Computing methodologies → Topic modeling.
                                                                              in different fields to understand the dynamics of those fields.
                                                                                 Understanding the dynamics of science and the ability to predict
KEYWORDS
                                                                              these dynamics and evolution of a field of science, helps us to
Dynamics of Science, Science Mapping, Text Embedding, Artificial              understand if there is something important left behind accidentally
Intelligence                                                                  or if there is a branch of science at a phase transition moving
                                                                              towards a major discovery. The ultimate objective of this research is
                                                                              to deepen our understanding of the dynamics of science and develop
1 INTRODUCTION                                                                methods and frameworks to make the historic analysis of science
                                                                              dynamics and temporal evolution possible and automated, and
Understanding and predicting future discoveries and scientific
                                                                              making predictions for future evolution possible for the scientific
achievements is an emerging field of research, which involves sci-
                                                                              community. The objectives of this research can be divided into the
entists, businesses, and even governments. This field is also known
                                                                              following two main categories:
as Science of Science (SciSci), which aims to understand, quantify
                                                                                 The objective of this study is to detect and map scientific trends.
                                                                              Revealing these trends requires us to exploit contextual features in
                                                                              the scientific research domain and understand its dynamics. In this
                                                                              study we propose a simple framework to facilitate the exploration
                                                                              of scientific trends and their evolution, utilizing contextual features
                                                                              and deep neural embeddings. Our proposed framework is then
                                                                              applied in a case study to understand the path of scientific evolution
                                                                              in artificial intelligence. In this study, we show how the trends
                                                                              and topics in science can be extracted using document vectors and
                                                                              extraction of context. The overall outline of the proposed framework
                                                                              is as illustrated in Fig.1.




      Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                                                         54
                      EEKE 2020 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents



EEKE2020, August 1, 2020., Wuhan, China                                                                Sahand Vahidnia, Alireza Abbasi, and Hussein A. Abbass




                                          Figure 1: Flowchart of the proposed framework outline.


2    LITERATURE REVIEW                                                          been used for identifying and analyzing topics and their relation-
Many previous studies in the field rely on word co-occurrences                  ships (formerly studied in [13]), in a sequential time period in this
to map the scientific fronts. A relatively early and very influential           study. SEP follows a similar path to a typical clustering algorithm,
study in co-word networks has been conducted by Van Den Besse-                  but in sequential temporal order. The study utilizes Salton’s cosine
laar et. al [6] analyzing research topics based on co-occurrences of            measure [14] to assign topics and articles. The study acknowledges
word-reference combinations. They put the structure of science into             that using word embedding techniques could improve the result,
four levels: (1) discipline (e.g., computer science); (2) research field        instead of term frequency based vectors. In another study [15], they
(e.g. AI); (3) sub-field (e.g., machine learning); and (4) research top-        continue the previous work [11], using Word2Vec [12] as embed-
ics (e.g., deep learning). Sedighi [7] has analyzed the research areas,         ding technique, coupled with a kernel k-mean clustering algorithm.
their relationships, and growth trends in the field of Informetrics             As have been shown in many other studies, word2vec and other
using word co-occurrence. Chen et al. [8] in a study utilize co-                embedding and language models can exploit more complex fea-
word analysis to reveal the structure and development of research               tures in textual data. Hence, it can exploit the desired features for
fields. For this purpose, factor analysis, cluster analysis, multivari-         clustering purposes. In the experiments conducted in this work,
ate analysis, and social network analysis, using the matrix of word             pre-trained 100-dimensional vectors have been utilized. As for clus-
co-occurrences have been performed. It used the meta-data of 2054               tering, a polynomial kernel k-means with cosine distance measure
funded projects from 2011 to 2015 and only the keywords having                  have been adapted to better cluster bibliometric features.
more than 8 repetitions are considered (6,153 keywords). Authors                   In a study of technology trend monitoring [16], a framework is
have used Matlab to get the co-occurrence matrix and other simi-                suggested to use patent data in conjunction with Twitter data. Due
larity analyses, including the co-correlation matrix, have been done            to the lag in the patent data and not capturing the whole technolog-
using UCINET and further SNA have been done using VOSviewer.                    ical advances, utilization of Twitter data is being suggested in this
Zhao et al. in a study [9] seek to find the relationships among dif-            work, which comprises many technological discussions, prior to
ferent theme ranking metrics comprised of frequency-based and                   their publication. The clustering in this study has been carried out
network-based methods. The study categorizes the metrics into                   using Lingo algorithm. The study uses Carrot2 workbench for visu-
three groups: (1) degree centrality, H-index, and coreness, (2) be-             alization of patent clusters. Then author-topic over time (ATOT)
tweenness centrality, clustering coefficient, and frequency, and (3)            model is used to analyze the tweets and obtain topic-feature words
weighted PageRank. The study suggests that recently co-word anal-               probability distribution and topic-user probability distribution. Fi-
ysis has shifted to network-based metrics and attempts to examine               nally, in a recent review study [17], different document clustering
the relationships among these metrics of term ranking. In the em-               and topic model methods are compared and evaluated. The study
pirical phase of the study, Keywords Plus from WoS data has been                confirms the advantage of advanced embedding methods in contrast
used instead of extracting keywords from the text and using author              to traditional methods like tf-idf. The study claims that methods like
keywords, as many author keywords are missing in data. These                    doc2vec [18] with tf-idf weights would outperform other methods.
keywords have been used in the co-word analysis in the afore-                   They also show that it is possible to readily use doc2vec in with
mentioned three fields, using the Pajek tool. There also have been              k-means clustering.
other studies utilizing similar techniques in other fields like Yang et
al. [10], which is a study of finding research trends about vitamin
D.                                                                              3 METHODOLOGY
    In a study of knowledge evolution detection and prediction,
Zhang et al. [11] propose a topic-based model, utilizing LDA and sci-           3.1 Data Collection
entific evolutionary pathway modeling (SEP). The study uses LDA                 The language model training data has been collected via Scopus
to profile the articles published in the Knowledge-based Systems                from 1990 to 2019, by “artificial intelligence” query search key in
journal and generates 25 topics for the 2566 articles. An interesting           titles, abstracts, and keywords fields, yielding 310k records (dataset
workaround is suggested in this study, which is to concatenate                  A). This collection method allows us to increase the variance in
the n-grams to form a uni-gram which bypasses the preference of                 training data, resulting in better generalization. In contrary, the
LDA in single words. This workaround has also been used in other                main data for the analysis has been collected from three mainstream
methods including word2vec [12]. Later, the relationships among                 journals in AI from 1970 to 2019 (dataset B): “Artificial Intelligence”
these topics have been evaluated using co-topic networks. SEP has               (2575 records), “Artificial Intelligence Review” (890 records), and




                                                                           55
                       EEKE 2020 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents



Document Clustering and Labeling for Research Trend Extraction and Evolution Mapping                                     EEKE2020, August 1, 2020., Wuhan, China


                                                                                       important keywords. Hence, n-gram keywords with a frequency
                                                                                       of at least six are kept and the rest are ignored. In addition, a
                                                                                       condition of M>2N has been maintained to keep the keywords
                                                                                       with N-grams and M characters. This eliminates the keywords
                                                                                       with characters counts lower than 2N, which usually are generic
                                                                                       words and potentially harmful for the data and the text corpus. For
                                                                                       this purpose, N-grams are sorted from higher N to lower N, then
                                                                                       replaced in the corpus with corresponding words. In this study,
                                                                                       N ∈ {1, ..., 6}, as numbers over 6 are usually either errors or very
Figure 2: Data growth from 1970 to 2019 in the three jour-                             sparse (please refer Fig. 3 for the histogram of N in n-grams). To
nals, yielding over about 4300 records.                                                eliminate any chance of mid-word overwriting, all searches are
                                                                                       done by leading and ending spaces, and to make the corpus suitable
                                                                                       for this, a leading and ending space is added to all data records. (3)
                                                                                       Data is divided into nine periods by publication year, to [1970,1989]
                                                                                       , [1990,1994], [1995,1999], [2000,2004], [2005,2007], [2008,2010],
                                                                                       [2011,2013], [2014,2016], and [2017,2019] periods. This division into
                                                                                       nine periods provides us with more uniform count of records within
                                                                                       each period, facilitating the clustering approach.


Figure 3: N-gram dictionary histogram of the dataset. Y-axis
                                                                                       3.3    Contextualized Embedding for Document
is frequency and X-axis is N.                                                                 Clustering
                                                                                       Frequency-based analyses are not the only ways to cluster docu-
                                                                                       ments for understanding their topics. Another way to get the topics
“Journal of Artificial Intelligence Research” (1006 records). The rea-                 within a set of documents is to use contextualized embeddings. As
son for excluding other journals is to limit or eliminate the bias                     the name suggests, this provides further context awareness to the
in some journals towards specific applications (e.g. health) or ap-                    approaches of uncovering topics and latent information in the text.
proaches (e.g., engineering and deep learning) in AI. As observed                      There also have already been studies to automatically categorize
in the Web of Knowledge master journal list categories, these three                    or group research trends [10] [8] [6]. Yet they rely on statistical
journals were selected to best fit the purpose of this research, hav-                  methods and/or network attributes of entities such as co-word or
ing the minimum bias to specific applications or approaches. Fig. 2                    citation networks. In this study, we are leveraging the strength
illustrates the data records throughout the period of the study data.                  of contextualized embedding techniques when categorizing docu-
                                                                                       ments. Later, we define the research trends by their corresponding
3.2     Data Pre-processing                                                            keywords. To facilitate this in labeling, authors’ keywords is uti-
After the acquisition of data, the following initial pre-processing                    lized to enhance the context and capture the mindset of authors for
steps were conducted on the datasets A and B: (1) Removal of dupli-                    research-front clustering.
cated records by their Digital Object Identifiers (DOI). (2) Removal
of records with missing abstracts. (3) Concatenating the titles and                    3.3.1 Embedding Method. Needless to say, the main ingredient of
the abstracts: using the title as the initial sentence of an abstract,                 text clustering techniques is to represent the data in vector space.
then saving them in the abstract column. (4) Lemmatization of ab-                      Word embeddings or vectors have been around for a long time.
stracts: Noun level lemmatization and skipping the other parts of                      Simple word vectors are one-hot embeddings like bag-of-words.
speech. (5) Replacement of very famous acronyms with their corre-                      However, they don’t provide much information regarding the data.
sponding terms. (6) Removal of words like ’et al.’, ’eg.’, ’ie.’, and ’fig.’,          Many methods incorporate simple statistical vectors like tf-idf [19]
which generally have tailing dots and would hamper with the sen-                       or bag-of-words. That is why models like Word2Vec (W2V) [12]
tence extraction process, without carrying meaningful information.                     were introduced. Regarding the clustering task, it has been demon-
(7) Converting all British English words to American English words                     strated in prior studies that neural embeddings outperform other
for consistency of the data. (8) Removal of punctuation, special                       embedding techniques [17]. W2V is a single hidden layer neural
characters, and numbers. (9) Sentence extraction for training the                      network and works with two different models: Common Bag of
language model (dataset A only).                                                       Words (CBOW), and skip-gram model. CBOW model tries to predict
   The secondary pre-processing stage is as follows, which is carried                  a word based on its context (surrounding words). Word embedding
out on the analysis data (dataset B) only. This data is only used in                   techniques benefit from neural networks to generate embedding
the labeling stage, which will be denoted “label data”: (1) Removal of                 vector representations of words [20]. The method of choice for
stop words, like “a” and “the” from the corpus. (2) Concatenation of                   generating vectors in this study is FastText [21], due to its richer
n_grams based on the taxonomy generated from author keywords                           embeddings. FastText is very similar to word2vec in nature, with
in the dataset A, by replacing spaces with underscores (artificial                     few more tricks. FastText also leverages a single layer neural net-
intelligence -> artificial_intelligence). This taxonomy only contains                  work, which makes it very fast and simple. A feature of FastText
the 95 percentile of keywords n-gram keywords, to cover the most                       which makes it stand out in comparison to similar methods is the




                                                                                56
                      EEKE 2020 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents



EEKE2020, August 1, 2020., Wuhan, China                                                              Sahand Vahidnia, Alireza Abbasi, and Hussein A. Abbass


utilization of sub-word features and n-grams. Character level fea-
tures have been explored in more complex methods too, but they
usually lack the speed, efficiency, and accuracy of FastText. Yet,
there exist models that can outperform FastText, such as BERT [22],
preserving features from bi-directional word orders and sub-word
information. BERT is far more complex and resource-intensive than
FastText in training and fine-tuning, providing vectors of very large
dimensions.

3.3.2 Word Vectors and Dimensions. It is preferred to utilize low
dimension size in embeddings as opposed to the original dimension
size of pretrained FastText embeddings. The curse of dimensional-             Figure 4: Dendrogram of document vector clusters in the pe-
ity is known to be a common problem when dealing with similar                 riod of 1970 to 2019.
tasks [23] [24]. Gensim library [25] has been used here to generat-
ing 50-dimensional FastText models, using the large corpus (dataset
A). It has been concluded empirically that 50 would be an optimal             3.3.4 Clustering Approach. Comparing to other well-known clus-
dimension size, dealing with clustering tasks of this scale. The word         tering methods like k-means, it was observed that the most suc-
embedding task is a fake training task, to retrieve word weight from          cessful clustering technique to fit our data and task is hierarchi-
the neural networks, which is used as the word embeddings. Thus,              cal agglomerative clustering with Ward’s method [29]. This is a
each dimension may represent a specific feature of a document or              bottom-up hierarchical clustering technique, which minimizes the
text. Due to the complexity of the dimensions and their meanings              total within-cluster variance. Hierarchical clustering can provide
when dealing with document and text clustering, no manual feature             a number of benefits and flexibilities like decision support on the
engineering is carried out. Additionally, no further dimensional-             number of clusters. The selection of cluster numbers for hierarchi-
ity reduction is used, as it is possible to select the output size of         cal clustering usually can be done via a dendrogram. Dendrogram
the neural network in FastText, rendering further utilization of              basically shows the hierarchical structure of the nodes, based on
auto-encoders and similar methods less useful. Simpler dimension-             their closeness to each other, as illustrated in Fig. 4.
ality reduction methods like PCA were also attempted, yielding in
                                                                              3.3.5 Cluster Labeling Approach. Cluster labeling has been carried
sub-optimal results. It was observed that dimensionality reduction
                                                                              out using two different methods. The initial method uses important
techniques for this task have little to no positive effect. Hence the
                                                                              words within the text. Using this method, clusters are tagged using
raw FastText embeddings are preferred in this study.
                                                                              a normalized tf-idf scoring method to extract the important words
                                                                              within each cluster by providing further discrimination to cluster
3.3.3 Document Embedding and Vectors. As we aim to cluster doc-               term content, from count vectorization of terms with less than 0.8
uments based on their scientific representation, author keywords              presence in documents. The following equation shows the scoring
are ignored and the embedding is based solely on document titles              technique for cluster tags.
and abstracts. For this task, document vectors are required to be
calculated. There have been a number of studies to calculate docu-
                                                                                                    score(t, c) = t f (t, c) ∗ ic f (t)           (3)
ment vectors and document clustering, including [26] [27] [28]. An
                                                                                                                      (1 + n)
intuitive method is to average the word representations to acquire                                   ic f (t) = loд               +1              (4)
document vectors. However, it won’t provide stable results. Arora                                                   (1 + c f (t))
et al [28] provide a baseline method, which we have adopted in                   Where t is a term, c is the corresponding cluster, c f is the fre-
this study for document embedding. The method is called “Smooth               quency of clusters with term t, and n is the number of clusters.
Inverse Frequency” (SIF). SIF is basically a weighted averaging                  These top term tags can help us identify the topic and subject
method, based on probability and inverse frequency for words in               area of each cluster and its cover. These tags are used to extract the
documents and is claimed to have 5 to 13% improvements, thus is               important terms within each cluster and can be used to roughly
adapted in this study. The SIF adaptation in this study is illustrated        estimate the overall cluster label. In other words, these terms sum-
at the following equation, where wv(t) is calculated for each term            marize the context of each cluster in a couple of keywords. However,
for all strings, and then is divided by the number of terms in the            this can only loosely define the labels and fields, without any formal
corresponding string. Here v(t) represents each term vector, and              definition, which renders it less useful, unless used in conjunction
p(t) is the probability of seeing that term. Regarding the α, the             with expert opinion. To address this problem, another method is
constant value of 1e − 3 is used.                                             developed to label clusters, utilizing the definitions within “Outline
                                                                              of artificial intelligence” in Wikipedia. Hence, all pages from both
                                                                              “Applications” and “Approaches” of AI in this outline are parsed and
                                         α                                    vectorized, yielding a single 50-dimensional vector for each “Appli-
                             w(t) =                               (1)
                                      α + p(t)                                cation” and each “Approach”. The vectorization steps are as follows:
                                                                              First, each page is parsed, all unnecessary data noise, including
                                                                              references and titles, are cleaned. Then each page is turned into a
                           wv(t) = w(t) ∗ v(t)                    (2)         corpus of sentences and each sentence is embedded individually




                                                                         57
                        EEKE 2020 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents



Document Clustering and Labeling for Research Trend Extraction and Evolution Mapping                                          EEKE2020, August 1, 2020., Wuhan, China


                                                                                           sensible topic clusters. For instance, the domination of machine
                                                                                           learning (ML) is very natural among AI subjects, which is also the
                                                                                           case in the results. In another instance, it is obvious from Table
                                                                                           5 and Table 6 that “Decision support system” peaks during this
                                                                                           period, where it was non-existing in the prior periods and also
                                                                                           missing in the next period. To validate this claim, the trends of the
                                                                                           science can qualitatively be compared during the corresponding
                                                                                           periods by searching for decision support system*” and “artificial
                                                                                           intelligence” in Scopus (See Fig.5). The appearance of “Automatic
                                                                                           target recognition” in conjunction with “Computer vision” from
Figure 5: Query result for studies of decision support system
                                                                                           the period of 1995-1999 also aligns with the real world trends of
and AI trend. Y-axis is Frequency.
                                                                                           science and the breakthroughs. This pattern of the fields, being
                                                                                           sorted next to each other is also interesting and provides further
using the SIF method and parameters, introduced at 3.3.3. Finally,                         assistance to the interpretation task. Overall, utilizing the proposed
to obtain the page vector, the mean of all sentence vectors are cal-                       method demonstrates the simplicity of the interpretation of topics.
culated, providing us with the centroid of the Wikipedia document,                         It is perceived from the results that Wikipedia applications are the
or in other words, the location of approaches and applications of AI                       most helpful tags, in contrast to Wikipedia approaches, which only
in our vector space. To utilize these vectors for obtaining Wikipedia                      provide minor assistance in understanding the cluster concept.
approaches and applications of AI, the centroid of each cluster is                             This also applies to the top words and the yielding analysis, as
compared to each of the approaches and applications of AI, using                           they are also perceived very helpful in the identification of clusters.
their cosine similarities, as defined in equation 5. Cosine similar-                       As explained at 3.3.5, top terms are very helpful in understanding
ity has been used for NLP tasks [30] [11] as it is known to be the                         the cluster. They provide support in deciding the correct cluster
best measure fitting this task. The top two closest approaches and                         tags. Hence, they require expert opinion to form the final cluster
applications are then selected as labels for each cluster.                                 label, similar to some prior studies [11]. The alignment between the
                                                                                           Wikipedia applications and the top words is visible in many periods.
                                              a.b
                            sim(a, b) =                                        (5)         Yet, it should be acknowledged that some experts in the field may
                                           ||a||.||b||                                     disagree on the meaning of the top words and may interpret them
3.3.6 Research Trend Mapping. The final stage of the proposed                              differently in comparison to the Wikipedia topics. Therefore, this
framework comprises the mapping of the evolution of scientific                             has been left as it is and expert opinion is not used for this part
trends. To accomplish this, all the inter-period cluster centroids                         of the analysis. To facilitate the interpretation of the top words,
are compared and the most similar neighboring periods (up to                               they’ve been recorded with the corresponding tf-idf score of the
two periods further) are connected based on a constant threshold                           term in the cluster of interest. This facilitates the identification of
value, which is an empirical threshold value, just over the intra-                         clusters and provides more weights to understand the importance
period similarities. To illustrate the connections among each cluster                      of a term for labeling each cluster. Only sample results are provided
throughout the periods of time, the Sankey diagram is used in this                         in Table 1 due to page constraints.
study.                                                                                         The evolution mapping of the fields with two different labeling
                                                                                           approaches is presented in Fig.6 and Fig.7. This mapping connects
4 RESULTS                                                                                  the topics extracted throughout all periods. This diagram connects
                                                                                           two or three consecutive clusters based on the similarity.
4.1 Document Clustering and Mapping
The method was implemented on 50-dimensional vectors and we
noticed that the results from the SIF influenced method lived up to
the expectation by providing us with more separable clusters com-
pared to unweighted averaging. This was most noticeable during                             5   CONCLUSION
the cluster number estimation, as the clusters of weighted averaged                        In this study, we proposed and implemented a framework to ex-
documents were further apart.                                                              tract scientific trends and visualize their evolution in discrete time
   The Wikipedia based labels, which are basically the estimations                         periods. The study shows that this framework and labeling method
of Wikipedia AI approaches and applications, based on the similar-                         facilitates the identification of trends and assist us in understand-
ity of cluster centroids to document vectors, are illustrated at the                       ing the way fields of research are evolving. This became possible
sample result Tables 2 to 7. 1 .                                                           through the top term and Wikipedia application labeling methods.
   Referring to the aforementioned tables, the overall theme of the                        We also show that Wikipedia documents can be used to have an
topics in the three AI journals can be perceived. The results from                         estimated embedding location of a field of research or an applica-
Wikipedia Approach Estimation, Wikipedia Application Estima-                               tion in vector space. Yet, Wikipedia approaches are not as useful
tion, and top terms would align well in many cases and it creates                          as Wikipedia application for this case study and purpose. In future
1 Abbreviations: PR: pattern recognition, NLP: natural language processing, ML: ma-
                                                                                           works, more advanced clustering methods are planned to be used
chine learning, DSS: decision support system, ES: expert system, KM: knowledge             as an extension to this work, benefiting from deep neural networks
management, CV: computer vision, auto.: automated                                          in clustering and dynamic embedding and clustering techniques.




                                                                                      58
                           EEKE 2020 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents



EEKE2020, August 1, 2020., Wuhan, China                                                                                          Sahand Vahidnia, Alireza Abbasi, and Hussein A. Abbass


REFERENCES                                                                                           [27] A. M. Dai, C. Olah, and Q. V. Le, “Document embedding with paragraph vectors,”
 [1] A. Zeng, Z. Shen, J. Zhou, J. Wu, Y. Fan, Y. Wang, and H. E. Stanley, “The science of                arXiv preprint arXiv:1507.07998, 2015.
     science: From the perspective of complex systems,” Physics Reports, vol. 714-715,               [28] S. Arora, Y. Liang, and T. Ma, “A simple but tough-to-beat baseline for sentence
     pp. 1–73, 2017.                                                                                      embeddings,” 2017.
 [2] J. G. Foster, A. Rzhetsky, and J. A. Evans, “Tradition and innovation in scientist-             [29] J. H. Ward Jr, “Hierarchical grouping to optimize an objective function,” Journal
     sâĂŹ research strategies,” American Sociological Review, vol. 80, no. 5, pp. 875–908,                of the American statistical association, vol. 58, no. 301, pp. 236–244, 1963.
     2015.                                                                                           [30] V. Saquicela, F. Baculima, G. Orellana, N. Piedra, M. Orellana, and M. Espinoza,
 [3] S. Fortunato, C. T. Bergstrom, K. Börner, J. A. Evans, D. Helbing, S. Milojević, A. M.               “Similarity detection among academic contents through semantic technologies
     Petersen, F. Radicchi, R. Sinatra, B. Uzzi, A. Vespignani, L. Waltman, D. Wang,                      and text mining.,” in IWSW, pp. 1–12, 2018.
     and A. L. Barabási, “Science of science,” Science, vol. 359, no. 6379, 2018.
 [4] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of
     machine Learning research, vol. 3, no. Jan, pp. 993–1022, 2003.
 [5] H. Jelodar, Y. Wang, C. Yuan, X. Feng, X. Jiang, Y. Li, and L. Zhao, “Latent dirichlet
     allocation (lda) and topic modeling: models, applications, a survey,” Multimedia
     Tools and Applications, vol. 78, no. 11, pp. 15169–15211, 2019.
 [6] P. Van den Besselaar and G. Heimeriks, “Mapping research topics using word-
     reference co-occurrences: A method and an exploratory case study,” Scientomet-
     rics, vol. 68, no. 3, pp. 377–393, 2006.
 [7] M. Sedighi, “Application of word co-occurrence analysis method in mapping of
     the scientific fields (case study: the field of informetrics),” Library Review, vol. 65,
     no. 1/2, pp. 52–64, 2016.
 [8] X. Chen, J. Chen, D. Wu, Y. Xie, and J. Li, “Mapping the research trends by
     co-word analysis based on keywords from funded project,” Procedia Computer
     Science, vol. 91, pp. 547–555, 2016.
 [9] W. Zhao, J. Mao, and K. Lu, “Ranking themes on co-word networks: Exploring
     the relationships among different metrics,” Information Processing & Management,
     vol. 54, no. 2, pp. 203–218, 2018.
[10] A. Yang, Q. Lv, F. Chen, D. Wang, Y. Liu, and W. Shi, “Identification of recent
     trends in research on vitamin d: A quantitative and co-word analysis,” Medical
     science monitor: international medical journal of experimental and clinical research,
     vol. 25, p. 643, 2019.
[11] Y. Zhang, H. Chen, J. Lu, and G. Zhang, “Detecting and predicting the topic
     change of Knowledge-based Systems: A topic-based bibliometric analysis from
     1991 to 2016,” Knowledge-Based Systems, vol. 133, pp. 255–268, 2017.
[12] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed rep-
     resentations of words and phrases and their compositionality,” in Advances in
     neural information processing systems, pp. 3111–3119, 2013.
[13] Y. Zhang, G. Zhang, D. Zhu, and J. Lu, “Scientific evolutionary pathways: Identi-
     fying and visualizing relationships for scientific topics,” Journal of the Association
     for Information Science and Technology, vol. 68, pp. 1925–1939, aug 2017.
[14] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. New
     York, NY, USA: McGraw-Hill, Inc., 1986.
[15] Y. Zhang, J. Lu, F. Liu, Q. Liu, A. Porter, H. Chen, and G. Zhang, “Does deep
     learning help topic extraction? a kernel k-means clustering method with word
     embedding,” Journal of Informetrics, vol. 12, no. 4, pp. 1099–1117, 2018.
[16] X. Li, Q. Xie, J. Jiang, Y. Zhou, and L. Huang, “Identifying and monitoring the de-
     velopment trends of emerging technologies using patent analysis and twitter data
     mining: The case of perovskite solar cell technology,” Technological Forecasting
     and Social Change, vol. 146, pp. 687–705, 2019.
[17] S. A. Curiskis, B. Drake, T. R. Osborn, and P. J. Kennedy, “An evaluation of
     document clustering and topic modelling in two online social networks: Twitter
     and reddit,” Information Processing & Management, vol. 57, no. 2, p. 102034, 2020.
[18] Q. Le and T. Mikolov, “Distributed representations of sentences and documents,”
     in International conference on machine learning, pp. 1188–1196, 2014.
[19] K. S. Jones, “A statistical interpretation of term specificity and its application in
     retrieval,” Journal of Documentation, vol. 28, pp. 11–21, 1972.
[20] J. Kim, J. Yoon, E. Park, and S. Choi, “Patent document clustering with deep
     embeddings,” Scientometrics, pp. 1–15, 2020.
[21] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for efficient text
     classification,” arXiv preprint arXiv:1607.01759, 2016.
[22] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of
     deep bidirectional transformers for language understanding,” arXiv preprint
     arXiv:1810.04805, 2018.
[23] M. Steinbach, L. Ertöz, and V. Kumar, “The challenges of clustering high di-
     mensional data,” in New directions in statistical physics, pp. 273–309, Springer,
     2004.
[24] H.-P. Kriegel, P. Kröger, and A. Zimek, “Clustering high-dimensional data: A sur-
     vey on subspace clustering, pattern-based clustering, and correlation clustering,”
     ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 3, no. 1, p. 1,
     2009.
[25] R. Řehůřek and P. Sojka, “Software Framework for Topic Modelling with Large
     Corpora,” in Proceedings of the LREC 2010 Workshop on New Challenges for NLP
     Frameworks, (Valletta, Malta), pp. 45–50, ELRA, May 2010.
[26] J. Park, C. Park, J. Kim, M. Cho, and S. Park, “Adc: Advanced document clustering
     using contextualized representations,” Expert Systems with Applications, vol. 137,
     pp. 157–166, 2019.




                                                                                                59
                       EEKE 2020 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents



Document Clustering and Labeling for Research Trend Extraction and Evolution Mapping                                  EEKE2020, August 1, 2020., Wuhan, China


                                                Table 1: Top 7 term tags for the period of 2017-2019.

 Tag ( TF-IDF score)
 * cluster (0.226), clustering (0.194), ba (0.156), twsvm (0.148), support vector machine (0.147), neural network (0.119), si (0.117)
 * queen (0.537), kemeny (0.224), top (0.173), bound (0.158), borda (0.153), mining (0.15), item (0.148)
 * logic (0.369), semantics (0.218), answer set (0.203), formula (0.179), cp net (0.177), revision (0.152), asp (0.151)
 * market (0.257), sale (0.226), firm (0.226), car (0.164), customer (0.157), kidney (0.157), bike (0.157)
 * knee (0.319), face recognition (0.253), acl (0.209), gait (0.198), gait pattern (0.176), facial (0.176), survey (0.172)
 * planning (0.272), heuristic (0.237), plan (0.201), abstraction (0.181), search (0.177), planner (0.16), monte carlo tree search (0.13)
 * sentiment analysis (0.268), survey (0.245), text (0.179), metadata (0.154), area (0.14), indian language (0.133), citation (0.124)
 * word (0.271), entity (0.211), sentiment (0.176), vietnamese (0.135), sentiment analysis (0.13), semantic (0.124), target (0.122)
 * voting (0.233), voter (0.218), cost (0.16), mirl (0.15), player (0.142), good (0.141), preference (0.139)
 * inconsistency (0.231), semantics (0.156), attack (0.153), belief (0.153), argument (0.143), graph (0.139), argumentation framework (0.136)
 robot (0.401), team (0.217), trust (0.17), teammate (0.139), belief (0.121), revision (0.12), norm (0.112)




Figure 6: Sankey diagram of the research clusters with top words as labels. The initial digits are referring to the ending year
of period.

                      Table 2: Wikipedia based approaches and applications (topics) for the period of 2000-2004.

 Wiki Approach Est.                                                                  Wiki Application Est.
 Probability & Chaos theory                                                          Nonlinear control & auto. planning and scheduling
 ML & Behavior based AI                                                              Intelligent agent & ML
 ML & Fuzzy systems                                                                  ML & auto. planning and scheduling
 Probability & ML                                                                    auto. planning and scheduling & auto. reasoning
 ML & Behavior based AI                                                              Computer audition & NLP
 Fuzzy systems & ML                                                                  CV and subfields & Automatic target recognition
 Early cybernetics and brain simulation & Behavior based AI                          Computational creativity & auto. reasoning




                                                                                60
                      EEKE 2020 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents



EEKE2020, August 1, 2020., Wuhan, China                                                           Sahand Vahidnia, Alireza Abbasi, and Hussein A. Abbass




Figure 7: Sankey diagram with Wikipedia applications as labels. The initial digits are referring to the ending year of period.

                      Table 3: Wikipedia based approaches and applications (topics) for the period of 2005-2007.

 Wiki Approach Est.                                                       Wiki Application Est.
 Probability & Chaos theory                                               auto. planning and scheduling & Nonlinear control
 ML & Fuzzy systems                                                       auto. planning and scheduling & ML
 ML & Behavior based AI                                                   Computer audition & NLP
 ML & Behavior based AI                                                   auto. reasoning & ES
 ML & Probability                                                         auto. planning and scheduling & ML
 Behavior based AI & ML                                                   Bio-inspired computing & CV
 ML & Behavior based AI                                                   Intelligent agent & ES
 ML & Evolutionary computation                                            PR & ML

                      Table 4: Wikipedia based approaches and applications (topics) for the period of 2008-2010.

 Wiki Approach Est.                                                       Wiki Application Est.
 ML & Fuzzy systems                                                       ML & PR
 Behavior based AI & ML                                                   KM & Intelligent agent
 Probability & Chaos theory                                               Nonlinear control & auto. planning and scheduling
 Fuzzy systems & Chaos theory                                             auto. planning and scheduling & Intelligent agent
 ML & Probability                                                         ML & Intelligent agent
 Fuzzy systems & ML                                                       Nonlinear control & Automatic target recognition
 Chaos theory & Probability                                               auto. planning and scheduling & Nonlinear control
 Evolutionary computation & Early cybernetics and brain simulation        Hybrid intelligent system & Bio-inspired computing
 ML & Fuzzy systems                                                       auto. planning and scheduling & ML




                                                                     61
                       EEKE 2020 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents



Document Clustering and Labeling for Research Trend Extraction and Evolution Mapping                                  EEKE2020, August 1, 2020., Wuhan, China


                      Table 5: Wikipedia based approaches and applications (topics) for the period of 2011-2013.

 Wiki Approach Est.                                                                  Wiki Application Est.
 ML & Behavior based AI                                                              Intelligent agent & ML
 Chaos theory & Probability                                                          auto. planning and scheduling & Nonlinear control
 Early cybernetics and brain simulation & Behavior based AI                          KM & Decision support system
 Probability & ML                                                                    auto. planning and scheduling & auto. reasoning
 Chaos theory & Probability                                                          auto. planning and scheduling & AI in video games
 ML & Behavior based AI                                                              NLP & ML
 ML & Fuzzy systems                                                                  auto. planning and scheduling & ML
 ML & Fuzzy systems                                                                  PR & Intelligent control
 Probability & Chaos theory                                                          Nonlinear control & auto. planning and scheduling
 ML & Chaos theory                                                                   PR & ML
 Evolutionary computation & ML                                                       Nonlinear control & PR

                      Table 6: Wikipedia based approaches and applications (topics) for the period of 2014-2016.

 Wiki Approach Est.                                                                  Wiki Application Est.
 Behavior based AI & Early cybernetics and brain simulation                          Bio-inspired computing & Decision support system
 Probability & Chaos theory                                                          auto. planning and scheduling & Nonlinear control
 Evolutionary computation & AI                                                       AI & PR
 AI & Chaos theory                                                                   CV and subfields & Automatic target recognition
 Fuzzy systems & Chaos theory                                                        auto. planning and scheduling & Nonlinear control
 AI & Behavior based AI                                                              NLP & AI
 AI & Probability                                                                    auto. planning and scheduling & auto. reasoning
 AI & Probability                                                                    Intelligent agent & Strategic planning
 Probability & Chaos theory                                                          Nonlinear control & auto. planning and scheduling
 AI & Fuzzy systems                                                                  PR & Nonlinear control
 AI & Fuzzy systems                                                                  auto. planning and scheduling & AI

                      Table 7: Wikipedia based approaches and applications (topics) for the period of 2017-2019.

 Wiki Approach Est.                                                                  Wiki Application Est.
 ML & Fuzzy systems                                                                  ML & PR
 Probability & Chaos theory                                                          PR & Nonlinear control
 Probability & Chaos theory                                                          auto. planning and scheduling & auto. reasoning
 Fuzzy systems & Behavior based AI                                                   Automation & Vehicle infrastructure integration
 Behavior based AI & ML                                                              CV & Computer audition
 ML & Fuzzy systems                                                                  auto. planning and scheduling & ML
 Early cybernetics and brain simulation & ML                                         DSS& KM
 ML & Behavior based AI                                                              ML & Computer audition
 Probability & Chaos theory                                                          auto. planning and scheduling & AI in video games
 ML & Probability                                                                    ML & Intelligent agent
 ML & Behavior based AI                                                              Intelligent agent & ES




                                                                                62