CEUR Workshop Proceedings, Vol-2831, paper 2: https://ceur-ws.org/Vol-2831/paper2.pdf
 Unsupervised Key-phrase Extraction and Clustering for Classification Scheme in
                           Scientific Publications
                                                  Xiajing Li,1,2 Marios Daoutis1
                                                           1 Ericsson AB
                                      2 Department of Linguistics and Philology, Uppsala University
                                           xiajing.li@ericsson.com, marios.daoutis@ericsson.com


                            Abstract

A Systematic Review of a research domain provides a way to understand and structure the state of the art of a particular research area. Extensive reading and intensive filtering of large volumes of publications are required during that process, which is performed almost exclusively by human experts. Automating sub-tasks of the well-defined Systematic Mapping (SM) and Systematic Review (SR) methodologies is not well explored in the literature, despite recent advances in natural language processing techniques. Typical challenges revolve around the inherent gaps in the semantic understanding of text and the lack of domain knowledge necessary to fill that gap. In this paper, we investigate possible ways of automating common sub-tasks of the SM/SR process, i.e., extracting keywords and key-phrases from scientific documents using unsupervised methods, which are then used as a basis to construct the so-called classification scheme using semantic clustering techniques. Specifically, we explore the effect of ensemble scores in key-phrase extraction and of semantic network-based word embeddings, as well as how clustering can be used to group related key-phrases. We conducted an evaluation on a dataset of publications in the domain of “Explainable AI”, which we constructed from standard, publicly available digital libraries and sets of indexing terms (keywords). Results show that the ensemble ranking score does improve key-phrase extraction performance. Semantic network-based word embeddings (ConceptNet) perform similarly to contextualized word embeddings, while being more efficient. Finally, semantic term clustering can group similar terms, which makes it suitable for classification schemes.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                      1    Introduction

Systematic Mapping (SM) and Systematic Review (SR) studies are standard methods for capturing the state of the art of a particular research field in a structured and organised way, while at the same time providing significant insights and knowledge about that research area (Petersen et al. 2008). Traditionally, these methods are performed manually by human experts and researchers. With a growing number of publications in recent years, as well as the expansion of the literature into novel areas, the systematic mapping of such volumes of scientific documents becomes quite challenging and time-consuming (Carver et al. 2013).

In the classical systematic mapping procedure, keyword extraction and the classification scheme are two essential steps that help in classifying papers from different perspectives, while producing a group of categories from, typically, manual keywording and grouping of the descriptive terms. First, terms extracted by intensively reading papers should be common with regard to each source document as well as the research domain. Existing keyword and key-phrase extraction systems are usually independent of downstream tasks and document types. For document types such as web pages and social media posts, short and concise keywords are required, while multi-word expressions (key-phrases) are more common in scientific publications.

In this work we explore methods that can leverage the identified and automatically extracted keywords to produce a classification scheme for the research domain of interest. Furthermore, we evaluate methods suitable for extracting representative (as an attribute of each document) and highly relevant (to a target research domain) keywords drawn from the summary (abstract) of scientific publications. We are interested in obtaining keywords and key-phrases that are precise yet informative as domain concepts or terminologies.

Hence, we attempt to address whether automated key-phrase extraction methods and term clustering techniques can adequately extract and identify useful information, comparable to how these steps are performed in the context of SM & SR. More specifically, we explore the effect of ensemble score measures in key-phrase extraction (Q1), the effect of semantic network-based word embedding techniques on the embedding representation of phrase semantics (Q2), and the effect of clustering for grouping semantically related key-phrases (Q3). Our code and data will be publicly available at: https://github.com/xiajing10/akec.

                      2    Related Work

With an increasing number of research publications, especially in artificial intelligence, the manual procedures underlying current systematic mapping are time-consuming. The survey by Carver et al. discusses the barriers of manual work in the systematic literature review process, especially in the context of paper selection and data extraction (Carver et al.
2013). Recent text-mining algorithms and NLP techniques can become particularly useful for automating (parts of) this manual work within the systematic mapping studies procedure. Several studies have investigated techniques to automate one or more sub-steps, such as paper selection (Marshall and Wallace 2019). However, we find that very few related works focus on automating the keywording and categorization steps, which presume background knowledge from domain experts. Extracted keywords have to encode salient (essential and relevant) text features while remaining human-readable (as concepts). Then, when grouping sets of keywords into different categories, human experts have an inherent ability to understand the definition, background knowledge, and semantic relatedness of keywords.

Keyword extraction generates highly representative and relevant information from unstructured text, used as features in many downstream tasks, such as summarization, clustering, knowledge graph generation, and taxonomies. Unsupervised systems typically apply scoring and ranking methods to candidate words. TF-IDF is a simple but effective scoring mechanism. Graph-based methods (e.g., TextRank (Mihalcea and Tarau 2004)) rank the importance of words based on a word co-occurrence graph, and have shown their effectiveness independently of domain and language. Semantic information of words is rarely used in early methods, as it is usually difficult to measure. Word embedding techniques provide a means to measure such semantic similarity: the semantic similarity between each candidate and its source document can be calculated as the cosine similarity of their embedding representations. Papagiannopoulou and Tsoumakas utilize averaged GloVe word embeddings as phrase vectors and a “theme vector” (Papagiannopoulou and Tsoumakas 2020). Bennani-Smires et al. apply Doc2Vec and Sent2Vec for document and phrase representation (Bennani-Smires et al. 2018). Sun et al. combined various contextualized word embedding methods with the SIF-weighted sentence embedding model (Sun et al. 2020). In this paper, we further explore the performance of semantic network-based word embeddings, building on the work of SIFRank.

A pre-existing classification scheme typically does not fit more than one particular research domain. Updating or generating a new classification scheme from the selected papers is widely applied in most cases, with help from text-mining techniques. Terko, Žunić, and Donko conducted conference paper classification using traditional machine learning methods, with labels generated from topic modeling (Terko, Žunić, and Donko 2019). Kim and Gil applied k-means as an unsupervised clustering method for creating the classification scheme at document level, extracting features from topic models, abstracts and author-given keywords, followed by TF-IDF vectorization and document clustering (Kim and Gil 2019). Different from categories in systematic mapping studies, document clustering is single-faceted: each article is assigned to only one category. Osborne et al. proposed a semi-supervised system for mapping studies, which starts with ontology learning over large scholarly datasets, then refines the ontology with the help of domain experts, and finally uses knowledge bases to select and classify the primary studies automatically (Osborne et al. 2019). Their classification scheme is generated by selecting several ontologies from author-given keywords as categories and identifying equivalent ontologies (based on relations learned during ontology learning) as they appear in abstracts, keywords, and titles. Their approach shows higher precision compared to TF-IDF. However, it relies on an extensive, extracted database of ontologies of author-given keywords, which are sometimes missing from the metadata. Unlike the methods discussed above, our method is inspired by taxonomy generation via term clustering, which focuses on grouping words/terms by similarity based on their representations. Using a taxonomy as a classification scheme is more suitable in immature or evolving domains than classification with fixed classes (Usman et al. 2017). Liu et al. construct a taxonomy from keywords using hierarchical clustering (Liu et al. 2012). Zhang et al. generate a taxonomy using spherical k-means to cluster terms extracted from a large-scale set of publications in the domain of computer science, with word embeddings learned from the text (Zhang et al. 2018). Considering that a large corpus is not always obtainable, we first apply keyword extraction to extract terms.

                      3    Methodology

Our automation method follows the pipeline of classification scheme generation (Franzago et al. 2016). It is composed of two modules: (1) key-phrase extraction from titles and abstracts; (2) term clustering to identify key-phrase categories. Our system’s overall framework is shown in Fig. 1, leveraging a semantic similarity measure and external knowledge from pre-trained word embeddings.

Figure 1: Framework of proposed automation method.
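The two-module pipeline of Fig. 1 can be summarized in a short sketch. Here `extract_key_phrases` and `cluster_terms` are hypothetical placeholders for the components detailed in Sections 3.1 and 3.2, not the paper's actual code:

```python
def build_classification_scheme(papers, extract_key_phrases, cluster_terms):
    """papers: list of dicts with 'title' and 'abstract' fields.
    Module 1 extracts key-phrases from title+abstract; module 2
    clusters the pooled key-phrases into candidate categories."""
    key_phrases = set()
    for paper in papers:
        text = paper["title"] + ". " + paper["abstract"]
        key_phrases.update(extract_key_phrases(text))
    # Each cluster of semantically related key-phrases becomes a category.
    return cluster_terms(sorted(key_phrases))
```

Any extractor and clusterer with these signatures can be plugged in, which is what allows the two modules to be evaluated separately later in the paper.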
3.1   Key-phrase Extraction

The key-phrase extraction module is built on SIFRank (Sun et al. 2020), a state-of-the-art embedding-based method whose pipeline consists of (1) candidate selection by noun phrase chunking and (2) candidate ranking by candidate-document cosine similarity. We use the SIFRank score to measure document relevance, together with two other scoring functions measuring domain relevance and phrase quality. The three scores are combined for candidate key-phrase ranking.

Document Relevance Score  A keyword of a single document should have a strong connection with that document. Semantic distance with word embeddings rests on the principle that the closer a candidate vector is to the document vector, the closer their meanings are. The effectiveness of this semantic distance measure has been evaluated previously on benchmark datasets (Bennani-Smires et al. 2018). SIFRank (Sun et al. 2020) reaches state-of-the-art performance in key-phrase extraction for short documents, utilizing the auto-regressive pre-trained language model ELMo to produce word embeddings and SIF (Smooth Inverse Frequency) (Arora, Liang, and Ma 2017) to generate unsupervised sentence embeddings. In scientific publications, representative key-phrases frequently appear in titles. Each candidate's final document relevance score is therefore the original score weighted according to whether the candidate appears in the title, with the weight defined by the token length of the candidate phrase.

Domain Relevance Score  Finding domain-specific terms has been a challenge for novel domains with fewer related resources (publications). Terms with high frequency in a domain-specific corpus and low frequency in other domains can be considered domain-specific terms. In contrast, without a domain-specific corpus, dictionary-based validation can help to find representative terms. Structured semantic resources (e.g., WordNet) can help in utilizing semantic relations, such as groups of synonyms or topic-based clusters, under the assumption that related terms are more likely to be critical than isolated ones (Firoozeh et al. 2020). In the general systematic mapping studies process, a glossary dictionary and domain seed key-phrases are provided with the help of human experts. Here we collect our domain glossary terms from open knowledge graph databases: (1) the artificial intelligence knowledge graph (Dessì et al. 2020), using terms with direct link connections to “artificial intelligence”; (2) the machine learning taxonomy from Aminer (Tang 2016). The semantic similarity between candidates and glossary terms is calculated for relevance scoring. The detailed steps are:

  Step 1. Candidate key-phrases and domain glossaries are transformed by the pre-trained word embedding.
  Step 2. For each candidate phrase, the cosine similarity between itself and each domain glossary term is calculated.
  Step 3. The domain relevance score of a candidate phrase is the average of the top N (50% in our experiment) highest similarity scores.

Phrase Quality Score  In scientific documents, high-quality phrases are usually multi-word expressions, or uni-grams when they are acronyms, representing common or newly defined scientific concepts. Our method therefore defines the quality score of a term according to a length penalty, point-wise mutual information (PMI), a left-right information entropy strategy, and acronym information. The length penalty aims to reduce the score of uni-grams and long phrases. Based on our analysis of the scientific documents dataset, the majority of gold key-phrases are bi-grams and tri-grams. Hence, we add a length penalty to any multi-word expression t that contains more than three words as length_score(t) = −0.5 · |length(t) − 3|. However, acronyms are extensively used as a shorter format (mostly uni-grams) of long scientific terms. Since an acronym usually refers to a specific terminology or scientific concept in the document, it is a good indicator of whether the term is important. Therefore, the length penalty does not apply to uni-grams that are identified as acronyms. The well-known PMI and entropy strategies are used to extract multi-word expressions that co-occur frequently and carry a collective meaning. Generally, a high PMI score indicates a high probability of co-occurrence. For expressions that contain more than two words, we calculate the minimum PMI score among all two-segment splits of the expression. For example, the score of “explainable machine learning” is the minimum of PMI(x=“explainable machine”, y=“learning”) and PMI(x=“explainable”, y=“machine learning”). Left-right information entropy (Eq. 1) reflects the variety of word contexts of a candidate phrase: adjacent words will be widely distributed if the string (candidate phrase) is meaningful, and localized if the string is a sub-string of a meaningful string (Shimohata, Sugio, and Nagata 1997).

    H(t) = − Σ_{w_i ∈ w_l} p(w_i | t) log2 p(w_i | t)        (1)

where w_l represents the list of words adjacent to candidate phrase t. The entropy of both the left and the right side of phrase t is calculated, and the lower of the two is selected as the final information entropy score.

In detail, the quality score of a candidate term t is the sum of the PMI-entropy score and the length penalty. To weaken PMI's bias towards low-frequency words, we filter out candidate terms with a low PMI score (threshold at PMI = 2 in our experiments) and use the normalized entropy score of the remaining candidate terms as the PMI-entropy score.

3.2   Key-phrase Clustering

Clustering aims to identify distinct groups in a dataset and assign a group label to each data point. This module focuses on clustering key-phrases based on their semantic similarity (the cosine similarity of their embedding representations).
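The phrase-quality components described in Section 3.1 (the length penalty, the left-right entropy of Eq. 1, and the minimum split PMI) can be sketched as follows. This is a minimal illustration; the toy probability table passed to `min_segment_pmi` is an assumption for demonstration, not the paper's estimator:

```python
import math
from collections import Counter

def length_penalty(tokens, is_acronym=False):
    """Length penalty: -0.5 * |len(t) - 3| for expressions longer than
    three words; acronym uni-grams are exempt, per the rules above."""
    if is_acronym or len(tokens) <= 3:
        return 0.0
    return -0.5 * abs(len(tokens) - 3)

def branching_entropy(adjacent_words):
    """Eq. 1: H(t) = -sum_{w in w_l} p(w|t) log2 p(w|t), computed over
    the words observed adjacent to the phrase on one side."""
    counts = Counter(adjacent_words)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def left_right_entropy(left_words, right_words):
    # The lower of the two sides is the final information entropy score.
    return min(branching_entropy(left_words), branching_entropy(right_words))

def min_segment_pmi(tokens, prob):
    """Minimum PMI over all binary splits of a multi-word expression,
    e.g. min(PMI('explainable machine', 'learning'),
             PMI('explainable', 'machine learning')).
    `prob` maps a phrase string to its (toy) corpus probability."""
    whole = " ".join(tokens)
    return min(
        math.log2(prob[whole] / (prob[" ".join(tokens[:i])] * prob[" ".join(tokens[i:])]))
        for i in range(1, len(tokens))
    )
```

A phrase whose adjacent words are always the same gets entropy 0 (it is likely a sub-string of a longer term), while varied contexts yield high entropy, matching the intuition behind Eq. 1.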
For this module we tested two clustering algorithms: spherical k-means and hierarchical agglomerative clustering. As a bottom-up method, agglomerative clustering starts with each data point as an individual cluster and then merges sub-clusters into super-clusters based on a distance threshold. Spherical k-means is k-means on a unit hypersphere, where (1) all vectors are normalized to unit length and (2) the objective function minimizes the cosine distance between vectors. Studies have found cosine similarity effective in quantifying semantic similarities between high-dimensional data such as word embeddings, as the direction of a vector is more important than its magnitude (Strehl et al. 2000). Compared to standard k-means, spherical k-means matches the distinct nature of the cosine similarity measure in the high-dimensional word embedding space. Zhang et al. illustrate that, when using spherical k-means for topic detection, the center direction acts as a semantic focus on the unit sphere, and the member terms of that topic fall around the center direction to represent a coherent semantic meaning (Zhang et al. 2018).

                  4    Experimental Evaluation

This section presents the experimental evaluation setup for our proposed automation approach. We aim at answering the following questions:

• Q1: Can our ensemble scoring measure improve performance in domain-specific key-phrase extraction?
• Q2: How do semantic network-based word embedding techniques (ConceptNet) perform in the embedding representation of phrase semantics?
• Q3: Does the clustering method group semantically related key-phrases for identifying categories?

4.1   Data

Data collection determines the quality and relevance of the subsequent steps of systematic mapping studies. As keywording follows the step of paper selection, we assume that the input articles selected for our framework are already in-domain. However, common benchmark datasets for key-phrase extraction from scientific articles do not focus on a specific research domain. We therefore collected a set of scientific articles from IEEE Xplore in the domain of “Explainable Artificial Intelligence”. In total, 286 scientific publications were extracted together with their meta-data attributes, which we name the XAI dataset. The “title” and “abstract” of each article were combined as input text. IEEE Xplore also provides INSPEC indexing terms assigned by human experts to represent a publication's content. For the evaluation of key-phrase extraction, we use the “INSPEC Non-Controlled Indexing Terms” attribute as a gold standard, as its terms primarily emerge from the text.

                       total   in text   avg. # of tokens   avg. count per paper
Non-Controlled terms   3200    88.84%    2.6181             11.1888
Controlled terms       1536    20.73%    2.1978             4.7727

Table 1: Comparative analysis of Non-Controlled indexing terms and Controlled indexing terms.

4.2   Implementation and Tools

Pre-processing  The title and abstract of each document are concatenated as input text. Initial experiments on candidate selection recall found that lowercasing and punctuation removal would affect acronym extraction, tokenization, and noun phrase chunking. Also, noun phrases containing dash tags lead to a low recall of correct candidates. Thus, we remove dash tags and use an extended set of common stopwords [1].

Candidate Selection  Candidate selection is built within the framework of the SIFRank [2] model, where the tokenizer and POS tagger have been changed to SpaCy. The noun phrase pattern (defined in Eq. 2) is captured by regular expressions and parsed into a constituency tree for pattern matching.

    <NN.*|JJ>* <NN.*>        (2)

Acronym extraction is implemented directly using the built-in function in ScispaCy. Since acronyms are case-sensitive, we perform acronym extraction before pre-processing.

Candidate Ranking  Details of the candidate scoring process are illustrated above. The latest version of pre-trained ConceptNet Numberbatch (19.08, English version) is used as the pre-trained word embedding for the embedding representation. Our domain glossary terms are selected from open knowledge graph databases: (1) the artificial intelligence knowledge graph [3] (Dessì et al. 2020), from which terms with a direct link connection to the term “artificial intelligence” are extracted; (2) the machine learning taxonomy from Aminer [4] (Tang 2016).

Selection of Key-phrases  Before moving on to the clustering module, post-processing controls the quality of the extracted key-phrases to match the use case. We defined a few rule-based post-processing steps:

1. Lemmatize key-phrases to remove key-phrases that are redundant due to language inflection; the higher score of the two is kept.
2. Average the rank of each key-phrase across documents; key-phrases ranked above 15 are selected.
3. Replace each key-phrase identified as an acronym by its original definition in the text.
4. Remove the last 20% of key-phrases based on TF-IDF scores.

[1] Stopwords list from https://www.ranks.nl/stopwords
[2] https://github.com/sunyilgdx/SIFRank
[3] http://scholkg.kmi.open.ac.uk/
[4] https://www.aminer.cn/data
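The noun-phrase pattern of Eq. 2 can be applied to a POS-tagged token sequence with an ordinary regular expression over a tag string. The sketch below is self-contained and stands in for the SpaCy-based matching inside the SIFRank framework:

```python
import re

# Eq. 2, <NN.*|JJ>*<NN.*>: zero or more nouns/adjectives followed by a noun.
NP_PATTERN = re.compile(r"(?:<(?:NN[^>]*|JJ)>)*<NN[^>]*>")

def noun_phrase_spans(tagged):
    """tagged: list of (token, POS) pairs. Returns matched noun phrases."""
    tag_string = "".join(f"<{pos}>" for _, pos in tagged)
    spans = []
    for m in NP_PATTERN.finditer(tag_string):
        # Map character offsets back to token indices by counting tags.
        start = tag_string[:m.start()].count("<")
        end = start + m.group(0).count("<")
        spans.append(" ".join(tok for tok, _ in tagged[start:end]))
    return spans
```

Encoding each POS tag as `<TAG>` keeps tag boundaries unambiguous, so `NN`, `NNS`, and `NNP` all match `NN[^>]*` while `VBZ` does not.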
Clustering Algorithms  The clustering module is built on scikit-learn (Pedregosa et al. 2011) and spherecluster [5]. Before clustering, each term is transformed to its embedding representation from ConceptNet Numberbatch. We first explore the optimal k in the range from 5 to 100 clusters.

4.3   Evaluation Metrics

We evaluated our automation method using two criteria: the reliability of the extracted key-phrases and the quality of the categories generated from them. Evaluation is conducted separately on the two modules. Evaluation of the ranked key-phrase list uses the traditional statistical measures of Precision, Recall, and F1-score against the labeled gold standard. Morphological variants of phrases were removed before evaluation. Evaluation of semantic term clustering lacks a ground-truth classification scheme; we therefore use the internal silhouette coefficient score to measure how well the clusters are separated.

                          5    Results

To investigate the feasibility of our proposed automation method, we conducted experiments with different settings: (1) combined scoring and ranking for unsupervised key-phrase extraction; (2) embedding representation; (3) clustering methods.

5.1   Combined Scoring in Key-phrase Extraction (Q1)

For key-phrase extraction, we compared the combined scoring method with four base models. One is TextRank [6] (Mihalcea and Tarau 2004), a graph-based keyword extraction module. The other three are SIFRank-ELMo, SIFRank-Bert and SIFRank-ConceptNet, which differ in the underlying pre-trained word embedding representation. Our key-phrase extraction method extends the base models by combining their score with the two additional scores for ranking. We optimized the scores' weights based on evaluation and set the weights to 0.1 for both domain relevance and phrase quality. Experimental results in Table 2 show that the combined scoring methods outperform their original base models in three settings (TextRank, SIFRank-ELMo and SIFRank-ConceptNet), while SIFRank-Bert only performs better than its baseline for the Top10 and Top15 key-phrases. Table 3 also shows a positive effect when adding the two scores to the baselines. Meanwhile, the quality score has a larger impact than domain relevance. We believe this is because the domain relevance score is sensitive to the quality of the domain glossaries. Also, good key-phrases in the scientific literature usually share a similar structure, e.g., multi-word expressions. This also indicates that filtering out 'poor' candidate phrases contributes largely to better extraction performance.

In the example of the top-15 extracted key-phrases (Fig. 2), adding domain relevance and phrase quality reduces the rank of uni-grams (“method”, “logic”, “explanation”) as well as of terms with abstract meanings (“explanation method”). However, the method still has limitations with nested key-phrases of similar meaning (“black box decision making” and “black box”) and with wrong candidates from the selection step (“method outperforms”).

Figure 2: Example from top-15 extracted key-phrases.

5.2   Word Embedding (Q2)

Pre-trained embeddings are utilized for sentence and phrase representation in our method. For the ConceptNet embedding, each phrase is segmented into the longest matching terms in the embedding index and encoded by their average embedding vector. Since ELMo encodes phrases token by token, we take the mean vector of all tokens in the phrase. Comparing the pre-trained word embedding settings used in the SIFRank model, the SIFRank-ELMo and SIFRank-ConceptNet based models present similar performance, with SIFRank-ELMo slightly higher (Table 2). However, contextualized models such as ELMo and Bert require much more execution time than ConceptNet (Table 4). It is worth noting that ELMo and Bert generate embeddings from large natural language text corpora, while ConceptNet embeddings are generated from a semantic network; even so, our key-phrase extraction results do not show a large difference between the ELMo-based and ConceptNet-based methods. Therefore, the Numberbatch embeddings based on ConceptNet are more efficient for extracting short terms.

5.3   Clustering (Q3)

In the clustering module, each key-phrase is treated as an independent ontological concept term. Term-level clustering groups terms together based on the cosine similarity of their embeddings from ConceptNet Numberbatch. Spherical k-means and hierarchical agglomerative clustering (HAC) are evaluated in our clustering module. HAC uses average linkage and cosine distance. For the clustering experiments on the XAI dataset, terms are selected from the best model in the key-phrase extraction experiment, with the key-phrase post-processing and cleaning discussed above.

The silhouette score in Figure 3 shows that the curve of agglomerative clustering does not reach a peak within range

[5] https://pypi.org/project/spherecluster/
[6] Implemented with the pke python library (https://github.com/boudinfl/pke)
                                            Top5                        Top10                       Top15
                                       P      R      F1          P      R      F1          P      R      F1
TextRank            Baseline         0.4986 0.2228 0.3080      0.4411 0.3941 0.4162      0.3791 0.5066 0.4337
                    Combined Scoring 0.5203 0.2325 0.3214      0.4627 0.4134 0.4367      0.3793 0.5069 0.4339
SIFRank-ELMo        Baseline         0.5105 0.2281 0.3153      0.4327 0.3866 0.4083      0.3803 0.5072 0.4347
                    Combined Scoring 0.5469 0.2444 0.3378      0.4834 0.4319 0.4562      0.4152 0.5538 0.4746
SIFRank-Bert        Baseline         0.5266 0.2353 0.3253      0.4418 0.3947 0.4169      0.3796 0.5063 0.4339
                    Combined Scoring 0.5147 0.2300 0.3179      0.4530 0.4047 0.4275      0.3993 0.5325 0.4563
SIFRank-ConceptNet  Baseline         0.5049 0.2256 0.3119      0.4257 0.3803 0.4017      0.3679 0.4906 0.4205
                    Combined Scoring 0.5510 0.2463 0.3404      0.4774 0.4266 0.4506      0.4103 0.5471 0.4689

Table 2: Comparison of key-phrase extraction results of the ensemble (combined scoring) method with the four base models.
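The ensemble ranking behind the "Combined Scoring" rows can be illustrated with a minimal sketch. The fusion below is an illustrative weighted sum with hypothetical weights and toy scores, not the paper's exact formula; the base score stands in for a ranker such as TextRank or SIFRank.

```python
# Sketch of ensemble ranking: fuse a base key-phrase score with
# domain-relevance and phrase-quality scores. Weights and score values
# here are illustrative assumptions, not the paper's exact formulation.

def combined_score(base, domain_relevance, quality, w_dom=1.0, w_qual=1.0):
    """Weighted-sum fusion of a base ranking score with two auxiliary scores."""
    return base + w_dom * domain_relevance + w_qual * quality

def rank_candidates(candidates):
    """candidates: list of (phrase, base, domain_relevance, quality) tuples."""
    scored = [(p, combined_score(b, d, q)) for p, b, d, q in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)

candidates = [
    ("method", 0.9, 0.1, 0.2),            # generic uni-gram: low auxiliary scores
    ("object detection", 0.7, 0.8, 0.9),  # domain multi-word expression
    ("fuzzy system", 0.6, 0.9, 0.8),
]
ranking = rank_candidates(candidates)
# generic uni-grams like "method" are pushed down the ranking
```

This mirrors the qualitative effect reported above: auxiliary scores demote generic uni-grams even when the base ranker scores them highly.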

                                              Top10
                                         P        R        F1
SIFRank-ELMo        Baseline           0.4327   0.3866   0.4083
                    + Domain Relevance 0.4446   0.3972   0.4195
                    + Phrase Quality   0.4809   0.4297   0.4539
                    Combined Scoring   0.4834   0.4319   0.4562
SIFRank-Bert        Baseline           0.4418   0.3947   0.4169
                    + Domain Relevance 0.4372   0.3906   0.4126
                    + Phrase Quality   0.4425   0.3953   0.4176
                    Combined Scoring   0.4442   0.3969   0.4192
SIFRank-ConceptNet  Baseline           0.4257   0.3803   0.4017
                    + Domain Relevance 0.4404   0.3934   0.4156
                    + Phrase Quality   0.4823   0.4309   0.4552
                    Combined Scoring   0.4774   0.4266   0.4506

Table 3: Comparison of the impact of the three different scores on Top10 key-phrase performance.
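The F1 columns in Tables 2 and 3 are the harmonic mean of precision and recall, which makes the reported rows easy to sanity-check:

```python
# F1 as the harmonic mean of precision (P) and recall (R),
# reproducing reported Top10 values from the tables above.
def f1(p, r):
    return 2 * p * r / (p + r)

print(round(f1(0.4257, 0.3803), 4))  # SIFRank-ConceptNet baseline -> 0.4017
print(round(f1(0.4834, 0.4319), 4))  # SIFRank-ELMo combined       -> 0.4562
```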

                     Time (s)
SIFRank-ELMo         1650.81
SIFRank-ConceptNet    258.13

Table 4: Execution time (in seconds) of the two key-phrase extraction methods, including loading the embeddings.

Figure 3: Results of silhouette coefficient score with n clusters.


of 100 clusters, while spherical k-means reaches its highest score at 89 clusters. Spherical k-means also achieves better cluster quality than hierarchical agglomerative clustering, which is further confirmed by the results in Table 6.
   Theoretically, the silhouette score ranges from -1 to 1, where values close to 1 indicate well-separated clusters and values close to 0 indicate overlapping clusters. Even though neither clustering algorithm reaches a high silhouette score, analysis of the cluster output demonstrates the semantic coherence of the terms within clusters (Fig. 4 and Table 5), which can be identified as semantic categories of these key-phrases. Table 5 shows four example clusters, where the terms are ranked by their distance to the cluster center. The clusters in the table correspond to the categories "visual analytics" (cluster 1), "object detection" (cluster 2), "white box" (cluster 3) and "fuzzy system" (cluster 4). By manually analyzing the created clusters, some observations can be made:

• Terms within one cluster show high similarity in their sub-words, and shared sub-words often indicate semantic relatedness.
• A cluster's central meaning word represents the topic or category found in the cluster, which further determines whether it can be used as a part of the classification scheme.

Figure 4: Example of Hierarchical Agglomerative Clustering Dendrogram.

   Possible reasons for such results could be: (1) a limitation in the embedding representation of terms. Fine-tuning ConceptNet embeddings would require a network of domain-specific ontologies and is thus not applicable in our research, while pre-trained embeddings may have limited discriminative power in a specific domain; (2) a limitation concerning the clustering algorithms. Generic clustering algorithms assume that data points can be separated, and internal evaluations also measure the separation of clusters. We notice that the clusters overlap in the embedding space; thus, clustering may not be able
                1                                  2                                3                                4
        visual analytics                   object detection                     white box                     fuzzy system
   visual analytics workflow           object detection system             white box solution           hierarchical fuzzy system
      visual analytics tool          object detection framework            white box method             fuzzy system complexity
  visual analytics framework        interpretable object detection     black box decision making       evolutionary fuzzy system
  visual analytics researcher           robust object detection       equivalent white box solution        neuro fuzzy system
   visual analytics paradigm      occlusion robust object detection         black box nature           interpretable fuzzy system
    visual analytics solution       semantic object part detector         black box prediction                fuzzy method

             Table 5: Example of cluster-wise results on Spherical k-means clustering of XAI publications dataset.
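Table 5 orders each cluster's terms by their distance to the cluster center. Under the cosine-similarity setting used here, that ranking can be sketched as follows (the vectors below are random stand-ins for NumberBatch embeddings, and the helper name is ours):

```python
# Sketch: rank a cluster's terms by cosine distance to the cluster centroid,
# as in Table 5. Embeddings are random stand-ins for NumberBatch vectors.
import numpy as np

def rank_by_centroid(terms, vectors):
    """Return terms sorted by cosine distance to the mean vector, closest first."""
    V = np.asarray(vectors, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)  # unit-normalize rows
    centroid = V.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    dist = 1.0 - V @ centroid                          # cosine distance
    return [terms[i] for i in np.argsort(dist)]

terms = ["visual analytics", "visual analytics tool", "visual analytics workflow"]
vectors = np.random.default_rng(1).normal(size=(3, 300))
print(rank_by_centroid(terms, vectors))
```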


                     silhouette score
spherical k-means         0.1615
HAC                       0.0690

Table 6: Clustering analysis on the XAI publications dataset. The number of clusters is set to 74.

to separate clusters.

                         6    Conclusion

This paper proposes a joint framework of unsupervised key-phrase extraction and semantic term clustering to automate systematic mapping studies. Experiments are conducted using publications from the domain of Explainable Artificial Intelligence (XAI). In detail, we examined the ensemble ranking scores, the ConceptNet word embeddings, and the clustering performance.
   The key-phrase extraction results demonstrate the effectiveness of ensemble ranking scores from different perspectives, where domain knowledge (in terms of glossaries and a domain corpus) finds highly relevant terms, which can further be considered as constraints and external resources for weak supervision. ConceptNet based word embeddings perform as well as contextualized word embeddings, with much less execution time. These findings are further useful for guiding the choice of a suitable word embedding method for a given task and use case. Semantic term clustering can group semantically similar terms within clusters; still, we suggest that some minimal human involvement may help refine and select high-quality keyword clusters based on use cases, with the bulk of the work being performed by the algorithm.
   Above all, we hope our research offers a new perspective on automating the keywording and classification scheme steps of systematic mapping studies, towards faster and more convenient solutions in an open research knowledge era. For future work, the role of human involvement can be further evaluated with specific use cases in mind. Finally, ontology-related techniques could be explored as a means of refining keywords.

                         References

Arora, S.; Liang, Y.; and Ma, T. 2017. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations (ICLR).
Bennani-Smires, K.; Musat, C.; Jaggi, M.; Hossmann, A.; and Baeriswyl, M. 2018. Embedrank: Unsupervised keyphrase extraction using sentence embeddings. ArXiv abs/1801.04470.
Carver, J. C.; Hassler, E.; Hernandes, E.; and Kraft, N. A. 2013. Identifying barriers to the systematic literature review process. In 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 203–212.
Dessì, D.; Osborne, F.; Recupero, D. R.; Buscaldi, D.; Motta, E.; and Sack, H. 2020. Ai-kg: an automatically generated knowledge graph of artificial intelligence. In International Semantic Web Conference.
Firoozeh, N.; Nazarenko, A.; Alizon, F.; and Daille, B. 2020. Keyword extraction: Issues and methods. Natural Language Engineering 26(3):259–291.
Franzago, M.; Ruscio, D. D.; Malavolta, I.; and Muccini, H. 2016. Protocol for a systematic mapping study on collaborative model-driven software engineering. CoRR abs/1611.02619.
Kim, S.-W., and Gil, J.-M. 2019. Research paper classification systems based on tf-idf and lda schemes. Human-centric Computing and Information Sciences 9(1):30.
Liu, X.; Song, Y.; Liu, S.; and Wang, H. 2012. Automatic taxonomy construction from keywords. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1433–1441.
Marshall, I. J., and Wallace, B. C. 2019. Toward systematic review automation: a practical guide to using machine learning tools in research synthesis. Systematic Reviews 8(1):163.
Mihalcea, R., and Tarau, P. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 404–411. Barcelona, Spain: Association for Computational Linguistics.
Osborne, F.; Muccini, H.; Lago, P.; and Motta, E. 2019. Reducing the effort for systematic reviews in software engineering. ArXiv abs/1908.06676.
Papagiannopoulou, E., and Tsoumakas, G. 2020. A review of keyphrase extraction. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10(2):e1339.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.
Petersen, K.; Feldt, R.; Mujtaba, S.; and Mattsson, M. 2008.
Systematic mapping studies in software engineering. Pro-
ceedings of the 12th International Conference on Evaluation
and Assessment in Software Engineering 17.
Shimohata, S.; Sugio, T.; and Nagata, J. 1997. Retriev-
ing collocations by co-occurrences and word order con-
straints. In 35th Annual Meeting of the Association for Com-
putational Linguistics and 8th Conference of the European
Chapter of the Association for Computational Linguistics,
476–481.
Strehl, A.; Strehl, E.; Ghosh, J.; and Mooney, R. 2000. Impact of similarity measures on web-page clustering. In Workshop on Artificial Intelligence for Web Search (AAAI 2000), 58–64. AAAI.
Sun, Y.; Qiu, H.; Zheng, Y.; Wang, Z.; and Zhang, C. 2020.
Sifrank: A new baseline for unsupervised keyphrase extrac-
tion based on pre-trained language model. IEEE Access
8:10896–10906.
Tang, J. 2016. Aminer: Toward understanding big scholar
data. In Proceedings of the ninth ACM international confer-
ence on web search and data mining, 467–467.
Terko, A.; Žunić, E.; and Donko, D. 2019. Neurips con-
ference papers classification based on topic modeling. In In-
ternational Conference on Information, Communication and
Automation Technologies (ICAT), 1–5.
Usman, M.; Britto, R.; Börstler, J.; and Mendes, E. 2017.
Taxonomies in software engineering: A systematic mapping
study and a revised taxonomy development method. Infor-
mation and Software Technology 85:43 – 59.
Zhang, C.; Tao, F.; Chen, X.; Shen, J.; Jiang, M.; Sadler,
B.; Vanni, M.; and Han, J. 2018. Taxogen: Unsupervised
topic taxonomy construction by adaptive term embedding
and clustering. In Proceedings of the 24th ACM SIGKDD
International Conference on Knowledge Discovery & Data
Mining, 2701–2709.