<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.3390/e20020104</article-id>
      <title-group>
        <article-title>Keyword Extraction in Scientific Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Susie Xi Rao</string-name>
          <email>srao@ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Piriyakorn Piriyatamwong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Parijat Ghoshal</string-name>
          <email>parijat.ghoshal@nzz.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Nasirian</string-name>
          <email>sara.nasirian@supsi.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandra Mitrović</string-name>
          <email>sandra.mitrovic@supsi.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emmanuel de Salis</string-name>
          <email>emmanuel.desalis@he-arc.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Wechner</string-name>
          <email>michael.wechner@wyona.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vanya Brucker</string-name>
          <email>vanya.brucker@wyona.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Egger</string-name>
          <email>pegger@ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ce Zhang</string-name>
          <email>cezhang@ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chair of Applied Economics, ETH Zurich</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Challenge 2: Evaluation of Keyword Extraction</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Dalle Molle Institute for Artificial Intelligence</institution>
          ,
          <addr-line>Lugano</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Haute-Ecole Arc</institution>
          ,
          <addr-line>Neuchâtel</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Neue Zürcher Zeitung AG</institution>
          ,
          <addr-line>Zurich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Wyona AG</institution>
          ,
          <addr-line>Zurich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>2</volume>
      <fpage>265</fpage>
      <lpage>274</lpage>
      <abstract>
        <p>The scientific publication output grows exponentially. Therefore, it is increasingly challenging to keep track of trends and changes. Understanding scientific documents is an important step in downstream tasks such as knowledge graph building, text mining, and discipline classification. In this workshop, we provide a better understanding of keyword and keyphrase extraction from the abstracts of scientific publications.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>
          Keyphrases are single- or multi-word expressions (often nouns) that capture the main ideas of a given text, but do not necessarily appear in the text itself [
          <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
          ].
Keyphrases have been shown to be useful for many tasks
in the Natural Language Processing (NLP) domain, such
as (1.) indexing, archiving and pinpointing information
in the Information Retrieval (IR) domain [
          <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3, 4, 5, 6</xref>
          ], (2.)
document clustering [
          <xref ref-type="bibr" rid="ref3 ref7 ref8">3, 7, 8</xref>
          ], and (3.) summarizing texts
[
          <xref ref-type="bibr" rid="ref3">3, 9, 10, 11</xref>
          ], just to name a few.
Keyphrase extraction has been at the forefront of various application domains, ranging from the scientific community [
          <xref ref-type="bibr" rid="ref1 ref2">1, 2, 12</xref>
          ] to finance [13, 14], law [15], news media [11, 16, 17], patenting [18, 19], and medicine [20, 21, 22].
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>Challenge 2: Evaluation of Keyword Extraction</title>
        <p>
          Despite being a seemingly straightforward task for human domain experts, automatic keyphrase extraction is challenging: defining an evaluation protocol and a corresponding metric is far from trivial, for the following reasons.
(1.) We should look at the ground truth list of keywords in a critical way. As we mentioned above, there can exist more than one ground truth list of keyphrases for a given abstract. The keyword list provided in our dataset is a reference list of words provided by authors or by publishers; authors often do not provide their keyphrase list unless explicitly requested or required to do so [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. In scientific publications, supervised extraction approaches require a large, well-annotated training dataset [16], and the lack of such training datasets poses an additional challenge, even when reference lists are not readily available.
        </p>
        <p>(Figure 1: query results for the keyword search “data mining” in (a) Web of Science, (b) Google Scholar, (c) Scopus, and (d) Microsoft Academic.)</p>
        <p>One should only treat this list as a reference list, but not as the one and only correct list of keywords.</p>
        <p>(2.) There are different aims in extracting keyphrases in system design. As we will introduce in the rationale of designing the three systems in Section 3, the systems are designed to tackle various problems and, therefore, are optimized for different use cases. System 1 uses a simple TextRank algorithm (see Section 4), which outputs the most prominent set of keyphrases/keywords; System 2 uses TextRank on top of a clustering algorithm (see Section 5), which is targeted at grouping similar articles and then learns from the cluster of articles; and System 3 uses pretrained models and tools for Named-Entity Recognition (NER) (see Section 6), with the goal of fully utilizing existing models and tools by only pre-processing the input and/or post-processing the output.</p>
        <p>(3.) There are different objective functions that we want to optimize. Precision, recall, accuracy, false positive rate, and false negative rate are among the most common performance metrics for various application scenarios [23]. We might also consider the order of keyphrases, for example, as sorted by criteria such as frequency or TextRank score [24, 25]. In search engines, the hit rate is also an important metric [26]. Furthermore, one can evaluate exact matches and fuzzy matches. Fuzzy matches can be broken down into two types: “partial” matches and semantically equivalent matches [27, 28, 24]. There are other evaluation methods which account for the ranks and orders of the extracted keywords; see this Medium article for inspiration [24].</p>
        <p>
          Challenge 3: Growing Number of Scientific Publications. During the last decades, the number of scientific publications has increased exponentially each year [29], making it increasingly challenging for researchers to keep track of trends and changes, even strictly in their own field of interest [
          <xref ref-type="bibr" rid="ref3">3, 30</xref>
          ]. This bolsters the need for automatic keyword extraction for use cases such as text recommendation and summarization systems. The effect of the increase in publications is clearly visible in major academic search engines such as Google Scholar, Web of Science, Scopus, and Microsoft Academic. In a simple query (“data mining”), three out of four failed to bring up relevant scientific publications that are prominent in the field and anticipated by human domain experts. See the query results of the keyword search “data mining” in the different academic products in Figure 1. The search results vary largely across the different products, and it could be difficult for readers to choose between them without prior knowledge of the field. So far, only Microsoft Academic (Figure 1(d)) has returned relevant research results that point to the most influential author and work in the field of data mining. This is because Microsoft Academic has enabled a hierarchical discipline classification (indexed by keyphrases) that supports its users when reviewing the search results. In summary, without relevant and correct keyphrases, effective indexing and thus querying is not feasible.
        </p>
        <p>Challenge 4: Domain-Specific Keyword Extraction. Another challenge in keyphrase extraction is its domain-specific nature. One case is when a keyphrase extractor trained on generic texts misses technical terms that do not look like usual keyword noun chunks, such as the chemical name “C4H*Cl” [31]. The issue arises from the tokenization step: non-alphabetic characters such as “4” and “*” might be treated as separators, and thus such a keyword gets split into “C”, “H” and “Cl”, losing its original notion. Even if the separation works perfectly, this type of chemical name would still confuse keyphrase extractors that filter candidate keyphrases based on Part-of-Speech (POS) tags, because for POS-based extractors it is unclear whether “C4H*Cl” is an adjective, a noun, or another POS tag.</p>
        <p>Another case is when the keyphrase consists of a mix of generic and specific words, such as “Milky Way”. “Way” is generally a stopword [32], so the keyphrase extractor might only detect “Milky” and throw away “Way”, without realizing that the term “Way” is not a stopword in this specific context.</p>
        <p>Finally, we would like to mention KeyBERT, a state-of-the-art BERT-based keyword extractor [33]. KeyBERT works by extracting multi-word chunks whose vector embeddings are most similar to that of the original sentence. Without considering the syntactic structure of the text, KeyBERT sometimes outputs keyphrases that are incorrectly trimmed, such as “algorithm analyzes” or “learning machine learning”. This problem only worsens with the aforementioned examples from chemistry and astronomy, since it is not straightforward how to tokenize, i.e., “split”, words and how to handle non-alphabetic characters.</p>
        <p>Our Goals and Contributions in this Workshop. Despite the challenges, keyphrase extraction is an important step for many downstream tasks, as already described. In this workshop, we aim to cover the foundations of keyphrase extraction in scientific documents and provide a discussion venue for academia and industry on the topic of keyword extraction. Our contributions in the workshop are as follows.
(1.) We make new use of the existing dataset from the Web of Science (WOS) [34]. This dataset has been used as a benchmark dataset for hierarchical classification systems. Since it comes with reference lists of keywords, we utilize it as a benchmark dataset for keyword extraction. In this workshop, together with the participants, we study the feasibility of that dataset in three systems.
(2.) We introduce three commonly used systems in academia and industry for keyword extraction. For the various use cases of keyword extraction, we also design baseline evaluation metrics for each system.
(3.) We encourage participants to discuss, extend, and evaluate the systems that we have introduced.</p>
        <p>System Design of Keyword Extraction. For the keyword extraction, we provide two systems based on the unsupervised, graph-based algorithm TextRank [35]. System 1 (see Section 4) develops the TextRank keyword extractor from scratch in order to understand the reasoning behind it. System 2 (see Section 5) combines the TextRank algorithm with the K-Means clustering algorithm [36, 37] to provide keyphrases for each specific field (“cluster”). In System 3 (see Section 6), we cover the NER task, where entities in a sentence are identified as person, organization, and other predefined categories. We focus primarily on the biomedical domain, using the state-of-the-art biomedical NER tool HunFlair [38]. We also provide some baseline NERs for participants to evaluate.</p>
        <p>Beyond this workshop, the keyphrase extraction and NER methods we present are applicable to other text corpora, including media texts and legal texts; one only has to be aware of their domain-specific nature and properly adjust the algorithm pipeline. As such, we have linked the 20 Newsgroups text dataset for the participants to try their keyphrase extraction system on.</p>
      </sec>
  </sec>
    <sec id="sec-1-3">
      <title>2. Benchmark Dataset</title>
      <p>We take a subset of 46,985 records from the Web of Science (WOS) dataset. The original WOS dataset is provided by Kamran Kowsari in the paper HDLTex: Hierarchical Deep Learning for Text Classification [34]. The original data was provided in .txt format.</p>
      <p>For ease of work, we have pre-processed the original data and stored it in .csv dataframe format, which is most compatible with our Python working setup. The final dataframe is in the format of Table 1, where (1) each record corresponds to a single scientific document, and (2) has the following columns:
• Domain: the domain the document belongs to,
• area: the sub-domain the document belongs to,
• keywords: the list of keyphrases provided by the authors, stored as a single string with separator “;”,
• Abstract: the abstract of the document.
There are also columns Y1 and Y2, which are simply the indices of the columns Domain and area, respectively, and a column Y with the sub-sub-domain, which we do not use here but include for reference.</p>
      <p>Table 1: two example records of the pre-processed dataframe.
Domain: Medical; area: Sports Injuries; keywords: “Elastic therapeutic tape; Material properties; Tension test”; Abstract: “The aim of this study was to analyze stabilometry in athletes...”
Domain: Medical; area: Senior Health; keywords: “Sports injury; Athletes; Postural stability”; Abstract: “This study examined the influence of range of motion of the ankle joints on elderly people’s balance ability...”</p>
      <p>In the corpus, we are provided with scientific articles from seven domains: Medical, Computer Science (CS), Biochemistry, Psychology, Civil, Electronics and Communication Engineering (ECE), and Mechanical and Aerospace Engineering (MAE). Therefore, column Y1 consists of unique values from 0 to 6.</p>
      <p>In Table 1, note that both records have the same domain index Y1 (“5”, corresponding to the Domain “Medical”). Their sub-domain Y2 differs: the first record is about “Sports Injuries”, while the second record is about “Senior Health”. The keywords and Abstract of each record match its sub-domain.</p>
      <p>Finally, the records are split at a ratio of 70:30 into train/test sets with 32,899 and 14,096 abstracts, respectively. We provide the training set, including the keywords column, to the participants for training their keyword and/or NER extraction system, and the test set for evaluating the system. The reason for splitting the dataframe is so that the participants do not overfit their system to the whole dataset. We encourage them to design their system based on the features learnt from the training set and apply the identical pipeline to the test set.</p>
    </sec>
    <sec id="sec-1-4">
      <title>3. Systems</title>
      <p>Now we discuss the three systems we provide to the participants as simple baselines for keyword extraction using the benchmark dataset. Certainly, there are various possible extensions to them; we list the participant contributions under Section 7.</p>
    </sec>
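    <p>For illustration, records in this pre-processed format can be loaded with plain Python. This is a minimal sketch: the two inline rows are illustrative stand-ins for the actual WOS .csv file, and the helper name load_records is ours, not part of the provided notebooks:</p>
    <preformat>
```python
import csv
import io

# Hypothetical sample in the shape of the pre-processed dataframe (Table 1);
# the rows are illustrative, not taken from the actual file.
SAMPLE = """Domain,area,keywords,Abstract
Medical,Sports Injuries,Elastic therapeutic tape; Material properties; Tension test,The aim of this study...
Medical,Senior Health,Sports injury; Athletes; Postural stability,This study examined...
"""

def load_records(csv_text):
    """Read rows and split the ';'-separated keywords column into a list."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["keywords"] = [k.strip() for k in row["keywords"].split(";")]
    return rows

records = load_records(SAMPLE)
```
    </preformat>
    <p>On the real dataframe, the same splitting of the keywords column yields the per-document reference lists used for training and evaluation.</p>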
    <sec id="sec-2">
      <title>4. System 1: TextRank Algorithm</title>
      <sec id="sec-2-1">
        <p>In System 1, we build the TextRank algorithm from scratch and add customizations to our needs, e.g., filtering by Part-of-Speech tags.</p>
        <sec id="sec-2-1-1">
          <title>4.1. TextRank</title>
          <p>The TextRank algorithm is a graph-based algorithm which, as the name suggests, is used to assign scores to texts, thereby giving a ranking [35]. It has numerous use cases in the NLP domain, including webpage ranking (better known as PageRank), extractive text summarization, and keyword extraction [35, 39, 19, 17, 40, 41]. Across different use cases, the base TextRank algorithm remains the same; one only needs to adjust what is designated as nodes, edges, and edge weights when constructing the graph from the text corpus. A higher edge weight means a higher chance of choosing that particular edge to proceed to the next node. For example, in the web context, the PageRank algorithm considers different webpages as nodes and the hyperlinks between webpage pairs as edges. Here, the edges are asymmetrically directed, since there could be a hyperlink from one page to another but not necessarily vice versa. The edges can then be weighted by the number of hyperlinks.</p>
          <p>In our keyword extraction, the TextRank algorithm works by considering terms in the text as graph nodes, term co-occurrence as edges, and the number of co-occurrences of two terms within a certain window as the edge weights. Note that the co-occurrence window has a fixed, pre-specified size (say, 5-gram within sentence boundaries). Based on this notion, the graph is treated as weighted but undirected. Subsequently, each term score is given by how “likely” an agent, starting at a random point in the graph and continuously jumping along the weighted edges, will end up at that term node after a long time horizon. The terms with higher scores are then considered more important; they are the “keywords” extracted by the TextRank system. (In the web analogy, the webpage score would correspond to the chance that an Internet user would end up at that webpage after continuously browsing through the hyperlinks; in this sense, we retrieve the most popular webpages.)</p>
        </sec>
        <sec id="sec-2-2-1">
          <title>4.2. Implementation</title>
          <p>We implement a very basic keyword extraction system based on the TextRank algorithm from scratch, in order for the participants to get hands-on experience of how the algorithm works. Subsequently, we propose additional improvement ideas so that the participants have the opportunity to be creative and improve the basic system.</p>
          <p>For the implementation, we mainly use the Python package for natural language processing called spaCy [42]. spaCy utilizes pre-trained language models to perform many NLP tasks, among other things Part-of-Speech tagging (PoS tagging), semantic dependency parsing, and Named-Entity Recognition. In our case, we use spaCy along with its small pre-trained model for the English language (en_core_web_sm) as a text pre-processor and tokenizer. The rest of the tasks are handled by the usual built-in Python libraries.</p>
          <p>Our basic system consists of the following steps:
(1.) Text pre-processing: stopword and punctuation removal.
(2.) Text tokenization: tokenize the text and build a vocabulary list.
(3.) Build the adjacency matrix from the graph:
• matrix indices in rows and columns: the terms in the vocabulary list;
• matrix entries: the co-occurrences of term pairs within the same window of pre-specified size.
(4.) Normalize the matrix and compute its stationary distribution.
(5.) Retrieve the keyword(s) corresponding to the terms with the highest stationary probabilities.</p>
          <p>The implemented code is stored as a Jupyter notebook and hosted on Google Colaboratory, which allows the participants to test and work directly on the code online without local installation. There, a step-by-step description is provided and a code sanity check was performed. For example, our system extracts the valid keywords “cute”, “dog”, “cat” (in descending order of term prominence) for a short text: “This is a very cute dog. This is another cute cat. This dog and this cat are cute”.</p>
        </sec>
        <sec id="sec-2-4-1">
          <title>4.3. Further Ideas</title>
          <p>Inspired by existing keyword extraction systems in Python such as summa [43] and pke [44], we have provided the participants with a list of ideas to further improve the keyword extraction system, along with hints for the Python implementation using spaCy (see the Jupyter notebook):
• Improve the pre-processing step:
– Remove numbers.
– Standardize casings, such as lower-casing the entire text.
– Use a domain-specific or custom-made stopword list.
• Improve the tokenization step:
– Filter by Part-of-Speech tags to only include nouns in the vocabulary list.
– Use a domain-specific tokenizer such as ScispaCy [45] for biomedical data.
– Lemmatize or stem tokens before recording them in the vocabulary list and building the adjacency matrix, so that different versions of the same word (such as plural “solitons” and singular “soliton”) are mapped to the same record.
• Add a post-processing step:
– Exclude keywords that are too short.
– Agglomerate keywords (and perhaps add back some stopwords) to form “keyphrases” (“the” and “of” should not be removed within “the Department of Health”).</p>
        </sec>
      </sec>
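      <p>The basic pipeline of Section 4.2 can be sketched end-to-end in plain Python. This is a minimal sketch, not the notebook’s code: it swaps spaCy for a regex tokenizer with a toy stopword list, and approximates the stationary distribution with damped power iteration (the damping factor is our addition, following the standard TextRank formulation):</p>
      <preformat>
```python
import re
from collections import defaultdict

# Toy stopword list, sufficient for the sanity-check text only.
STOPWORDS = {"this", "is", "a", "very", "another", "and", "are", "the", "of"}

def textrank_keywords(text, window=2, damping=0.85, iters=50, top_k=3):
    # (1.)-(2.) Pre-process and tokenize: lowercase, keep alphabetic tokens,
    # drop stopwords, and build the vocabulary list.
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    vocab = sorted(set(tokens))
    # (3.) Build a symmetric co-occurrence matrix (a dict of edge weights):
    # term pairs within `window` positions get their edge weight incremented.
    weights = defaultdict(float)
    for i, t in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != t:
                weights[(t, tokens[j])] += 1.0
                weights[(tokens[j], t)] += 1.0
    # (4.) Power iteration approximating the stationary distribution of the
    # row-normalized weight matrix (PageRank-style, with damping).
    score = {t: 1.0 / len(vocab) for t in vocab}
    out_sum = {t: sum(w for (a, _), w in weights.items() if a == t) for t in vocab}
    for _ in range(iters):
        new = {}
        for t in vocab:
            rank = sum(
                score[s] * weights[(s, t)] / out_sum[s]
                for s in vocab
                if out_sum[s] > 0 and (s, t) in weights
            )
            new[t] = (1 - damping) / len(vocab) + damping * rank
        score = new
    # (5.) Return the terms with the highest stationary scores.
    return [t for t, _ in sorted(score.items(), key=lambda kv: -kv[1])[:top_k]]

print(textrank_keywords("This is a very cute dog. This is another cute cat. "
                        "This dog and this cat are cute."))
```
      </preformat>
      <p>Because the sanity-check text only has three content terms, all of them are returned; on real abstracts, the window size and the stopword list substantially affect the ranking.</p>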
      <sec id="sec-2-5">
        <p>Advanced participants are also directed to another Python package, NetworkX, which has a built-in, computationally efficient implementation of the TextRank algorithm [46].</p>
        <sec id="sec-2-5-2">
          <title>4.4. Evaluation: Instance-Based Performance</title>
          <p>In System 1, the objective is instance-based; that is, for each abstract, we need to evaluate how well the algorithm performs. The metric could be accuracy, that is, the ability to find as many keyphrases (compared to the reference list) as possible. We can also compute precision and recall scores (micro or macro). We provide a simple baseline evaluation function in the notebook. Here, we allow fuzzy matching algorithms on the phrase level, where the cut-off ratio and the edit distance between the candidate term and the reference term can be adjusted.</p>
        </sec>
      </sec>
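      <p>Instance-based scores with phrase-level fuzzy matching can be sketched as follows. This is a minimal sketch, not the notebook’s baseline: difflib’s similarity ratio stands in for the adjustable edit-distance criterion, and the function names are ours:</p>
      <preformat>
```python
from difflib import SequenceMatcher

def fuzzy_match(candidate, reference, cutoff=0.8):
    # Phrase-level fuzzy match: difflib's similarity ratio stands in for an
    # edit-distance criterion; `cutoff` is the adjustable cut-off ratio.
    return SequenceMatcher(None, candidate.lower(), reference.lower()).ratio() >= cutoff

def precision_recall(extracted, reference, cutoff=0.8):
    # Precision: extracted phrases fuzzily matching some reference phrase;
    # recall: reference phrases fuzzily covered by some extracted phrase.
    tp = sum(any(fuzzy_match(e, r, cutoff) for r in reference) for e in extracted)
    covered = sum(any(fuzzy_match(e, r, cutoff) for e in extracted) for r in reference)
    precision = tp / len(extracted) if extracted else 0.0
    recall = covered / len(reference) if reference else 0.0
    return precision, recall
```
      </preformat>
      <p>With cutoff 0.8, singular/plural variants such as “soliton” vs. “solitons” count as matches, while unrelated phrases do not.</p>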
    </sec>
    <sec id="sec-4">
      <title>5. System 2: TextRank with Clustering</title>
      <sec id="sec-4-1">
        <p>In System 2, we extend the TextRank keyword extraction described in System 1 (see Section 4) and apply it to groups of texts clustered by the K-Means algorithm. In this way, we obtain a more focused keyword list specifically for each text group and learn about its characteristics.</p>
        <sec id="sec-4-1-1">
          <title>5.1. K-Means Algorithm</title>
          <p>The K-Means algorithm is a clustering algorithm which partitions points in a vector space into “K” clusters (“K” being pre-specified), such that each point belongs to the cluster with the nearest cluster centroid (the “mean”) [36, 37]. It works in the following steps.
(1.) Assign k random points as the cluster “means”.
(2.) Repeat the following until convergence:
a) Assignment step: assign each point to the cluster with the least squared Euclidean distance to the cluster mean;
b) Update step: recalculate the “mean” as the average of all the points assigned to each cluster;
c) terminate when the cluster assignment stabilizes.</p>
          <p>We ultimately choose the K-Means algorithm for clustering because of its low complexity: it works very fast for large datasets like ours [47, 48]. Often, one hidden caveat of the K-Means algorithm is the choice of the number of clusters “K”. However, in our specific use case with the scientific publications, we usually have a good estimate based on the number of target disciplines. Therefore, K-Means serves our purpose well.</p>
        </sec>
      </sec>
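      <p>The assignment/update loop of Section 5.1 can be sketched as follows. This is a minimal sketch; for determinism it seeds the means with the first k points, whereas step (1.) in the text uses k random points:</p>
      <preformat>
```python
def kmeans(points, k, iters=100):
    # (1.) Initialize the cluster "means" (deterministically here; the text
    # describes random initialization).
    means = [points[i] for i in range(k)]
    assign = [-1] * len(points)
    for _ in range(iters):
        # a) Assignment step: nearest mean by squared Euclidean distance.
        new_assign = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, means[c])))
            for p in points
        ]
        # c) Terminate when the cluster assignment stabilizes.
        if new_assign == assign:
            break
        assign = new_assign
        # b) Update step: recalculate each mean as the average of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                means[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assign
```
      </preformat>
      <p>On well-separated toy points, the loop stabilizes after a few iterations; production use would rely on the sklearn implementation mentioned in Section 5.3.</p>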
      <sec id="sec-4-2">
        <title>5.2. Preprocessing: Sentence-BERT Embeddings</title>
        <p>As mentioned in the previous section, K-Means clusters points in a vector space. Therefore, we need to transform each text in our dataset into a vector representation. This is often done by averaging pre-trained word embeddings over all the words that appear in the document, regardless of whether they are context-free embeddings like GloVe [49] or contextualized embeddings like BERT [50]. However, this has been shown to perform worse than directly deriving contextualized sentence embeddings (Sentence-BERT [51]). Therefore, we opt for contextualized sentence embeddings from Sentence-BERT, which is trained with Siamese BERT networks [51]. More technical details can be found in the original paper by N. Reimers and I. Gurevych [51].</p>
        <p>Sentence-BERT transforms each text into a 384-dimensional, semantically meaningful vector, which is then ready to be input to the K-Means algorithm for clustering.</p>
      </sec>
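      <p>To make the pooling contrast concrete, here is a toy sketch of the averaging baseline mentioned above (mean-pooling word vectors per document). The 2-d vectors are invented for illustration; real setups use, e.g., high-dimensional GloVe vectors or the 384-d Sentence-BERT vectors:</p>
      <preformat>
```python
def mean_pool(vectors):
    # Average pre-trained word vectors over all words in a document.
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Toy 2-d "embeddings" (invented for illustration).
emb = {"sports": [1.0, 0.0], "injury": [0.8, 0.2], "tape": [0.0, 1.0]}
doc_vec = mean_pool([emb["sports"], emb["injury"], emb["tape"]])
```
      </preformat>
      <p>Mean-pooling discards word order and context, which is one reason directly derived sentence embeddings perform better as clustering inputs.</p>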
      <sec id="sec-4-3">
        <sec id="sec-4-3-1">
          <title>5.3. Implementation</title>
          <p>We add the clustering step to our pipeline, which effectively results in the following procedure:
(1.) for each document, extract its Sentence-BERT embedding;
(2.) cluster the documents into K groups based on their Sentence-BERT embeddings, i.e., by the sentence contents;
(3.) for each document cluster, extract its keyphrases.</p>
          <p>First, we generate embedding representations for each text, which is very easy with the Python package sentence-transformers. The package sentence-transformers offers several pre-trained models for different purposes, from which we choose the small model (all-MiniLM-L6-v2). Second, to group the documents, we use the implementation in the package sklearn [52]. Furthermore, we provide a cluster visualization using the package matplotlib [53]. We set the parameter K = 7 for the K-Means algorithm, which is the number of disciplines in the WOS dataset. Finally, we extract the keyphrases from each cluster. Unlike in System 1, we do not implement the TextRank algorithm from scratch, but instead use the existing Python package pke [44]. pke provides implementations of numerous keyword extraction algorithms from publications, as well as allowing customizations such as Part-of-Speech tag filters and a limit on the maximum number of words in a single keyphrase. In our case, we simply use the basic TextRank algorithm, also to demonstrate that even the very basic algorithm can already yield satisfying outputs.</p>
          <p>Like in System 1, the code implemented for System 2 is stored as a Jupyter notebook and hosted on Google Colaboratory. A step-by-step description is provided, and a code sanity check succeeds at characterizing a cluster: the cluster mostly consisting of medical articles has relevant keyphrases such as “patient group”, “treatment effects”, and “autism patient” among the top-10 extracted keyphrases.</p>
        </sec>
        <sec id="sec-4-3-6">
          <title>5.4. Further Ideas</title>
          <p>We invite the participants to explore improvement ideas, and we provide coding hints on how to implement them with pke:
• Customize the TextRank algorithm:
– Change the window size.
• Use alternative keyword extraction algorithms instead of TextRank, such as:
– the TopicRank algorithm [54],
– the Multipartite algorithm [55],
– the BERTopic algorithm [56].
• Impose extra criteria on valid keyphrases, such as:
– change the maximum number of words allowed in a single keyphrase,
– restrict the keyphrases to the top certain percentage of all keywords.</p>
        </sec>
        <sec id="sec-4-3-3">
          <title>5.5. Evaluation: Cluster-Based Performance</title>
          <p>Using a similar evaluation function as in System 1 (see Section 4.4), we now look at a cluster-based objective. This means that we take all the keywords from the articles clustered in the same group and build a new, expanded reference list of keywords; the user-generated list is then compared against this expanded list. Notably, this approach increases the coverage of keywords in the reference, in the hope of covering more out-of-abstract keywords. However, it comes at the cost of increasing the denominator when we compare the user-generated list to the reference list. One way to better present the reference list of one cluster is to process the list by criteria such as frequency. Another way to evaluate is to use word-embedding similarities (cf. KeyBERT [33] as an example of leveraging embeddings). In this way, we have a better view of the extracted keywords and of the degree to which the user-generated list is close to the reference list. In particular, this technique is useful for assessing the difference set between the user-generated list and the reference one.</p>
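          <p>Building the expanded cluster reference list can be sketched as follows (a minimal sketch; the helper names cluster_reference and cluster_recall are ours, not the notebook’s):</p>
          <preformat>
```python
def cluster_reference(keyword_strings):
    """Merge the ';'-separated author keyword strings of all articles in one
    cluster into a single expanded reference set."""
    ref = set()
    for s in keyword_strings:
        ref.update(k.strip().lower() for k in s.split(";") if k.strip())
    return ref

def cluster_recall(user_keywords, reference):
    """Share of the expanded reference covered by the user-generated list;
    the larger reference raises coverage but also the denominator."""
    user = {u.strip().lower() for u in user_keywords}
    hits = sum(1 for k in reference if k in user)
    return hits / len(reference) if reference else 0.0
```
          </preformat>
          <p>A frequency threshold or embedding-based matching, as discussed above, can then be layered on top of this exact-match baseline.</p>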
        </sec>
        <sec id="sec-4-3-4">
          <title>6.1. Named Entities, Named-Entity</title>
        </sec>
        <sec id="sec-4-3-5">
          <title>Recognition and Keyword Extraction</title>
          <p>The goal of System 3 (“Named-Entity Recognition as Keyword Extraction”) is to emulate some of the constraints that may exist in a practical setting. These could be situations where a keyword extractor system cannot be implemented, as the output of such systems may be incorrect or nonsensical. Another situation could be that one is required to use existing tools, such as a Named-Entity Recognition (NER) system, and must enact measures to improve the output of the model.</p>
          <p>A named entity (NE) is in most cases a proper noun, the most common categories being person, location and organization; however, other categories that are not proper nouns, such as temporal expressions, are also possible. Named-Entity Recognition consists of locating and classifying named entities mentioned in unstructured text into predefined categories [57, Chapter 8.3]. Keywords are single- or multi-word expressions that under ideal circumstances should concisely represent the key content of a document [58, Page 3]. As the goal of NER is to assign a label to spans of text [57, Chapter 8.3], it is a classification task that can be solved by building a machine learning model [59].</p>
          <p>The difference between keyword extraction and NER is as follows. Named entities are words or phrases with a specific label determined by the predefined classes of a given NER model. Therefore, these entities may not necessarily represent the essential content of a document. Keywords are not limited by the fixed categories of an NER model, and may contain named entities if those entities are representative of a given document. For example, a document about Heathrow Airport can contain keywords such as “arrival”, “customs”, “departure”, “duty free”, “immigration” and “London”. Depending on the model classes, an NER model on the same text could extract entities such as “British Airways” (ORG), “London” (LOC), “United Kingdom” (LOC), etc. In this example, there is overlap between the keywords and named entities; however, due to the defining characteristics of both approaches, there is a significant difference between the lists.</p>
          <p>Figure 2 demonstrates the use of keyword extraction and named-entity recognition in the industry setting at Neue Zürcher Zeitung (NZZ), where key terms are extracted and relevant articles are assigned to the terms.</p>
          <sec id="sec-4-3-5-1">
            <title>6.2. Use of Keywords in the News Domain</title>
            <p>As mentioned above, for a given text, keywords and the output of an NER model may overlap. When it comes to analyzing news, a typical NER model (with common categories such as person, organization, and location) excels at finding named entities for the model-specific categories. However, only extracting the entities is inadequate for finding nuanced differences between multiple articles that contain identical named entities. In Table 2 we see the titles of 10 articles published in Neue Zürcher Zeitung (NZZ) during March 2022. According to the NER model for German texts used internally by the NZZ, all articles have “Ukraine” (location) as a common named entity. Despite the similarities, there are thematic differences between these articles. After using a keyword extraction system that uses methodologies similar to those in Systems 1 and 2, keywords that are not named entities were found.</p>
            <p>Table 2: Titles of 10 articles published in Neue Zürcher Zeitung (NZZ) during March 2022.</p>
          </sec>
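<p>The Heathrow Airport example can be made concrete with plain set operations; the word lists below are taken from the text, the variable names are illustrative.</p>

```python
# Keyword list vs. NER output for the Heathrow Airport example.
keywords = {"arrival", "customs", "departure", "duty free", "immigration", "London"}
ner_entities = {"British Airways": "ORG", "London": "LOC", "United Kingdom": "LOC"}

overlap = keywords & set(ner_entities)        # terms found by both views
keywords_only = keywords - set(ner_entities)  # document-specific keywords
entities_only = set(ner_entities) - keywords  # typed entities that are not keywords

print(overlap)  # {'London'}
```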
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>6.4. Pre-Trained NER Models</title>
        <p>There are some disadvantages to using pre-trained NER models. One should take into consideration that using a pre-trained model to extract named entities from documents of different domains can result in a drop in model performance [65]. The training data and categories of the model will influence the output. For example, the string “ATP” can be labeled as an organization (e.g., Association of Tennis Professionals) by one model and as a chemical (e.g., adenosine triphosphate) by a biomedical NER model. Creating an NER model for a specific type of entity requires the annotation of a corpus, which can be a significant expense and effort for the user [65].</p>
        <p>The keywords extracted for the articles in Table 2, by contrast, demonstrate thematic groupings between the articles. The most common keyword for articles 1-4 is “Flüchtlinge” (“refugees”), and for articles 5-10 it is “Neutralität” (“neutrality”). This difference can also be observed in the article titles, and upon closer inspection of the article content it is evident that some of the articles (1-4) revolve around the topic of refugees from Ukraine, while the other articles (5-10) discuss the notion of neutrality. Using named entities or, in some cases, a predefined list of keywords can be useful to define broad topic pages (see nzz.ch/themen), but keywords offer concise yet semantically rich insights into the content of a document. Therefore, they can potentially be used to automatically identify possible subtopics within a news story or to discover emerging topics from newly published articles.</p>
      </sec>
      <sec id="sec-4-5">
        <title>6.5. Further Ideas</title>
        <p>The challenge of this system lies in working with pre-calculated data from systems that cannot be influenced. The participants are provided with multiple tables containing the output of two different NER systems, as well as fastText document and word vectors (see Section 6.3). In addition, they also have a table at their disposal to verify whether a keyword for a given document is present in the abstract and whether it was discovered by any of the NER models (with 100% string matches). The intuition of System 3 is that, given the resources (cost, time, hardware), one needs to come up with the best possible strategies to detect meaningful keywords.</p>
        <sec id="sec-4-5-1">
          <title>6.3. Data Preparation</title>
        </sec>
      </sec>
      <sec id="sec-4-6">
        <title>Footnotes</title>
        <p>2. https://fasttext.cc/ (last accessed: June 20, 2022).</p>
        <p>3. https://fasttext.cc/docs/en/crawl-vectors.html (last accessed: June 20, 2022).</p>
        <sec id="sec-4-6-1">
          <title>6.6. Evaluation: Instance-based Performance</title>
        </sec>
      </sec>
      <sec id="sec-4-7">
        <title>6.3. Data Preparation</title>
        <p>The FLAIR framework [60] was chosen as it contains many out-of-the-box NER models for generic and biomedical texts. Furthermore, the framework is also useful for integrating pre-trained embeddings and models. As many of the texts are from the biomedical domain, the ScispaCy library was used for word and sentence tokenization [61]. The results of the NER models were given to the participants. The ner-english model is a 4-class NER model for English, which comes with FLAIR [62]. This model has the following categories: locations (LOC), persons (PER), organizations (ORG), and miscellaneous (MISC) [63]. We also provided participants with NER results from HunFlair [38], which is an NER tagger for biomedical texts. This biomedical NER tagger is based on the HUNER tagger and has the following named-entity categories: Chemicals, Diseases, Species, Genes or Proteins, and Cell lines [64]. As an additional hint to participants, document embeddings for each item in the train and test sets, as well as word embeddings for the entire corpus, were generated from a fastText model2 trained on the English Common Crawl dataset (cc.en.300.bin)3.</p>
        <p>In addition to the pre-calculated data, the participants were also given evaluation functions to compare differences between their system’s NER model output and the keyword list that came with the documents. There are cases where an item from the curated keyword list does not contain the keyword in the abstract, or contains a partial or inflected form of the keyword. The evaluation function contains a partial string matching sequence, where one can choose the amount of character similarity between two strings. For example, a document has the label “radio frequency”, but the string “radio frequencies” is present in the abstract, and the inflected form was also found by one of the NER models. For this case, participants can set a string similarity value (e.g., 80% similarity) to circumvent the issues caused by inflected forms or partially mentioned forms (“radio frequency” vs. “radio frequency scanner”). Using the resources at their disposal, participants must develop the best possible strategies to build a system that can detect the maximum number of relevant keywords.</p>
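<p>The partial string matching described above can be sketched with Python’s difflib. The evaluation function given to participants is not reproduced here; fuzzy_contains and its 0.8 default threshold are illustrative assumptions.</p>

```python
from difflib import SequenceMatcher

def fuzzy_contains(label, text, threshold=0.8):
    """Return True if some word span of `text` matches `label` with at
    least `threshold` character similarity (0..1)."""
    words = text.lower().split()
    n = len(label.split())
    label = label.lower()
    # Compare against spans of the same length and slightly longer ones
    # ("radio frequency" vs. "radio frequency scanner").
    for size in (n, n + 1):
        for i in range(len(words) - size + 1):
            span = " ".join(words[i:i + size])
            if SequenceMatcher(None, label, span).ratio() >= threshold:
                return True
    return False

# "radio frequencies" is an inflected form of the label "radio frequency":
print(fuzzy_contains("radio frequency", "we measured radio frequencies in situ"))  # True
```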
      </sec>
    </sec>
    <sec id="sec-5">
      <title>7. Participant Contributions</title>
      <sec id="sec-5-1">
        <title>Our participants have further investigated keyphrase extraction in System 1 and provided valuable contributions to our proceedings. Their original theses can be found in the following Google Drive folder.</title>
        <p>The basic TextRank keyword extractor in System 1
has been extended to account for the following data
preprocessing steps: (1) remove numbers; (2) restrict valid
keywords to only nouns; (3) restrict valid keywords by
imposing the minimum string length. The contribution
can be found on the Google Drive folder.</p>
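<p>The three preprocessing steps above can be sketched as a token filter. POS tags are assumed to come from an external tagger (e.g., spaCy coarse tags); valid_keyword is an illustrative helper, not the participants’ code.</p>

```python
def valid_keyword(token, pos_tag, min_length=3):
    """Filter candidate keywords per the three preprocessing steps:
    (1) remove numbers, (2) restrict to nouns, (3) impose a minimum
    string length.  `pos_tag` is assumed to be a coarse POS tag from
    a tagger such as spaCy ('NOUN', 'PROPN', 'VERB', 'NUM', ...)."""
    if any(ch.isdigit() for ch in token):   # (1) remove numbers
        return False
    if pos_tag not in ("NOUN", "PROPN"):    # (2) nouns only
        return False
    return len(token) >= min_length         # (3) minimum string length

candidates = [("extraction", "NOUN"), ("2022", "NUM"), ("run", "VERB"), ("ai", "NOUN")]
kept = [t for t, pos in candidates if valid_keyword(t, pos)]
print(kept)  # ['extraction']
```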
          <p>Additionally, the evaluation system has been generalized to output numerical performance scores, allowing simpler comparisons of different keyword extractors. The contribution can be found on the Google Drive folder.</p>
          <p>Finally, a comparison between the TextRank algorithm and further unsupervised keyphrase extraction methods has been provided. The limitation of TextRank is that it only considers the co-occurrences of word pairs and not their semantic meaning, which may cause certain extracted “frequent” word pairs to be either irrelevant or under-represented. Therefore, an experiment has been performed using the pke library to compare the performance of the TextRank algorithm and several other unsupervised keyphrase extraction algorithms on the benchmark test dataset. The contribution can be found on the Google Drive folder.</p>
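<p>The co-occurrence-only nature of TextRank can be seen in a minimal sketch (this is not pke’s implementation; the window size and damping values are illustrative): the graph records only which words appear near each other, and the PageRank-style score ignores meaning entirely.</p>

```python
from collections import defaultdict

def textrank_scores(tokens, window=2, damping=0.85, iters=50):
    """Minimal TextRank: undirected co-occurrence graph over a sliding
    window, scored with unweighted PageRank iterations.  Semantics are
    ignored -- only co-occurrence counts, which is the limitation
    discussed above."""
    graph = defaultdict(set)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[i] != tokens[j]:
                graph[tokens[i]].add(tokens[j])
                graph[tokens[j]].add(tokens[i])
    score = {v: 1.0 for v in graph}
    for _ in range(iters):
        score = {v: (1 - damping)
                    + damping * sum(score[u] / len(graph[u]) for u in graph[v])
                 for v in graph}
    return sorted(score.items(), key=lambda kv: -kv[1])

tokens = "keyword extraction ranks keyword candidates by graph centrality".split()
ranking = [w for w, _ in textrank_scores(tokens)]
```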
        <p>Beyond the academic setting, the use of keyword
extractions is demonstrated in the industry setting, where
Wyona AG utilizes keyword extractors in the working
pipeline of the Q&amp;A Chatbot “Katie”. The contribution
can be found on the Google Drive folder.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>8. Conclusion</title>
        <p>In this workshop, we provided the background and baseline systems for keyword extraction, shared a benchmark dataset on scientific keyword extraction, and invited contributions from participants from industry and academia. The methodologies discussed can be extended to keyword extraction in other domains (e.g., legal and news).</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>The authors would like to thank the organizers of SwissText2022 for hosting our workshop. Peter Egger and the Chair of Applied Economics acknowledge the support of the Department of Management, Technology, and Economics at ETH Zurich. Ce Zhang and the DS3Lab gratefully acknowledge the support of the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number MB22.00036 (for European Research Council (ERC) Starting Grant TRIDENT 101042665), the Swiss National Science Foundation (Project Numbers 200021_184628 and 197485), Innosuisse/SNF BRIDGE Discovery (Project Number 40B2-0_187132), the European Union Horizon 2020 Research and Innovation Programme (DAPHNE, 957407), the Botnar Research Centre for Child Health, the Swiss Data Science Center, Alibaba, Cisco, eBay, Google Focused Research Awards, Kuaishou Inc., Oracle Labs, Zurich Insurance, and the Department of Computer Science at ETH Zurich. We would like to thank Neue Zürcher Zeitung for collaborating on this project.</p>
      <sec id="sec-7-2">
        <title>References</title>
        <p>Y. Zhang, C. Zhang, P. Mayr, A. Suominen (Eds.), Proceedings of the 1st Workshop on AI + Informetrics (AII2021) co-located with the iConference 2021, Virtual Event, March 17th, 2021, volume 2871 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 58–70.</p>
        <p>[31] M. Krallinger, F. Leitner, O. Rabal, M. Vazquez, J. Oyarzábal, A. Valencia, CHEMDNER: The drugs and chemical names extraction challenge, Journal of Cheminformatics 7 (2015) S1.</p>
        <p>[32] M. F. Porter, An Algorithm for Suffix Stripping, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997, pp. 313–316.</p>
        <p>[33] M. Grootendorst, KeyBERT: Minimal keyword extraction with BERT, 2020. doi:10.5281/zenodo.4461265.</p>
        <p>[34] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, L. E. Barnes, HDLTex: Hierarchical deep learning for text classification, in: Machine Learning and Applications (ICMLA), 2017 16th IEEE International Conference on, IEEE, 2017.</p>
        <p>[35] R. Mihalcea, P. Tarau, TextRank: Bringing order into text, in: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 404–411.</p>
        <p>[36] S. Lloyd, Least squares quantization in PCM, IEEE Transactions on Information Theory 28 (1982) 129–137. doi:10.1109/TIT.1982.1056489.</p>
        <p>[37] J. MacQueen, Classification and analysis of multivariate observations, in: 5th Berkeley Symp. Math. Statist. Probability, 1967, pp. 281–297.</p>
        <p>[38] L. Weber, M. Sänger, J. Münchmeyer, M. Habibi, U. Leser, A. Akbik, HunFlair: an easy-to-use tool for state-of-the-art biomedical named entity recognition, Bioinformatics 37 (2021) 2792–2794. doi:10.1093/bioinformatics/btab042.</p>
        <p>[39] J. Son, Y. Shin, Music lyrics summarization method using textrank algorithm, Journal of Korea Multimedia Society 21 (2018) 45–50. doi:10.9717/kmms.2018.21.1.045.</p>
        <p>[40] C. Wu, L. Liao, F. Afedzie Kwofie, F. Zou, Y. Wang, M. Zhang, Textrank keyword extraction method based on multi-feature fusion, in: X.-S. Yang, S. Sherratt, N. Dey, A. Joshi (Eds.), Proceedings of Sixth International Congress on Information and Communication Technology, Springer Singapore, Singapore, 2022, pp. 493–501.</p>
        <p>[41] S. Pan, Z. Li, J. Dai, An improved textrank keywords extraction algorithm, in: Proceedings of the ACM Turing Celebration Conference - China, ACM TURC ’19, Association for Computing Machinery, New York, NY, USA, 2019. doi:10.1145/3321408.3326659.</p>
        <p>[42] I. Montani, M. Honnibal, S. V. Landeghem, A. Boyd, H. Peters, P. O. McCann, M. Samsonov, J. Geovedi, J. O’Regan, D. Altinok, G. Orosz, S. L. Kristiansen, D. de Kok, L. Miranda, Roman, E. Bot, L. Fiedler, G. Howard, Edward, W. Phatthiyaphaibun, R. Hudson, Y. Tamura, S. Bozek, murat, R. Daniels, P. Baumgartner, M. Amery, B. Böing, explosion/spaCy: New Span Ruler component, JSON (de)serialization of Doc, span analyzer and more, 2022. doi:10.5281/zenodo.6621076.</p>
        <p>[43] F. Barrios, F. López, L. Argerich, R. Wachenchauzer, Variations of the similarity function of TextRank for automated summarization, CoRR abs/1602.03606 (2016). arXiv:1602.03606.</p>
        <p>[44] F. Boudin, pke: an open source python-based keyphrase extraction toolkit, in: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, Osaka, Japan, 2016, pp. 69–73.</p>
        <p>[45] M. Neumann, D. King, I. Beltagy, W. Ammar, ScispaCy: Fast and robust models for biomedical natural language processing, in: Proceedings of the 18th BioNLP Workshop and Shared Task, Association for Computational Linguistics, Florence, Italy, 2019, pp. 319–327. doi:10.18653/v1/W19-5034.</p>
        <p>[46] A. Hagberg, P. Swart, D. S. Chult, Exploring network structure, dynamics, and function using NetworkX (2008).</p>
        <p>[47] D. Xu, Y. Tian, A comprehensive survey of clustering algorithms, Annals of Data Science 2 (2015) 165–193. doi:10.1007/s40745-015-0040-1.</p>
        <p>[48] A. E. Ezugwu, A. M. Ikotun, O. O. Oyelade, L. Abualigah, J. O. Agushaka, C. I. Eke, A. A. Akinyelu, A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects, Engineering Applications of Artificial Intelligence 110 (2022) 104743. doi:10.1016/j.engappai.2022.104743.</p>
        <p>[49] J. Pennington, R. Socher, C. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1532–1543. doi:10.3115/v1/D14-1162.</p>
        <p>[50] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-Y.</given-names>
            <surname>Kan</surname>
          </string-name>
          ,
          <article-title>Keyphrase extraction in scientific publications</article-title>
          , in:
          <string-name>
            <given-names>D. H.-L.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. T.</given-names>
            <surname>Sølvberg</surname>
          </string-name>
          , E. Rasmussen (Eds.),
          <source>Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2007</year>
          , pp.
          <fpage>317</fpage>
          -
          <lpage>326</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. N.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-Y.</given-names>
            <surname>Kan</surname>
          </string-name>
          ,
          <article-title>Re-examining automatic keyphrase extraction approaches in scientific articles</article-title>
          ,
          <source>Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009)</source>
          , Association for Computational Linguistics, Singapore,
          <year>2009</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Frank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Paynter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Nevill-Manning</surname>
          </string-name>
          ,
          <article-title>Domain-specific keyphrase extraction</article-title>
          , in: T.
          <string-name>
            <surname>Dean</surname>
          </string-name>
          (Ed.),
          <source>Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, IJCAI 99</source>
          , Stockholm, Sweden,
          <source>July 31 - August 6</source>
          ,
          <year>1999</year>
          . 2 Volumes, 1450 pages, Morgan Kaufmann,
          <year>1999</year>
          , pp.
          <fpage>668</fpage>
          -
          <lpage>673</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutwin</surname>
          </string-name>
          , G. Paynter, I. Witten,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nevill-Manning</surname>
          </string-name>
          , E. Frank,
          <article-title>Improving browsing in digital libraries with keyphrase indexes, Decision Support Systems 27 (</article-title>
          <year>1999</year>
          )
          <fpage>81</fpage>
          -
          <lpage>104</lpage>
          . doi:https://doi.org/10.1016/S0167-9236(99)00038-X
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Medelyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          ,
          <article-title>Domain-independent automatic keyphrase indexing with small training sets</article-title>
          ,
          <source>J. Am. Soc. Inf. Sci. Technol</source>
          .
          <volume>59</volume>
          (
          <year>2008</year>
          )
          <fpage>1026</fpage>
          -
          <lpage>1040</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>O.</given-names>
            <surname>Borisov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aliannejadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Keyword extraction for improved document retrieval in conversational search</article-title>
          ,
          <source>CoRR abs/2109.05979</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Web document clustering by using automatic keyphrase extraction</article-title>
          ,
          <source>in: 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Workshops</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>59</lpage>
          . doi:10.1109/WI-IATW.2007.46.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Hammouda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Matute</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Kamel</surname>
          </string-name>
          , CorePhrase:
          <article-title>Keyphrase extraction for document clustering</article-title>
          , in: P.
          <string-name>
            <surname>Perner</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Imiya (Eds.),
          <source>Machine Learning and Data Mining in Pattern Recogni-</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>