<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.3390/e20020104</article-id>
      <title-group>
        <article-title>Keyword Extraction in Scientific Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Susie Xi Rao</string-name>
          <email>srao@ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Piriyakorn Piriyatamwong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Parijat Ghoshal</string-name>
          <email>parijat.ghoshal@nzz.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Nasirian</string-name>
          <email>sara.nasirian@supsi.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandra Mitrović</string-name>
          <email>sandra.mitrovic@supsi.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emmanuel de Salis</string-name>
          <email>emmanuel.desalis@he-arc.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Wechner</string-name>
          <email>michael.wechner@wyona.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vanya Brucker</string-name>
          <email>vanya.brucker@wyona.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Egger</string-name>
          <email>pegger@ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ce Zhang</string-name>
          <email>cezhang@ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chair of Applied Economics, ETH Zurich</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Challenge 2: Evaluation of Keyword Extraction</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Dalle Molle Institute for Artificial Intelligence</institution>
          ,
          <addr-line>Lugano</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Haute-Ecole Arc</institution>
          ,
          <addr-line>Neuchâtel</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Neue Zürcher Zeitung AG</institution>
          ,
          <addr-line>Zurich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Wyona AG</institution>
          ,
          <addr-line>Zurich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>2</volume>
      <fpage>265</fpage>
      <lpage>274</lpage>
      <abstract>
        <p>The scientific publication output grows exponentially. Therefore, it is increasingly challenging to keep track of trends and changes. Understanding scientific documents is an important step in downstream tasks such as knowledge graph building, text mining, and discipline classification. In this workshop, we provide a better understanding of keyword and keyphrase extraction from the abstracts of scientific publications.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>
          Keyphrases are single- or multi-word expressions (often nouns) that capture the main ideas of a given text, but do not necessarily appear in the text itself [
          <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
          ].
Keyphrases have been shown to be useful for many tasks
in the Natural Language Processing (NLP) domain, such
as (1.) indexing, archiving and pinpointing information
in the Information Retrieval (IR) domain [
          <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3, 4, 5, 6</xref>
          ], (2.)
document clustering [
          <xref ref-type="bibr" rid="ref3 ref7 ref8">3, 7, 8</xref>
          ], and (3.) summarizing texts
[
          <xref ref-type="bibr" rid="ref3">3, 9, 10, 11</xref>
          ], just to name a few.
Keyphrase extraction has been at the forefront of various application domains, ranging from the scientific community [
          <xref ref-type="bibr" rid="ref1 ref2">1, 2, 12</xref>
          ] to finance [13, 14], law [15], news media [11, 16, 17], patenting [18, 19], and medicine [20, 21, 22].
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>Challenge 2: Evaluation of Keyword Extraction</title>
        <p>
          Despite being a seemingly straightforward task for human domain experts, automatic keyphrase extraction is challenging: defining an evaluation protocol and a corresponding metric is far from trivial, for the following reasons.
(1.) We should look at the ground truth list of keywords in a critical way. As we mentioned above, there can exist more than one ground truth list of keyphrases for a given abstract. The keyword list provided in our dataset is a reference list of words provided by authors or by publishers; authors often do not provide their keyphrase list unless explicitly requested or required to do so [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. In scientific publications, supervised extraction approaches require a large, well-annotated training dataset [16], and the lack of such training datasets poses an additional challenge, even when reference lists are not readily available.
        </p>
        <p>(Figure 1: query results for the keyword search “data mining” in (a) Web of Science, (b) Google Scholar, (c) Scopus, and (d) Microsoft Academic.)</p>
        <p>One should only treat this list as a reference list, but not as the one and only correct list of keywords.</p>
        <p>(2.) There are different aims in extracting keyphrases in system design. As we will introduce in the rationale of designing the three systems in Section 3, the systems are designed to tackle various problems and, therefore, are optimized for different use cases. System 1 uses a simple TextRank algorithm (see Section 4), which outputs the most prominent set of keyphrases/keywords; System 2 uses TextRank on top of a clustering algorithm (see Section 5), which is targeted at grouping similar articles and then learns from the cluster of articles; and System 3 uses pretrained models and tools for Named-Entity Recognition (NER) (see Section 6), with the goal of fully utilizing existing models and tools by only pre-processing the input and/or post-processing the output.</p>
        <p>(3.) There are different objective functions that we want to optimize. Precision, recall, accuracy, false positive rate, and false negative rate are among the most common performance metrics for various application scenarios [23]. We might also consider the order of keyphrases, for example, as sorted by criteria such as frequency or TextRank score [24, 25]. In search engines, the hit rate is also an important metric [26]. Furthermore, one can evaluate exact matches and fuzzy matches. Fuzzy matches can be broken down into two types: “partial” matches and semantically equivalent matches [27, 28, 24]. There are other evaluation methods which account for the ranks and orders of the extracted keywords; see this Medium article for inspiration [24].</p>
        <p>
          Challenge 3: Growing Number of Scientific Publications. During the last decades, the number of scientific publications has increased exponentially each year [29], making it increasingly challenging for researchers to keep track of trends and changes, even strictly in their own field of interest [
          <xref ref-type="bibr" rid="ref3">3, 30</xref>
          ]. This bolsters the need for automatic keyword extraction for use cases such as text recommendation and summarization systems. The effect of the increase in publications is clearly visible in major academic search engines such as Google Scholar, Web of Science, Scopus, and Microsoft Academic. In a simple query (“data mining”), three out of four failed to bring up relevant scientific publications that are prominent in the field and anticipated by human domain experts. See the query results of the keyword search “data mining” in the different academic products in Figure 1. The search results vary largely across the different products, and it could be difficult for readers to choose between them without prior knowledge of the field. So far, only Microsoft Academic (Figure 1(d)) has returned relevant research results that point to the most influential author and work in the field of data mining. This is because Microsoft Academic has enabled a hierarchical discipline classification (indexed by keyphrases) that supports its users when reviewing the search results. In summary, without relevant and correct keyphrases, effective indexing and thus querying is not feasible.
        </p>
        <p>Challenge 4: Domain-Specific Keyword Extraction. Another challenge in keyphrase extraction is its domain-specific nature. One case is when a keyphrase extractor trained on generic texts misses technical terms that do not look like usual keyword noun chunks, such as the chemical name “C4H*Cl” [31]. The issue arises from the tokenization step: non-alphabetic characters such as “4” and “*” might be treated as separators, and thus such a keyword gets split into “C”, “H” and “Cl”, losing its original notion. Even if the separation works perfectly, this type of chemical name would still confuse keyphrase extractors that filter candidate keyphrases based on Part-of-Speech (POS) tags, because for POS-based extractors it is unclear whether “C4H*Cl” is an adjective, a noun, or another POS tag.</p>
        <p>Another case is when the keyphrase consists of a mix of generic and specific words, such as “Milky Way”. “Way” is generally a stopword [32], so the keyphrase extractor might only detect “Milky” and throw away “Way”, without realizing that the term “Way” is not a stopword in this specific context.</p>
        <p>Finally, we would like to mention KeyBERT, a state-of-the-art BERT-based keyword extractor [33]. KeyBERT works by extracting multi-word chunks whose vector embeddings are most similar to that of the original sentence. Without considering the syntactic structure of the text, KeyBERT sometimes outputs keyphrases that are incorrectly trimmed, such as “algorithm analyzes” or “learning machine learning”. This problem only worsens with the aforementioned examples from chemistry and astronomy, since it is not straightforward how to tokenize, i.e., “split”, words and how to handle non-alphabetic characters.</p>
        <p>Our Goals and Contributions in this Workshop. Despite the challenges, keyphrase extraction is an important step for many downstream tasks, as already described. In this workshop, we aim to cover the foundations of keyphrase extraction in scientific documents and provide a discussion venue for academia and industry on the topic of keyword extraction. Our contributions in the workshop are as follows.
(1.) We make new use of the existing dataset from the Web of Science (WOS) [34]. This dataset has been used as a benchmark dataset for hierarchical classification systems. Since it comes with reference lists of keywords, we utilize it as a benchmark dataset for keyword extraction. In this workshop, together with the participants, we study the feasibility of that dataset in three systems.
(2.) We introduce three commonly used systems in academia and industry for keyword extraction. For the various use cases of keyword extraction, we also design baseline evaluation metrics for each system.
(3.) We encourage participants to discuss, extend, and evaluate the systems that we have introduced.</p>
        <p>System Design of Keyword Extraction. For the keyword extraction, we provide two systems based on the unsupervised, graph-based algorithm TextRank [35]. System 1 (see Section 4) develops the TextRank keyword extractor from scratch in order to understand the reasoning behind it. System 2 (see Section 5) combines the TextRank algorithm with the K-Means clustering algorithm [36, 37] to provide keyphrases for each specific field (“cluster”). In System 3 (see Section 6), we cover the NER task, where entities in a sentence are identified as person, organization, and other predefined categories. We focus primarily on the biomedical domain, using the state-of-the-art biomedical NER tool HunFlair [38]. We also provide some baseline NERs for participants to evaluate.</p>
        <p>Beyond this workshop, the keyphrase extraction and NER methods we present are applicable to other text corpora, including media texts and legal texts; one only has to be aware of their domain-specific nature and properly adjust the algorithm pipeline. As such, we have linked the 20 Newsgroups text dataset for the participants to try their keyphrase extraction system on.</p>
      </sec>
  </sec>
    <sec id="sec-1-3">
      <title>2. Benchmark Dataset</title>
      <p>We take a subset of 46,985 records from the Web of Science (WOS) dataset. The original WOS dataset is provided by Kamran Kowsari in the paper HDLTex: Hierarchical Deep Learning for Text Classification [34]. The original data was provided in .txt format.</p>
      <p>For ease of work, we have pre-processed the original data and stored it in .csv dataframe format, which is most compatible with our Python working setup. The final dataframe is in the format of Table 1, where (1) each record corresponds to a single scientific document, and (2) has the following columns:
• Domain: the domain the document belongs to,
• area: the sub-domain the document belongs to,
• keywords: the list of keyphrases provided by the authors, stored as a single string with separator “;”,
• Abstract: the abstract of the document.
There are also columns Y1 and Y2, which are simply the indices of the columns Domain and area, respectively, and a column Y with the sub-sub-domain, which we do not use here but include for reference.</p>
      <p>Table 1: two example records of the pre-processed dataframe.
Domain: Medical; area: Sports Injuries; keywords: “Elastic therapeutic tape; Material properties; Tension test”; Abstract: “The aim of this study was to analyze stabilometry in athletes...”
Domain: Medical; area: Senior Health; keywords: “Sports injury; Athletes; Postural stability”; Abstract: “This study examined the influence of range of motion of the ankle joints on elderly people’s balance ability...”</p>
      <p>In the corpus, we are provided with scientific articles from seven domains: Medical, Computer Science (CS), Biochemistry, Psychology, Civil, Electronics and Communication Engineering (ECE), and Mechanical and Aerospace Engineering (MAE). Therefore, column Y1 consists of unique values from 0 to 6.</p>
      <p>In Table 1, note that both records have the same domain index Y1 (“5”, corresponding to the Domain “Medical”). Their sub-domain Y2 differs: the first record is about “Sports Injuries”, while the second record is about “Senior Health”. The keywords and Abstract of each record match its sub-domain.</p>
      <p>Finally, the records are split at a ratio of 70:30 into train/test sets with 32,899 and 14,096 abstracts, respectively. We provide the training set, including the keywords column, to the participants for training their keyword and/or NER extraction system, and the test set for evaluating the system. The reason for splitting the dataframe is so that the participants do not overfit their system to the whole dataset. We encourage them to design their system based on the features learnt from the training set and apply the identical pipeline to the test set.</p>
    </sec>
    <sec id="sec-1-4">
      <title>3. Systems</title>
      <p>Now we discuss the three systems we provide to the participants as simple baselines for keyword extraction using the benchmark dataset. Certainly, there are various possible extensions to them; we list the participant contributions under Section 7.</p>
    </sec>
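    <p>For illustration, records in this pre-processed format can be loaded with plain Python. This is a minimal sketch: the two inline rows are illustrative stand-ins for the actual WOS .csv file, and the helper name load_records is ours, not part of the provided notebooks:</p>
    <preformat>
```python
import csv
import io

# Hypothetical sample in the shape of the pre-processed dataframe (Table 1);
# the rows are illustrative, not taken from the actual file.
SAMPLE = """Domain,area,keywords,Abstract
Medical,Sports Injuries,Elastic therapeutic tape; Material properties; Tension test,The aim of this study...
Medical,Senior Health,Sports injury; Athletes; Postural stability,This study examined...
"""

def load_records(csv_text):
    """Read rows and split the ';'-separated keywords column into a list."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["keywords"] = [k.strip() for k in row["keywords"].split(";")]
    return rows

records = load_records(SAMPLE)
```
    </preformat>
    <p>On the real dataframe, the same splitting of the keywords column yields the per-document reference lists used for training and evaluation.</p>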
    <sec id="sec-2">
      <title>4. System 1: TextRank Algorithm</title>
      <sec id="sec-2-1">
        <p>In System 1, we build the TextRank algorithm from scratch and add customizations to our needs, e.g., filtering by Part-of-Speech tags.</p>
        <sec id="sec-2-1-1">
          <title>4.1. TextRank</title>
          <p>The TextRank algorithm is a graph-based algorithm which, as the name suggests, is used to assign scores to texts, thereby giving a ranking [35]. It has numerous use cases in the NLP domain, including webpage ranking (better known as PageRank), extractive text summarization, and keyword extraction [35, 39, 19, 17, 40, 41]. Across different use cases, the base TextRank algorithm remains the same; one only needs to adjust what is designated as nodes, edges, and edge weights when constructing the graph from the text corpus. A higher edge weight means a higher chance of choosing that particular edge to proceed to the next node. For example, in the web context, the PageRank algorithm considers different webpages as nodes and the hyperlinks between webpage pairs as edges. Here, the edges are asymmetrically directed, since there could be a hyperlink from one page to another but not necessarily vice versa. The edges can then be weighted by the number of hyperlinks.</p>
          <p>In our keyword extraction, the TextRank algorithm works by considering terms in the text as graph nodes, term co-occurrence as edges, and the number of co-occurrences of two terms within a certain window as the edge weights. Note that the co-occurrence window has a fixed, pre-specified size (say, 5-gram within sentence boundaries). Based on this notion, the graph is treated as weighted but undirected. Subsequently, each term score is given by how “likely” an agent, starting at a random point in the graph and continuously jumping along the weighted edges, will end up at that term node after a long time horizon. The terms with higher scores are then considered more important; they are the “keywords” extracted by the TextRank system. (In the web analogy, the webpage score would correspond to the chance that an Internet user would end up at that webpage after continuously browsing through the hyperlinks; in this sense, we retrieve the most popular webpages.)</p>
        </sec>
        <sec id="sec-2-2-1">
          <title>4.2. Implementation</title>
          <p>We implement a very basic keyword extraction system based on the TextRank algorithm from scratch, in order for the participants to get hands-on experience of how the algorithm works. Subsequently, we propose additional improvement ideas so that the participants have the opportunity to be creative and improve the basic system.</p>
          <p>For the implementation, we mainly use the Python package for natural language processing called spaCy [42]. spaCy utilizes pre-trained language models to perform many NLP tasks, among other things Part-of-Speech tagging (PoS tagging), semantic dependency parsing, and Named-Entity Recognition. In our case, we use spaCy along with its small pre-trained model for the English language (en_core_web_sm) as a text pre-processor and tokenizer. The rest of the tasks are handled by the usual built-in Python libraries.</p>
          <p>Our basic system consists of the following steps:
(1.) Text pre-processing: stopword and punctuation removal.
(2.) Text tokenization: tokenize the text and build a vocabulary list.
(3.) Build the adjacency matrix from the graph:
• matrix indices in rows and columns: the terms in the vocabulary list;
• matrix entries: the co-occurrences of term pairs within the same window of pre-specified size.
(4.) Normalize the matrix and compute its stationary distribution.
(5.) Retrieve the keyword(s) corresponding to the terms with the highest stationary probabilities.</p>
          <p>The implemented code is stored as a Jupyter notebook and hosted on Google Colaboratory, which allows the participants to test and work directly on the code online without local installation. There, a step-by-step description is provided and a code sanity check was performed. For example, our system extracts the valid keywords “cute”, “dog”, “cat” (in descending order of term prominence) for a short text: “This is a very cute dog. This is another cute cat. This dog and this cat are cute”.</p>
        </sec>
        <sec id="sec-2-4-1">
          <title>4.3. Further Ideas</title>
          <p>Inspired by existing keyword extraction systems in Python such as summa [43] and pke [44], we have provided the participants with a list of ideas to further improve the keyword extraction system, along with hints for the Python implementation using spaCy (see the Jupyter notebook):
• Improve the pre-processing step:
– Remove numbers.
– Standardize casings, such as lower-casing the entire text.
– Use a domain-specific or custom-made stopword list.
• Improve the tokenization step:
– Filter by Part-of-Speech tags to only include nouns in the vocabulary list.
– Use a domain-specific tokenizer such as ScispaCy [45] for biomedical data.
– Lemmatize or stem tokens before recording them in the vocabulary list and building the adjacency matrix, so that different versions of the same word (such as plural “solitons” and singular “soliton”) are mapped to the same record.
• Add a post-processing step:
– Exclude keywords that are too short.
– Agglomerate keywords (and perhaps add back some stopwords) to form “keyphrases” (“the” and “of” should not be removed within “the Department of Health”).</p>
        </sec>
      </sec>
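      <p>The basic pipeline of Section 4.2 can be sketched end-to-end in plain Python. This is a minimal sketch, not the notebook’s code: it swaps spaCy for a regex tokenizer with a toy stopword list, and approximates the stationary distribution with damped power iteration (the damping factor is our addition, following the standard TextRank formulation):</p>
      <preformat>
```python
import re
from collections import defaultdict

# Toy stopword list, sufficient for the sanity-check text only.
STOPWORDS = {"this", "is", "a", "very", "another", "and", "are", "the", "of"}

def textrank_keywords(text, window=2, damping=0.85, iters=50, top_k=3):
    # (1.)-(2.) Pre-process and tokenize: lowercase, keep alphabetic tokens,
    # drop stopwords, and build the vocabulary list.
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    vocab = sorted(set(tokens))
    # (3.) Build a symmetric co-occurrence matrix (a dict of edge weights):
    # term pairs within `window` positions get their edge weight incremented.
    weights = defaultdict(float)
    for i, t in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != t:
                weights[(t, tokens[j])] += 1.0
                weights[(tokens[j], t)] += 1.0
    # (4.) Power iteration approximating the stationary distribution of the
    # row-normalized weight matrix (PageRank-style, with damping).
    score = {t: 1.0 / len(vocab) for t in vocab}
    out_sum = {t: sum(w for (a, _), w in weights.items() if a == t) for t in vocab}
    for _ in range(iters):
        new = {}
        for t in vocab:
            rank = sum(
                score[s] * weights[(s, t)] / out_sum[s]
                for s in vocab
                if out_sum[s] > 0 and (s, t) in weights
            )
            new[t] = (1 - damping) / len(vocab) + damping * rank
        score = new
    # (5.) Return the terms with the highest stationary scores.
    return [t for t, _ in sorted(score.items(), key=lambda kv: -kv[1])[:top_k]]

print(textrank_keywords("This is a very cute dog. This is another cute cat. "
                        "This dog and this cat are cute."))
```
      </preformat>
      <p>Because the sanity-check text only has three content terms, all of them are returned; on real abstracts, the window size and the stopword list substantially affect the ranking.</p>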
      <sec id="sec-2-5">
        <p>Advanced participants are also directed to another Python package, NetworkX, which has a built-in, computationally efficient implementation of the TextRank algorithm [46].</p>
        <sec id="sec-2-5-2">
          <title>4.4. Evaluation: Instance-Based Performance</title>
          <p>In System 1, the objective is instance-based; that is, for each abstract, we need to evaluate how well the algorithm performs. The metric could be accuracy, that is, the ability to find as many keyphrases (compared to the reference list) as possible. We can also compute precision and recall scores (micro or macro). We provide a simple baseline evaluation function in the notebook. Here, we allow fuzzy matching algorithms on the phrase level, where the cut-off ratio and the edit distance between the candidate term and the reference term can be adjusted.</p>
        </sec>
      </sec>
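      <p>Instance-based scores with phrase-level fuzzy matching can be sketched as follows. This is a minimal sketch, not the notebook’s baseline: difflib’s similarity ratio stands in for the adjustable edit-distance criterion, and the function names are ours:</p>
      <preformat>
```python
from difflib import SequenceMatcher

def fuzzy_match(candidate, reference, cutoff=0.8):
    # Phrase-level fuzzy match: difflib's similarity ratio stands in for an
    # edit-distance criterion; `cutoff` is the adjustable cut-off ratio.
    return SequenceMatcher(None, candidate.lower(), reference.lower()).ratio() >= cutoff

def precision_recall(extracted, reference, cutoff=0.8):
    # Precision: extracted phrases fuzzily matching some reference phrase;
    # recall: reference phrases fuzzily covered by some extracted phrase.
    tp = sum(any(fuzzy_match(e, r, cutoff) for r in reference) for e in extracted)
    covered = sum(any(fuzzy_match(e, r, cutoff) for e in extracted) for r in reference)
    precision = tp / len(extracted) if extracted else 0.0
    recall = covered / len(reference) if reference else 0.0
    return precision, recall
```
      </preformat>
      <p>With cutoff 0.8, singular/plural variants such as “soliton” vs. “solitons” count as matches, while unrelated phrases do not.</p>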
    </sec>
    <sec id="sec-4">
      <title>5. System 2: TextRank with Clustering</title>
      <sec id="sec-4-1">
        <p>In System 2, we extend the TextRank keyword extraction described in System 1 (see Section 4) and apply it to groups of texts clustered by the K-Means algorithm. In this way, we obtain a more focused keyword list specifically for each text group and learn about its characteristics.</p>
        <sec id="sec-4-1-1">
          <title>5.1. K-Means Algorithm</title>
          <p>The K-Means algorithm is a clustering algorithm which partitions points in a vector space into “K” clusters (“K” being pre-specified), such that each point belongs to the cluster with the nearest cluster centroid (the “mean”) [36, 37]. It works in the following steps.
(1.) Assign k random points as the cluster “means”.
(2.) Repeat the following until convergence:
a) Assignment step: assign each point to the cluster with the least squared Euclidean distance to the cluster mean;
b) Update step: recalculate the “mean” as the average of all the points assigned to each cluster;
c) terminate when the cluster assignment stabilizes.</p>
          <p>We ultimately choose the K-Means algorithm for clustering because of its low complexity: it works very fast for large datasets like ours [47, 48]. Often, one hidden caveat of the K-Means algorithm is the choice of the number of clusters “K”. However, in our specific use case with the scientific publications, we usually have a good estimate based on the number of target disciplines. Therefore, K-Means serves our purpose well.</p>
        </sec>
      </sec>
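      <p>The assignment/update loop of Section 5.1 can be sketched as follows. This is a minimal sketch; for determinism it seeds the means with the first k points, whereas step (1.) in the text uses k random points:</p>
      <preformat>
```python
def kmeans(points, k, iters=100):
    # (1.) Initialize the cluster "means" (deterministically here; the text
    # describes random initialization).
    means = [points[i] for i in range(k)]
    assign = [-1] * len(points)
    for _ in range(iters):
        # a) Assignment step: nearest mean by squared Euclidean distance.
        new_assign = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, means[c])))
            for p in points
        ]
        # c) Terminate when the cluster assignment stabilizes.
        if new_assign == assign:
            break
        assign = new_assign
        # b) Update step: recalculate each mean as the average of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                means[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assign
```
      </preformat>
      <p>On well-separated toy points, the loop stabilizes after a few iterations; production use would rely on the sklearn implementation mentioned in Section 5.3.</p>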
      <sec id="sec-4-2">
        <title>5.2. Preprocessing: Sentence-BERT Embeddings</title>
        <p>As mentioned in the previous section, K-Means clusters points in a vector space. Therefore, we need to transform each text in our dataset into a vector representation. This is often done by averaging pre-trained word embeddings over all the words that appear in the document, regardless of whether they are context-free embeddings like GloVe [49] or contextualized embeddings like BERT [50]. However, this has been shown to perform worse than directly deriving contextualized sentence embeddings (Sentence-BERT [51]). Therefore, we opt for contextualized sentence embeddings from Sentence-BERT, which is trained with Siamese BERT networks [51]. More technical details can be found in the original paper by N. Reimers and I. Gurevych [51].</p>
        <p>Sentence-BERT transforms each text into a 384-dimensional, semantically meaningful vector, which is then ready to be input to the K-Means algorithm for clustering.</p>
      </sec>
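      <p>To make the pooling contrast concrete, here is a toy sketch of the averaging baseline mentioned above (mean-pooling word vectors per document). The 2-d vectors are invented for illustration; real setups use, e.g., high-dimensional GloVe vectors or the 384-d Sentence-BERT vectors:</p>
      <preformat>
```python
def mean_pool(vectors):
    # Average pre-trained word vectors over all words in a document.
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Toy 2-d "embeddings" (invented for illustration).
emb = {"sports": [1.0, 0.0], "injury": [0.8, 0.2], "tape": [0.0, 1.0]}
doc_vec = mean_pool([emb["sports"], emb["injury"], emb["tape"]])
```
      </preformat>
      <p>Mean-pooling discards word order and context, which is one reason directly derived sentence embeddings perform better as clustering inputs.</p>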
      <sec id="sec-4-3">
        <sec id="sec-4-3-1">
          <title>5.3. Implementation</title>
          <p>We add the clustering step to our pipeline, which effectively results in the following procedure:
(1.) for each document, extract its Sentence-BERT embedding;
(2.) cluster the documents into K groups based on their Sentence-BERT embeddings, i.e., by the sentence contents;
(3.) for each document cluster, extract its keyphrases.</p>
          <p>First, we generate embedding representations for each text, which is very easy with the Python package sentence-transformers. The package sentence-transformers offers several pre-trained models for different purposes, from which we choose the small model (all-MiniLM-L6-v2). Second, to group the documents, we use the implementation in the package sklearn [52]. Furthermore, we provide a cluster visualization using the package matplotlib [53]. We set the parameter K = 7 for the K-Means algorithm, which is the number of disciplines in the WOS dataset. Finally, we extract the keyphrases from each cluster. Unlike in System 1, we do not implement the TextRank algorithm from scratch, but instead use the existing Python package pke [44]. pke provides implementations of numerous keyword extraction algorithms from publications, as well as allowing customizations such as Part-of-Speech tag filters and a limit on the maximum number of words in a single keyphrase. In our case, we simply use the basic TextRank algorithm, also to demonstrate that even the very basic algorithm can already yield satisfying outputs.</p>
          <p>Like in System 1, the code implemented for System 2 is stored as a Jupyter notebook and hosted on Google Colaboratory. A step-by-step description is provided, and a code sanity check succeeds at characterizing a cluster: the cluster mostly consisting of medical articles has relevant keyphrases such as “patient group”, “treatment effects”, and “autism patient” among the top-10 extracted keyphrases.</p>
        </sec>
        <sec id="sec-4-3-6">
          <title>5.4. Further Ideas</title>
          <p>We invite the participants to explore improvement ideas, and we provide coding hints on how to implement them with pke:
• Customize the TextRank algorithm:
– Change the window size.
• Use alternative keyword extraction algorithms instead of TextRank, such as:
– the TopicRank algorithm [54],
– the Multipartite algorithm [55],
– the BERTopic algorithm [56].
• Impose extra criteria on valid keyphrases, such as:
– change the maximum number of words allowed in a single keyphrase,
– restrict the keyphrases to the top certain percentage of all keywords.</p>
        </sec>
        <sec id="sec-4-3-3">
          <title>5.5. Evaluation: Cluster-Based Performance</title>
          <p>Using a similar evaluation function as in System 1 (see Section 4.4), we now look at a cluster-based objective. This means that we take all the keywords from the articles clustered in the same group and build a new, expanded reference list of keywords; the user-generated list is then compared against this expanded list. Notably, this approach increases the coverage of keywords in the reference, in the hope of covering more out-of-abstract keywords. However, it comes at the cost of increasing the denominator when we compare the user-generated list to the reference list. One way to better present the reference list of one cluster is to process the list by criteria such as frequency. Another way to evaluate is to use word-embedding similarities (cf. KeyBERT [33] as an example of leveraging embeddings). In this way, we have a better view of the extracted keywords and of the degree to which the user-generated list is close to the reference list. In particular, this technique is useful for assessing the difference set between the user-generated list and the reference one.</p>
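          <p>Building the expanded cluster reference list can be sketched as follows (a minimal sketch; the helper names cluster_reference and cluster_recall are ours, not the notebook’s):</p>
          <preformat>
```python
def cluster_reference(keyword_strings):
    """Merge the ';'-separated author keyword strings of all articles in one
    cluster into a single expanded reference set."""
    ref = set()
    for s in keyword_strings:
        ref.update(k.strip().lower() for k in s.split(";") if k.strip())
    return ref

def cluster_recall(user_keywords, reference):
    """Share of the expanded reference covered by the user-generated list;
    the larger reference raises coverage but also the denominator."""
    user = {u.strip().lower() for u in user_keywords}
    hits = sum(1 for k in reference if k in user)
    return hits / len(reference) if reference else 0.0
```
          </preformat>
          <p>A frequency threshold or embedding-based matching, as discussed above, can then be layered on top of this exact-match baseline.</p>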
        </sec>
        <sec id="sec-4-3-4">
          <title>6.1. Named Entities, Named-Entity</title>
        </sec>
        <sec id="sec-4-3-5">
          <title>Recognition and Keyword Extraction</title>
          <p>The goal of System 3 (“Named-Entity Recognition as Keyword Extraction”) is to emulate some of the constraints that may exist in a practical setting. These could be situations where a keyword extractor system cannot be implemented, as the output of such systems may be incorrect or nonsensical. Another situation could be that one is required to use existing tools, such as a Named-Entity Recognition (NER) system, and must enact measures to improve the output of the model.</p>
          <p>A named entity (NE) is in most cases a proper noun, the most common categories being person, location and organization; however, other categories that are not proper nouns, such as temporal expressions, are also possible. Named-Entity Recognition consists of locating and classifying named entities mentioned in unstructured text into predefined categories [57, Chapter 8.3]. Keywords are single- or multi-word expressions that under ideal circumstances should concisely represent the key content of a document [58, Page 3]. As the goal of NER is to assign a label to spans of text [57, Chapter 8.3], it is a classification task that can be solved by building a machine learning model [59].</p>
          <p>The difference between keyword extraction and NER is as follows. Named entities are words or phrases with a specific label determined by the predefined classes of a given NER model. Therefore, these entities may not necessarily represent the essential content of a document. Keywords are not limited by the fixed categories of an NER model, and may contain named entities if those entities are representative of a given document. For example, a document about Heathrow Airport can contain keywords such as “arrival”, “customs”, “departure”, “duty free”, “immigration” and “London”. Depending on the model classes, an NER model on the same text could extract entities such as “British Airways” (ORG), “London” (LOC), “United Kingdom” (LOC), etc. In this example, there is overlap between the keywords and named entities; however, due to the defining characteristics of both approaches, there is a significant difference between the lists.</p>
          <p>Figure 2 demonstrates the use of keyword extraction and named-entity recognition in the industry setting at Neue Zürcher Zeitung (NZZ), where key terms are extracted and relevant articles are assigned to the terms.</p>
          <sec id="sec-4-3-5-1">
            <title>6.2. Use of Keywords in the News Domain</title>
            <p>As mentioned above, for a given text, keywords and the output of an NER model may overlap. When it comes to analyzing news, a typical NER model (with common categories such as person, organization, and location) excels at finding named entities for the model-specific categories. However, only extracting the entities is inadequate for finding nuanced differences between multiple articles that contain identical named entities. In Table 2 we see the titles of 10 articles published in Neue Zürcher Zeitung (NZZ) during March 2022. According to the NER model for German texts used internally by the NZZ, all articles have “Ukraine” (location) as a common named entity. Despite the similarities, there are thematic differences between these articles. After using a keyword extraction system that uses methodologies similar to those in Systems 1 and 2, keywords that are not named entities were found.</p>
            <p>Table 2: Titles of 10 articles published in Neue Zürcher Zeitung (NZZ) during March 2022.</p>
          </sec>
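<p>The Heathrow Airport example can be made concrete with plain set operations; the word lists below are taken from the text, the variable names are illustrative.</p>

```python
# Keyword list vs. NER output for the Heathrow Airport example.
keywords = {"arrival", "customs", "departure", "duty free", "immigration", "London"}
ner_entities = {"British Airways": "ORG", "London": "LOC", "United Kingdom": "LOC"}

overlap = keywords & set(ner_entities)        # terms found by both views
keywords_only = keywords - set(ner_entities)  # document-specific keywords
entities_only = set(ner_entities) - keywords  # typed entities that are not keywords

print(overlap)  # {'London'}
```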
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>6.4. Pre-Trained NER Models</title>
        <p>There are some disadvantages to using pre-trained NER models. One should take into consideration that using a pre-trained model to extract named entities from documents of different domains can result in a drop in model performance [65]. The training data and categories of the model will influence the output. For example, the string “ATP” can be labeled as an organization (e.g., Association of Tennis Professionals) by one model and as a chemical (e.g., adenosine triphosphate) by a biomedical NER model. Creating an NER model for a specific type of entity requires the annotation of a corpus, which can be a significant expense and effort for the user [65].</p>
        <p>The keywords extracted for the articles in Table 2, by contrast, demonstrate thematic groupings between the articles. The most common keyword for articles 1-4 is “Flüchtlinge” (“refugees”), and for articles 5-10 it is “Neutralität” (“neutrality”). This difference can also be observed in the article titles, and upon closer inspection of the article content it is evident that some of the articles (1-4) revolve around the topic of refugees from Ukraine, while the other articles (5-10) discuss the notion of neutrality. Using named entities or, in some cases, a predefined list of keywords can be useful to define broad topic pages (see nzz.ch/themen), but keywords offer concise yet semantically rich insights into the content of a document. Therefore, they can potentially be used to automatically identify possible subtopics within a news story or to discover emerging topics from newly published articles.</p>
      </sec>
      <sec id="sec-4-5">
        <title>6.5. Further Ideas</title>
        <p>The challenge of this system lies in working with pre-calculated data from systems that cannot be influenced. The participants are provided with multiple tables containing the output of two different NER systems, as well as fastText document and word vectors (see Section 6.3). In addition, they also have a table at their disposal to verify whether a keyword for a given document is present in the abstract and whether it was discovered by any of the NER models (with 100% string matches). The intuition of System 3 is that, given the resources (cost, time, hardware), one needs to come up with the best possible strategies to detect meaningful keywords.</p>
        <sec id="sec-4-5-1">
          <title>6.3. Data Preparation</title>
        </sec>
      </sec>
      <sec id="sec-4-6">
        <title>Footnotes</title>
        <p>2. https://fasttext.cc/ (last accessed: June 20, 2022).</p>
        <p>3. https://fasttext.cc/docs/en/crawl-vectors.html (last accessed: June 20, 2022).</p>
        <sec id="sec-4-6-1">
          <title>6.6. Evaluation: Instance-based Performance</title>
        </sec>
      </sec>
      <sec id="sec-4-7">
        <title>6.3. Data Preparation</title>
        <p>The FLAIR framework [60] was chosen as it contains many out-of-the-box NER models for generic and biomedical texts. Furthermore, the framework is also useful for integrating pre-trained embeddings and models. As many of the texts are from the biomedical domain, the ScispaCy library was used for word and sentence tokenization [61]. The results of the NER models were given to the participants. The ner-english model is a 4-class NER model for English, which comes with FLAIR [62]. This model has the following categories: locations (LOC), persons (PER), organizations (ORG), and miscellaneous (MISC) [63]. We also provided participants with NER results from HunFlair [38], which is an NER tagger for biomedical texts. This biomedical NER tagger is based on the HUNER tagger and has the following named-entity categories: Chemicals, Diseases, Species, Genes or Proteins, and Cell lines [64]. As an additional hint to participants, document embeddings for each item in the train and test sets, as well as word embeddings for the entire corpus, were generated from a fastText model2 trained on the English Common Crawl dataset (cc.en.300.bin)3.</p>
        <p>In addition to the pre-calculated data, the participants were also given evaluation functions to compare differences between their system’s NER model output and the keyword list that came with the documents. There are cases where an item from the curated keyword list does not contain the keyword in the abstract, or contains a partial or inflected form of the keyword. The evaluation function contains a partial string matching sequence, where one can choose the amount of character similarity between two strings. For example, a document has the label “radio frequency”, but the string “radio frequencies” is present in the abstract, and the inflected form was also found by one of the NER models. For this case, participants can set a string similarity value (e.g., 80% similarity) to circumvent the issues caused by inflected forms or partially mentioned forms (“radio frequency” vs. “radio frequency scanner”). Using the resources at their disposal, participants must develop the best possible strategies to build a system that can detect the maximum number of relevant keywords.</p>
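<p>The partial string matching described above can be sketched with Python’s difflib. The evaluation function given to participants is not reproduced here; fuzzy_contains and its 0.8 default threshold are illustrative assumptions.</p>

```python
from difflib import SequenceMatcher

def fuzzy_contains(label, text, threshold=0.8):
    """Return True if some word span of `text` matches `label` with at
    least `threshold` character similarity (0..1)."""
    words = text.lower().split()
    n = len(label.split())
    label = label.lower()
    # Compare against spans of the same length and slightly longer ones
    # ("radio frequency" vs. "radio frequency scanner").
    for size in (n, n + 1):
        for i in range(len(words) - size + 1):
            span = " ".join(words[i:i + size])
            if SequenceMatcher(None, label, span).ratio() >= threshold:
                return True
    return False

# "radio frequencies" is an inflected form of the label "radio frequency":
print(fuzzy_contains("radio frequency", "we measured radio frequencies in situ"))  # True
```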
      </sec>
    </sec>
    <sec id="sec-5">
      <title>7. Participant Contributions</title>
      <sec id="sec-5-1">
        <title>Our participants have further investigated keyphrase extraction in System 1 and provided valuable contributions to our proceedings. Their original theses can be found in the following Google Drive folder.</title>
        <p>The basic TextRank keyword extractor in System 1
has been extended to account for the following data
preprocessing steps: (1) remove numbers; (2) restrict valid
keywords to only nouns; (3) restrict valid keywords by
imposing the minimum string length. The contribution
can be found on the Google Drive folder.</p>
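<p>The three preprocessing steps above can be sketched as a token filter. POS tags are assumed to come from an external tagger (e.g., spaCy coarse tags); valid_keyword is an illustrative helper, not the participants’ code.</p>

```python
def valid_keyword(token, pos_tag, min_length=3):
    """Filter candidate keywords per the three preprocessing steps:
    (1) remove numbers, (2) restrict to nouns, (3) impose a minimum
    string length.  `pos_tag` is assumed to be a coarse POS tag from
    a tagger such as spaCy ('NOUN', 'PROPN', 'VERB', 'NUM', ...)."""
    if any(ch.isdigit() for ch in token):   # (1) remove numbers
        return False
    if pos_tag not in ("NOUN", "PROPN"):    # (2) nouns only
        return False
    return len(token) >= min_length         # (3) minimum string length

candidates = [("extraction", "NOUN"), ("2022", "NUM"), ("run", "VERB"), ("ai", "NOUN")]
kept = [t for t, pos in candidates if valid_keyword(t, pos)]
print(kept)  # ['extraction']
```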
          <p>Additionally, the evaluation system has been generalized to output numerical performance scores, allowing simpler comparisons of different keyword extractors. The contribution can be found on the Google Drive folder.</p>
          <p>Finally, a comparison between the TextRank algorithm and further unsupervised keyphrase extraction methods has been provided. The limitation of TextRank is that it only considers the co-occurrences of word pairs and not their semantic meaning, which may cause certain extracted “frequent” word pairs to be either irrelevant or under-represented. Therefore, an experiment has been performed using the pke library to compare the performance of the TextRank algorithm and several other unsupervised keyphrase extraction algorithms on the benchmark test dataset. The contribution can be found on the Google Drive folder.</p>
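<p>The co-occurrence-only nature of TextRank can be seen in a minimal sketch (this is not pke’s implementation; the window size and damping values are illustrative): the graph records only which words appear near each other, and the PageRank-style score ignores meaning entirely.</p>

```python
from collections import defaultdict

def textrank_scores(tokens, window=2, damping=0.85, iters=50):
    """Minimal TextRank: undirected co-occurrence graph over a sliding
    window, scored with unweighted PageRank iterations.  Semantics are
    ignored -- only co-occurrence counts, which is the limitation
    discussed above."""
    graph = defaultdict(set)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[i] != tokens[j]:
                graph[tokens[i]].add(tokens[j])
                graph[tokens[j]].add(tokens[i])
    score = {v: 1.0 for v in graph}
    for _ in range(iters):
        score = {v: (1 - damping)
                    + damping * sum(score[u] / len(graph[u]) for u in graph[v])
                 for v in graph}
    return sorted(score.items(), key=lambda kv: -kv[1])

tokens = "keyword extraction ranks keyword candidates by graph centrality".split()
ranking = [w for w, _ in textrank_scores(tokens)]
```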
        <p>Beyond the academic setting, the use of keyword
extractions is demonstrated in the industry setting, where
Wyona AG utilizes keyword extractors in the working
pipeline of the Q&amp;A Chatbot “Katie”. The contribution
can be found on the Google Drive folder.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>8. Conclusion</title>
        <p>In this workshop, we provided the background and baseline systems for keyword extraction, shared a benchmark dataset on scientific keyword extraction, and invited contributions from participants from industry and academia. The methodologies discussed can be extended to keyword extraction in other domains (e.g., legal and news).</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>The authors would like to thank the organizers of SwissText2022 for hosting our workshop. Peter Egger and the Chair of Applied Economics acknowledge the support of the Department of Management, Technology, and Economics at ETH Zurich. Ce Zhang and the DS3Lab gratefully acknowledge the support of the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number MB22.00036 (for European Research Council (ERC) Starting Grant TRIDENT 101042665), the Swiss National Science Foundation (Project Numbers 200021_184628 and 197485), Innosuisse/SNF BRIDGE Discovery (Project Number 40B2-0_187132), the European Union Horizon 2020 Research and Innovation Programme (DAPHNE, 957407), the Botnar Research Centre for Child Health, the Swiss Data Science Center, Alibaba, Cisco, eBay, Google Focused Research Awards, Kuaishou Inc., Oracle Labs, Zurich Insurance, and the Department of Computer Science at ETH Zurich. We would like to thank Neue Zürcher Zeitung for collaborating on this project.</p>
      <sec id="sec-7-2">
        <title>References</title>
        <p>Y. Zhang, C. Zhang, P. Mayr, A. Suominen (Eds.), Proceedings of the 1st Workshop on AI + Informetrics (AII2021) co-located with the iConference 2021, Virtual Event, March 17th, 2021, volume 2871 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 58–70.</p>
        <p>[31] M. Krallinger, F. Leitner, O. Rabal, M. Vazquez, J. Oyarzábal, A. Valencia, CHEMDNER: The drugs and chemical names extraction challenge, Journal of Cheminformatics 7 (2015) S1.</p>
        <p>[32] M. F. Porter, An Algorithm for Suffix Stripping, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997, pp. 313–316.</p>
        <p>[33] M. Grootendorst, KeyBERT: Minimal keyword extraction with BERT, 2020. doi:10.5281/zenodo.4461265.</p>
        <p>[34] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, L. E. Barnes, HDLTex: Hierarchical deep learning for text classification, in: Machine Learning and Applications (ICMLA), 2017 16th IEEE International Conference on, IEEE, 2017.</p>
        <p>[35] R. Mihalcea, P. Tarau, TextRank: Bringing order into text, in: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 404–411.</p>
        <p>[36] S. Lloyd, Least squares quantization in PCM, IEEE Transactions on Information Theory 28 (1982) 129–137. doi:10.1109/TIT.1982.1056489.</p>
        <p>[37] J. MacQueen, Classification and analysis of multivariate observations, in: 5th Berkeley Symp. Math. Statist. Probability, 1967, pp. 281–297.</p>
        <p>[38] L. Weber, M. Sänger, J. Münchmeyer, M. Habibi, U. Leser, A. Akbik, HunFlair: an easy-to-use tool for state-of-the-art biomedical named entity recognition, Bioinformatics 37 (2021) 2792–2794. doi:10.1093/bioinformatics/btab042.</p>
        <p>[39] J. Son, Y. Shin, Music lyrics summarization method using textrank algorithm, Journal of Korea Multimedia Society 21 (2018) 45–50. doi:10.9717/kmms.2018.21.1.045.</p>
        <p>[40] C. Wu, L. Liao, F. Afedzie Kwofie, F. Zou, Y. Wang, M. Zhang, Textrank keyword extraction method based on multi-feature fusion, in: X.-S. Yang, S. Sherratt, N. Dey, A. Joshi (Eds.), Proceedings of Sixth International Congress on Information and Communication Technology, Springer Singapore, Singapore, 2022, pp. 493–501.</p>
        <p>[41] S. Pan, Z. Li, J. Dai, An improved textrank keywords extraction algorithm, in: Proceedings of the ACM Turing Celebration Conference - China, ACM TURC ’19, Association for Computing Machinery, New York, NY, USA, 2019. doi:10.1145/3321408.3326659.</p>
        <p>[42] I. Montani, M. Honnibal, S. V. Landeghem, A. Boyd, H. Peters, P. O. McCann, M. Samsonov, J. Geovedi, J. O’Regan, D. Altinok, G. Orosz, S. L. Kristiansen, D. de Kok, L. Miranda, Roman, E. Bot, L. Fiedler, G. Howard, Edward, W. Phatthiyaphaibun, R. Hudson, Y. Tamura, S. Bozek, murat, R. Daniels, P. Baumgartner, M. Amery, B. Böing, explosion/spaCy: New Span Ruler component, JSON (de)serialization of Doc, span analyzer and more, 2022. doi:10.5281/zenodo.6621076.</p>
        <p>[43] F. Barrios, F. López, L. Argerich, R. Wachenchauzer, Variations of the similarity function of TextRank for automated summarization, CoRR abs/1602.03606 (2016). arXiv:1602.03606.</p>
        <p>[44] F. Boudin, pke: an open source python-based keyphrase extraction toolkit, in: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, Osaka, Japan, 2016, pp. 69–73.</p>
        <p>[45] M. Neumann, D. King, I. Beltagy, W. Ammar, ScispaCy: Fast and robust models for biomedical natural language processing, in: Proceedings of the 18th BioNLP Workshop and Shared Task, Association for Computational Linguistics, Florence, Italy, 2019, pp. 319–327. doi:10.18653/v1/W19-5034.</p>
        <p>[46] A. Hagberg, P. Swart, D. S. Chult, Exploring network structure, dynamics, and function using NetworkX (2008).</p>
        <p>[47] D. Xu, Y. Tian, A comprehensive survey of clustering algorithms, Annals of Data Science 2 (2015) 165–193. doi:10.1007/s40745-015-0040-1.</p>
        <p>[48] A. E. Ezugwu, A. M. Ikotun, O. O. Oyelade, L. Abualigah, J. O. Agushaka, C. I. Eke, A. A. Akinyelu, A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects, Engineering Applications of Artificial Intelligence 110 (2022) 104743. doi:10.1016/j.engappai.2022.104743.</p>
        <p>[49] J. Pennington, R. Socher, C. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1532–1543. doi:10.3115/v1/D14-1162.</p>
        <p>[50] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-Y.</given-names>
            <surname>Kan</surname>
          </string-name>
          ,
          <article-title>Keyphrase extraction in scientific publications</article-title>
          , in:
          <string-name>
            <given-names>D. H.-L.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. T.</given-names>
            <surname>Sølvberg</surname>
          </string-name>
          , E. Rasmussen (Eds.),
          <source>Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2007</year>
          , pp.
          <fpage>317</fpage>
          -
          <lpage>326</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. N.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-Y.</given-names>
            <surname>Kan</surname>
          </string-name>
          ,
          <article-title>Re-examining automatic keyphrase extraction approaches in scientific articles</article-title>
          ,
          <source>Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009)</source>
          , Association for Computational Linguistics, Singapore,
          <year>2009</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Frank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Paynter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Nevill-Manning</surname>
          </string-name>
          ,
          <article-title>Domain-specific keyphrase extraction</article-title>
          , in: T.
          <string-name>
            <surname>Dean</surname>
          </string-name>
          (Ed.),
          <source>Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, IJCAI 99</source>
          , Stockholm, Sweden,
          <source>July 31 - August 6</source>
          ,
          <year>1999</year>
          . 2 Volumes, 1450 pages, Morgan Kaufmann,
          <year>1999</year>
          , pp.
          <fpage>668</fpage>
          -
          <lpage>673</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutwin</surname>
          </string-name>
          , G. Paynter, I. Witten,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nevill-Manning</surname>
          </string-name>
          , E. Frank,
          <article-title>Improving browsing in digital libraries with keyphrase indexes, Decision Support Systems 27 (</article-title>
          <year>1999</year>
          )
          <fpage>81</fpage>
          -
          <lpage>104</lpage>
          . doi:https://doi.org/10.1016/S0167-9236(99)00038-X
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Medelyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          ,
          <article-title>Domain-independent automatic keyphrase indexing with small training sets</article-title>
          ,
          <source>J. Am. Soc. Inf. Sci. Technol</source>
          .
          <volume>59</volume>
          (
          <year>2008</year>
          )
          <fpage>1026</fpage>
          -
          <lpage>1040</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>O.</given-names>
            <surname>Borisov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aliannejadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Keyword extraction for improved document retrieval in conversational search</article-title>
          ,
          <source>CoRR abs/2109.05979</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Web document clustering by using automatic keyphrase extraction</article-title>
          ,
          <source>in: 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Workshops</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>59</lpage>
          . doi:10.1109/WI-IATW.2007.46.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Hammouda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Matute</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Kamel</surname>
          </string-name>
          , CorePhrase:
          <article-title>Keyphrase extraction for document clustering</article-title>
          , in: P.
          <string-name>
            <surname>Perner</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Imiya (Eds.),
          <source>Machine Learning and Data Mining in Pattern Recogni-</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>