Keyword Extraction in Scientific Documents

Susie Xi Rao (1,∗), Piriyakorn Piriyatamwong (1,∗), Parijat Ghoshal (2,∗), Sara Nasirian (3), Sandra Mitrović (3), Emmanuel de Salis (4), Michael Wechner (5), Vanya Brucker (5), Peter Egger (1) and Ce Zhang (1)

1 Chair of Applied Economics, ETH Zurich, Switzerland
2 Neue Zürcher Zeitung AG, Zurich, Switzerland
3 Dalle Molle Institute for Artificial Intelligence, Lugano, Switzerland
4 Haute-Ecole Arc, Neuchâtel, Switzerland
5 Wyona AG, Zurich, Switzerland

SwissText 2022: Swiss Text Analytics Conference, June 08–10, 2022, Lugano, Switzerland
∗ Corresponding author. † These authors contributed equally.
srao@ethz.ch (S. X. Rao); ppiriyata@ethz.ch (P. Piriyatamwong); parijat.ghoshal@nzz.ch (P. Ghoshal); sara.nasirian@supsi.ch (S. Nasirian); sandra.mitrovic@supsi.ch (S. Mitrović); emmanuel.desalis@he-arc.ch (E. de Salis); michael.wechner@wyona.com (M. Wechner); vanya.brucker@wyona.com (V. Brucker); pegger@ethz.ch (P. Egger); cezhang@ethz.ch (C. Zhang)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract

The scientific publication output grows exponentially. It is therefore increasingly challenging to keep track of trends and changes. Understanding scientific documents is an important step in downstream tasks such as knowledge graph building, text mining, and discipline classification. In this workshop, we provide a better understanding of keyword and keyphrase extraction from the abstracts of scientific publications.

1. Introduction

Keyphrases are single- or multi-word expressions (often nouns) that capture the main ideas of a given text, but do not necessarily appear in the text itself [1, 2, 3]. Keyphrases have been shown to be useful for many tasks in the Natural Language Processing (NLP) domain, such as (1.) indexing, archiving and pinpointing information in the Information Retrieval (IR) domain [3, 4, 5, 6], (2.) document clustering [3, 7, 8], and (3.) summarizing texts [3, 9, 10, 11], just to name a few.

Keyphrase extraction has been at the forefront of various application domains, ranging from the scientific community [1, 2, 12], finance [13, 14], law [15], and news media [11, 16, 17] to patenting [18, 19] and medicine [20, 21, 22]. Despite being a seemingly straightforward task for human domain experts, automatic keyphrase extraction remains challenging.

Challenge 1: Benchmark Dataset and Keyword Reference List. One main reason is the lack of benchmark datasets and keyword reference lists, as authors often do not provide their keyphrase list unless explicitly requested or required to do so [3]. In scientific publications, we see a large variation across domains (e.g., economics, computer science, mathematics, engineering fields, humanities). For instance, publications in some disciplines, such as economics, are required to have author-generated or journal-curated keywords, while in other domains, such as computer science and engineering, not all publication venues (e.g., journals, proceedings) require authors to input keywords.

In less technical domains, such as news media, keyphrase lists may be more accessible in terms of availability and the ease of manual curation, even when reference lists are not readily available. This is because in the news domain, people have a particular interest in Named Entities (labelled entities such as person, location, event, time), as we will discuss in Section 6. However, manually curating keyphrase lists is often practically infeasible: hiring domain experts is costly, while the quality of crowdsourced annotations is difficult to control [2, 3, 11]. With the limited availability of benchmark datasets, large language models, which succeed in other NLP tasks, simply fail to optimize and generalize, as they generally require a large, well-annotated training dataset [16]. The lack of training datasets also poses challenges for the evaluation of keyword extraction systems.

Challenge 2: Evaluation of Keyword Extraction. Defining an evaluation protocol and a corresponding metric is far from trivial for the following reasons.
(1.) We should look at the ground-truth list of keywords critically. As mentioned above, there can be more than one ground-truth list of keyphrases for a given abstract. The keyword list provided in our dataset is a reference list supplied by the authors or by the publishers. One should treat it only as a reference list, not as the one and only correct list of keywords.

(2.) Keyphrases are extracted with different aims in system design. As we will explain in the rationale for the three systems in Section 3, the systems are designed to tackle various problems and are therefore optimized for different use cases. System 1 uses a simple TextRank algorithm (see Section 4), which outputs the most prominent set of keyphrases/keywords; System 2 uses TextRank on top of a clustering algorithm (see Section 5), which is targeted at grouping similar articles and then learns from the cluster of articles; and System 3 uses pre-trained models and tools for Named-Entity Recognition (NER) (see Section 6), with the goal of fully utilizing existing models and tools by only pre-processing the input and/or post-processing the output.

(3.) There are different objective functions that we want to optimize. Precision, recall, accuracy, false positive rate, and false negative rate are among the most common performance metrics for various application scenarios [23]. We might also consider the order of keyphrases, for example, as sorted by criteria such as frequency or TextRank score [24, 25]. In search engines, the hit rate is also an important metric [26]. Furthermore, one can evaluate exact matches and fuzzy matches. Fuzzy matches can be broken down into two types: "partial" matches and semantically equivalent matches [27, 28, 24]. There are other evaluation methods that account for the ranks and orders of the extracted keywords; see the Medium article [24] for inspiration.

Challenge 3: Growing Number of Scientific Publications. During the last decades, the number of scientific publications has increased exponentially each year [29], making it increasingly challenging for researchers to keep track of trends and changes, even strictly within their own field of interest [3, 30]. This bolsters the need for automatic keyword extraction for use cases such as text recommendation and summarization systems. The effect of the growing publication volume is clearly visible in major academic search engines such as Google Scholar, Web of Science, Scopus, and Microsoft Academic. For a simple query ("data mining"), three out of four failed to bring up relevant scientific publications that are prominent in the field and anticipated by human domain experts.

Figure 1: Comparison of various academic products with the query "data mining". (a) Web of Science. (b) Google Scholar. (c) Scopus. (d) Microsoft Academic.

See the query results in Figure 1 for a keyword search for "data mining" in different academic products. We can see that the search results vary largely across products, and it could be difficult for readers to choose between the different results without prior knowledge of the field. So far, only Microsoft Academic (Figure 1 (d)) has returned relevant research results that point to the most influential author and work in the field of data mining. This is because Microsoft Academic has enabled a hierarchical discipline classification (indexed by keyphrases) that supports its users when reviewing the search results. In summary, without relevant and correct keyphrases, effective indexing, and thus querying, is not feasible.

Challenge 4: Domain-Specific Keyword Extraction. Another challenge in keyphrase extraction is its domain-specific nature. One case is that a keyphrase extractor trained on generic texts may miss technical terms that do not look like usual keyword noun chunks, such as the chemical name "C4H*Cl" [31]. The issue arises in the tokenization step: a non-alphabetic character such as "4" or "*" might be treated as a separator, so such a keyword gets split into "C", "H" and "Cl", losing its original meaning. Even if the separator handling works perfectly, this type of chemical name would still confuse keyphrase extractors that filter candidate keyphrases based on Part-of-Speech (POS) tags: for POS-based extractors, it is unclear whether "C4H*Cl" is an adjective, a noun, or another POS tag.

Another case is when the keyphrase consists of a mix of generic and specific words, such as "Milky Way". "Way" is generally a stopword [32], so the keyphrase extractor might only detect "Milky" and throw away "Way" without realizing that the term "Way" is not a stopword in this specific context.

Finally, we would like to mention KeyBERT, a state-of-the-art BERT-based keyword extractor [33]. KeyBERT works by extracting multi-word chunks whose vector embeddings are most similar to that of the original sentence. Without considering the syntactic structure of the text, KeyBERT sometimes outputs keyphrases that are incorrectly trimmed, such as "algorithm analyzes" or "learning machine learning". This problem only worsens for the aforementioned examples from chemistry and astronomy, since it is not straightforward how to tokenize, i.e., "split", words and how to handle non-alphabetic characters.
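The tokenization pitfall above can be reproduced in a few lines of Python (a minimal illustration; the regular expressions are our own, not taken from any of the cited extractors):

```python
import re

text = "The reaction of C4H*Cl with water was studied."

# Naive tokenization: every non-alphabetic character acts as a separator,
# so the chemical name is shattered into meaningless pieces.
naive = re.findall(r"[A-Za-z]+", text)
# 'C4H*Cl' becomes the tokens 'C', 'H', 'Cl' inside the token stream.

# A more permissive pattern keeps digits and '*' inside tokens,
# preserving the chemical name as a single unit.
permissive = re.findall(r"[A-Za-z0-9*]+", text)

print(naive)
print(permissive)
```

The same kind of pattern adjustment would be needed for any extractor whose candidate selection relies on purely alphabetic tokens.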
Our Goals and Contributions in this Workshop. Despite the challenges, keyphrase extraction is an important step for many downstream tasks, as described above. In this workshop, we aim to cover the foundations of keyphrase extraction in scientific documents and to provide a discussion venue for academia and industry on the topic of keyword extraction. Our contributions in the workshop are as follows.

(1.) We make new use of the existing dataset from the Web of Science (WOS) [34]. This dataset has been used as a benchmark dataset for hierarchical classification systems. Since it comes with reference lists of keywords, we utilize it as a benchmark dataset for keyword extraction. In this workshop, together with the participants, we study the feasibility of that dataset in three systems.

(2.) We introduce three systems commonly used in academia and industry for keyword extraction. For the various use cases of keyword extraction, we also design baseline evaluation metrics for each system.

(3.) We encourage participants to discuss, extend, and evaluate the systems that we have introduced.

System Design of Keyword Extraction. For keyword extraction, we provide two systems based on the unsupervised, graph-based algorithm TextRank [35]. System 1 (see Section 4) develops the TextRank keyword extractor from scratch in order to understand the reasoning behind it. System 2 (see Section 5) combines the TextRank algorithm with the K-Means clustering algorithm [36, 37] to provide keyphrases for each specific field ("cluster"). In System 3 (see Section 6), we cover the NER task, where entities in a sentence are identified as person, organization, and other predefined categories. We focus primarily on the biomedical domain using the state-of-the-art biomedical NER tool HunFlair [38]. We also provide some baseline NER models for participants to evaluate.

Beyond this workshop, the keyphrase extraction and NER methods we present are applicable to other text corpora, including media texts and legal texts; one only has to be aware of the domain-specific nature and properly adjust the algorithm pipeline. To this end, we have linked the 20 Newsgroups text dataset for the participants to try their keyphrase extraction systems on.

2. Benchmark Dataset

We take a subset of 46,985 records from the Web of Science (WOS) dataset. The original WOS dataset was provided by Kamran Kowsari in the paper HDLTex: Hierarchical Deep Learning for Text Classification [34]. The original data was provided in .txt format. For ease of work, we have pre-processed the original data and stored it in .csv dataframe format, which is most compatible with our Python working setup. The final dataframe has the format shown in Table 1, where (1) each record corresponds to a single scientific document, and (2) has the following columns:

• Domain: the domain the document belongs to,
• area: the sub-domain the document belongs to,
• keywords: the list of keyphrases provided by the authors, stored as a single string with separator ";",
• Abstract: the abstract of the document.

Columns Y1 and Y2 are simply the indices of the columns Domain and area, respectively. Column Y is the sub-sub-domain, which we do not use here but include for reference.

Table 1: A sample of the WOS benchmark dataset.
Y1 | Y2 | Y   | Domain  | area            | keywords                                                    | Abstract
5  | 50 | 122 | Medical | Sports Injuries | Elastic therapeutic tape; Material properties; Tension test | The aim of this study was to analyze stabilometry in athletes...
5  | 48 | 120 | Medical | Senior Health   | Sports injury; Athletes; Postural stability                 | This study examined the influence of range of motion of the ankle joints on elderly people's balance ability...

In the corpus, we are provided with scientific articles from seven domains: Medical, Computer Science (CS), Biochemistry, Psychology, Civil, Electronics and Communication Engineering (ECE), and Mechanical and Aerospace Engineering (MAE). Therefore, column Y1 consists of unique values from 0 to 6.

In Table 1, note that both records have the same domain index Y1 of "5", corresponding to the Domain "Medical". Their sub-domains Y2 differ: the first record is about "Sports Injuries", while the second is about "Senior Health". The keywords and Abstract of each record match its sub-domain.

Finally, the records are split at a ratio of 70:30 into train/test sets with 32,899 and 14,096 abstracts, respectively. We provide the training set, including the keywords column, to the participants for training their keyword and/or NER extraction systems, and the test set for the participants to evaluate their systems. The reason for splitting the dataframe is that the participants should not overfit their systems to the whole dataset. We encourage them to design their systems based on the features learnt from the training set and to apply the identical pipeline to the test set.
3. Systems

We now discuss the three systems we provide to the participants as simple baselines for keyword extraction using the benchmark dataset. Certainly, there are various possible extensions to them. We list the participant contributions under Section 7.

4. System 1: TextRank Algorithm

In System 1, we build the TextRank algorithm from scratch and add customizations to our needs, e.g., filtering by Part-of-Speech tags.

4.1. TextRank

The TextRank algorithm is a graph-based algorithm which, as the name suggests, is used to assign scores to texts, thereby producing a ranking [35]. It has numerous use cases in the NLP domain, including webpage ranking (better known as PageRank), extractive text summarization, and keyword extraction [35, 39, 19, 17, 40, 41]. Across different use cases, the base TextRank algorithm remains the same; one only needs to adjust what is designated as nodes, edges, and edge weights when constructing the graph from the text corpus. A higher edge weight means a higher chance of choosing that particular edge to proceed to the next node. For example, in the web context, the PageRank algorithm considers webpages as nodes and the hyperlinks between webpage pairs as edges. Here, the edges are asymmetrically directed, since there can be a hyperlink from one page to another but not necessarily vice versa. The edges can then be weighted by the number of hyperlinks.

In our keyword extraction, the TextRank algorithm works by considering terms in the text as graph nodes, term co-occurrence as edges, and the number of co-occurrences of two terms within a certain window as the edge weights. Note that the co-occurrence window has a fixed, pre-specified size (say, 5-gram within sentence boundaries). Based on this notion, the graph is treated as weighted but undirected.

Subsequently, each term's score is given by how "likely" an agent, starting at a random point in the graph and continuously jumping along the weighted edges, will end up at that term's node after a long time horizon. (In the web analogy, the webpage score corresponds to the chance that an Internet user ends up on that webpage after continuously browsing through the hyperlinks; in this sense, we retrieve the most popular webpages.) The terms with higher scores are then considered more important, that is, the "keywords" extracted by the TextRank system.

4.2. Implementation

We implement a very basic keyword extraction system based on the TextRank algorithm from scratch, in order for the participants to get hands-on experience with how the algorithm works. Subsequently, we propose additional improvement ideas so that participants have the opportunity to be creative and improve the basic system.

For the implementation, we mainly use the Python package for natural language processing called spaCy [42]. spaCy utilizes pre-trained language models to perform many NLP tasks, among them Part-of-Speech tagging (POS tagging), semantic dependency parsing, and Named-Entity Recognition. In our case, we use spaCy along with its small pre-trained model for English (en_core_web_sm) as a text pre-processor and tokenizer. The remaining tasks are handled by the usual built-in Python libraries.

Our basic system consists of the following steps:

(1.) Text pre-processing: stopword and punctuation removal.
(2.) Text tokenization: tokenize the text and build a vocabulary list.
(3.) Build the adjacency matrix of the graph.
  • Matrix indices in rows and columns: terms in the vocabulary list.
  • Matrix entries: co-occurrences of term pairs within the same window of pre-specified size.
(4.) Normalize the matrix and compute its stationary distribution.
(5.) Retrieve the keyword(s) corresponding to the terms with the highest stationary probabilities.

The implemented code is stored as a Jupyter notebook hosted on Google Colaboratory, which allows the participants to test and work directly on the code online without local installation. There, a step-by-step description is provided and a code sanity check was performed. For example, our system extracts the valid keywords "cute", "dog", "cat" (in descending order of term prominence) for the short text: "This is a very cute dog. This is another cute cat. This dog and this cat are cute".

4.3. Further Ideas

Inspired by existing keyword extraction systems in Python such as summa [43] and pke [44], we have provided participants with a list of ideas to further improve the keyword extraction system, along with hints for a Python implementation using spaCy (see the Jupyter notebook):

• Improve the pre-processing step:
  – Remove numbers.
  – Standardize casing, such as lower-casing the entire text.
  – Use a domain-specific or custom-made stopword list.
• Improve the tokenization step:
  – Filter by Part-of-Speech tags to only include nouns in the vocabulary list.
  – Use a domain-specific tokenizer such as ScispaCy [45] for biomedical data.
  – Lemmatize or stem tokens before recording them in the vocabulary list and building the adjacency matrix, so that different versions of the same word (such as the plural "solitons" and the singular "soliton") are mapped to the same record.
• Add a post-processing step:
  – Exclude keywords that are too short.
• Agglomerate keywords (and perhaps add back some stopwords) to form "keyphrases" ("the" and "of" should not be removed within "the Department of Health").

Advanced participants are also directed to another Python package, NetworkX, which has a built-in, computationally efficient implementation of the PageRank computation underlying TextRank [46].

4.4. Evaluation: Instance-Based Performance

In System 1, the objective is instance-based; that is, for each abstract, we evaluate how well the algorithm performs. The metric could be accuracy, that is, the ability to find as many keyphrases (compared to the reference list) as possible. We can also compute precision and recall scores (micro or macro). We provide a simple baseline evaluation function in the notebook. Here, we allow fuzzy matching at the phrase level, where the cut-off ratio and the edit distance between the candidate term and the reference term can be adjusted.
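The five steps of Section 4.2 can be sketched in plain Python (a simplified, dependency-free illustration of the notebook's approach: spaCy's tokenization is replaced by a regular-expression split, and the stopword list, window size, and damping factor are our own choices):

```python
import re

STOPWORDS = {"this", "is", "a", "very", "another", "and", "are"}  # illustrative only

def textrank_keywords(text, window=2, damping=0.85, iters=50):
    # (1.)+(2.) Pre-process and tokenize: split into sentences, drop
    # stopwords and punctuation, and build the vocabulary.
    sentences = [
        [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOPWORDS]
        for s in re.split(r"[.!?]", text)
    ]
    vocab = sorted({w for s in sentences for w in s})
    # (3.) Adjacency "matrix" as a nested dict: co-occurrence counts of
    # term pairs within the window (undirected, weighted).
    weight = {w: dict.fromkeys(vocab, 0.0) for w in vocab}
    for s in sentences:
        for i, w in enumerate(s):
            for v in s[max(0, i - window):i]:
                if v != w:
                    weight[w][v] += 1.0
                    weight[v][w] += 1.0
    # (4.) Power iteration: stationary distribution of the random walk
    # over the normalized matrix (PageRank-style update with damping).
    score = dict.fromkeys(vocab, 1.0 / len(vocab))
    for _ in range(iters):
        new = {}
        for w in vocab:
            rank = sum(
                score[v] * weight[v][w] / sum(weight[v].values())
                for v in vocab if weight[v][w] > 0
            )
            new[w] = (1 - damping) / len(vocab) + damping * rank
        score = new
    # (5.) The highest-scoring terms are the extracted keywords.
    return sorted(vocab, key=score.get, reverse=True)

text = "This is a very cute dog. This is another cute cat. This dog and this cat are cute."
print(textrank_keywords(text))
```

On the sanity-check text above, "cute" co-occurs with both other terms and receives the highest stationary probability, matching the notebook's top keyword.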
5. System 2: TextRank with Clustering

In System 2, we extend the TextRank keyword extraction described in System 1 (see Section 4) and apply it to groups of texts clustered by the K-Means algorithm. In this way, we obtain a more focused keyword list for each text group and learn about its characteristics.

5.1. K-Means Algorithm

The K-Means algorithm is a clustering algorithm which partitions points in a vector space into "K" clusters ("K" being pre-specified), such that each point belongs to the cluster with the nearest cluster centroid (the "mean") [36, 37]. It works in the following steps.

(1.) Assign K random points as the cluster "means".
(2.) Repeat the following until convergence:
  a) Assignment step: assign each point to the cluster whose mean has the least squared Euclidean distance to the point,
  b) Update step: recalculate each "mean" as the average of all the points assigned to that cluster,
  c) Terminate when the cluster assignment stabilizes.

We ultimately choose the K-Means algorithm for clustering because of its low complexity: it works very fast for large datasets like ours [47, 48]. One often hidden caveat of the K-Means algorithm is the choice of the number of clusters "K". However, in our specific use case with scientific publications, we usually have a good estimate based on the number of target disciplines. Therefore, K-Means serves our purpose well.

5.2. Preprocessing: Sentence-BERT Embeddings

As mentioned in the previous section, K-Means clusters points in a vector space. Therefore, we need to transform each text in our dataset into a vector representation. This is often done by averaging pre-trained word embeddings over all the words that appear in the document, regardless of whether they are context-free embeddings like GloVe [49] or contextualized embeddings like BERT [50]. However, this has been shown to perform worse than directly deriving contextualized sentence embeddings (Sentence-BERT [51]). Therefore, we opt for contextualized sentence embeddings from Sentence-BERT, which is trained with Siamese BERT networks [51]. More technical details can be found in the original paper by N. Reimers and I. Gurevych [51]. Sentence-BERT transforms each text into a 384-dimensional, semantically meaningful vector, which is then ready to be the input to the K-Means algorithm for clustering.

5.3. Implementation

We add the clustering step to our pipeline, which effectively results in the following procedure:

(1.) For each document, extract its Sentence-BERT embedding,
(2.) Cluster the documents into K groups based on their Sentence-BERT embeddings, i.e., by the sentence contents,
(3.) For each document cluster, extract its keyphrases.

First, we generate embedding representations for each text, which is very easy with the Python package sentence-transformers. The package offers several pre-trained models for different purposes, from which we choose the small model (all-MiniLM-L6-v2). Second, to group the documents, we use the K-Means implementation in the package sklearn [52]. Furthermore, we provide a cluster visualization using the package matplotlib [53]. We set the parameter K = 7 for the K-Means algorithm, which is the number of disciplines in the WOS dataset.

Finally, we extract the keyphrases from each cluster. Unlike in System 1, we do not implement the TextRank algorithm from scratch but instead use the existing Python package pke [44]. pke provides implementations of numerous keyword extraction algorithms from the literature, and allows customizations such as Part-of-Speech tag filters and a limit on the maximum number of words in a single keyphrase. In our case, we simply use the basic TextRank algorithm, also to demonstrate that even this very basic algorithm can already yield satisfying outputs.

Like in System 1, the code implemented for System 2 is stored as a Jupyter notebook hosted on Google Colaboratory. A step-by-step description is provided, and a code sanity check succeeds at characterizing a cluster: the cluster consisting mostly of medical articles has relevant keyphrases such as "patient group", "treatment effects", and "autism patient" among its top-10 extracted keyphrases.

5.4. Further Ideas

We invite participants to explore improvement ideas and provide coding hints on how to implement them with pke:

• Customize the TextRank algorithm:
  – Change the window size.
• Use keyword extraction algorithms alternative to TextRank, such as:
  – the TopicRank algorithm [54],
  – the Multipartite algorithm [55],
  – the BERTopic algorithm [56].
• Impose extra criteria on valid keyphrases, such as:
  – change the maximum number of words allowed in a single keyphrase,
  – restrict keyphrases to only contain the top certain percentage of all keywords.

5.5. Evaluation: Cluster-Based Performance

Using a similar evaluation function as in System 1 (see Section 4.4), we now look at a cluster-based objective. This means that we take all the keywords from the articles clustered in the same group and build a new reference list of keywords. Subsequently, the user-generated list is evaluated against this expanded list. Notably, this approach increases the coverage of keywords in the reference, in the hope of covering more out-of-abstract keywords. However, it comes at the cost of increasing the denominator when we compare the user-generated list to the reference list. One way to better present the reference list of one cluster is to process the list by criteria such as frequency. Another way to evaluate is to use word embedding similarities (cf. KeyBERT [33] as an example of leveraging embeddings). In this way, we have a better view of the extracted keywords and the degree to which the user-generated list is close to the reference list. In particular, this technique is useful for assessing the difference set between the user-generated list and the reference one.
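The assignment/update loop of Section 5.1 can be sketched in plain Python (a dependency-free illustration on 2-D points; in the actual pipeline the points are 384-dimensional Sentence-BERT vectors and sklearn's K-Means implementation is used instead):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    # (1.) Assign k random points as the initial cluster "means".
    rng = random.Random(seed)
    means = rng.sample(points, k)
    assignment = []
    for _ in range(iters):
        # a) Assignment step: nearest mean by squared Euclidean distance.
        new_assignment = [
            min(range(k), key=lambda j: sum((p - m) ** 2 for p, m in zip(pt, means[j])))
            for pt in points
        ]
        # c) Terminate when the cluster assignment stabilizes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # b) Update step: recalculate each mean as the average of its points.
        for j in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == j]
            if members:
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, means

# Two well-separated 2-D groups; K-Means should recover them.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
assignment, means = kmeans(points, k=2)
print(assignment)
```

The choice of K here is trivial by construction; in the workshop setting, K = 7 follows from the number of target disciplines.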
In particular, Figure 2: NZZ Topic Page based on keywords and named this technique is useful for assessing the difference set entities from news articles. Accessible at nzz.ch/themen. between the user-generated list and the reference one. 6. System 3: Named-Entity are not limited by the fixed categories of an NER model, and may contain named entities if those entities are repre- Recognition as Keyword sentative of a given document. For example, a document Extraction about Heathrow Airport can contain keywords such as “arrival”, “customs”, “departure”, “duty free”, “immigra- The goal of system 3 is to emulate some of the constraints tion” and “London”. Depending on the model classes, an that may exist in a practical setting. These could be sit- NER model on the same text could extract entities such uations where a keyword extractor system cannot be as “British Airways” (ORG), “London” (LOC), “United implemented as the output of these systems may be in- Kingdom” (LOC), etc. In this example, there is overlap correct or non-sensical. Another situation could be that between the keywords and named entities; however, due one is required to use existing tools such as a Named- to the defining characteristics of both approaches, there Entity Recognition system and must enact measures to is a significant difference between the lists. improve the output of the model. Figure 2 demonstrates the use of keyword extraction and named-entity recognition in the industry setting 6.1. Named Entities, Named-Entity at Neue Zürcher Zeitung (NZZ), where key terms are Recognition and Keyword Extraction extracted and relevant articles are assigned to the terms. A named entity (NE) in most cases is a proper noun, the 6.2. Use of Keywords in the News Domain most common categories being person, location and or- ganization; however, other categories that are not proper As mentioned above, for a given text, keywords and the nouns, such as temporal expressions, are also possible. 
output of a NER model may overlap. When it comes Named-Entity Recognition consists of locating and classi- to analyzing news, a typical NER model (with common fying named entities mentioned in unstructured text into categories such as person, organization, and location) predefined categories [57, Chapter. 8.3]. Keywords are excels at finding named entities for the model-specific single or multi-word expressions that under ideal circum- categories. However, only extracting the entities is inad- stances should concisely represent the key content of a equate for finding nuanced differences between multiple document [58, Page 3]. As the goal of NER is to assign articles that contain identical named entities. In Table 2 a label to spans of text [57, Chapter. 8.3], it is a classi- we see the titles of 10 articles published in Neue Zürcher fication task that can be solved by building a machine Zeitung (NZZ) during March 2022. According to the NER learning model [59]. model for German texts used internally by the NZZ, all The difference between keyword extraction and NER articles have “Ukraine” (location) as a common named is as follows. Named entities are words or phrases with a entity. Despite the similarities, there are thematic dif- specific label determined by predefined classes of a given ferences between these articles. After using a keyword NER model. Therefore, these entities may not necessarily extraction system that uses similar methodologies men- represent the essential content of a document. Keywords tioned in Systems 1 and 2, keywords that are not named 6.4. 
Pre-Trained NER Models Number NZZ Article Title 1 Eine Zürcherin nimmt ukrainische Flüchtlinge auf – und fühlt sich vom Staat alleingelassen «Eine Solidaritätsbekundung auf Instagram zu posten, reicht nicht»: 2 3 There are some disadvantages to using pre-trained NER Viele Zürcherinnen und Zürcher möchten Flüchtlinge aus der Ukraine bei sich zu Hause aufnehmen 150 Ukraine-Flüchtlinge sind im Kinderdorf – wie geht es weiter? 4 Krieg in der Ukraine: Wie ein SVP-Dorf Flüchtlinge aufnimmt 5 models. One should take into consideration that using a Neutralität im Ukraine-Krieg - wo genau steht die Schweiz? 6 Neutralität: Fand in der Schweiz gerade eine Zeitenwende statt 7 pre-trained model to extract named entities out of docu- Putin, die Schweiz und die zwei Seiten der Neutralität 8 Christoph Blocher: Neutralität ist nicht nur Selbstzweck 9 10 ments from different domains can result in a fall in model Sicherheitspolitik: Militärische Neutralität weiterdenken Sicherheitspolitik: Solidarische Neutralität performance [65]. The training data and categories of Table 2 the model will influence the output. For example, the Titles of 10 articles published in Neue Zürcher Zeitung (NZZ) string “ATP” can be labeled as an organization (e.g. As- during March 2022. sociation of Tennis Professionals) by one model and as a chemical (e.g. adenosine triphosphate) by a biomedical- NER model. Creating an NER model for a specific type entities were found. These keywords demonstrate the- of entity requires the annotation of a corpus, which can matic groupings between the articles. The most common be a significant expense and effort for the user [65]. keyword for articles 1-4 is “Flüchtlinge” (“refugees”), and for articles 5-10 is “Neutralität” (“neutrality”). This differ- 6.5. 
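The contrast between the two term lists can be made concrete with simple set operations (the Heathrow lists below repeat the illustrative example from Section 6.1; the sets are hand-written, not model output):

```python
# Hand-written lists from the Heathrow Airport illustration; in practice
# the entities would come from an NER model and the keywords from an extractor.
keywords = {"arrival", "customs", "departure", "duty free", "immigration", "London"}
entities = {"British Airways", "London", "United Kingdom"}

overlap = keywords & entities        # terms that are both keywords and entities
keywords_only = keywords - entities  # content terms a fixed-category NER model misses
entities_only = entities - keywords  # entities that are not representative keywords

print(overlap, keywords_only, entities_only)
```

The non-empty difference sets are exactly what makes keywords useful for distinguishing articles that share identical named entities.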
Further Ideas ence can also be observed in the article titles, and upon closer inspection of the article content, it is evident that The challenge of this system lies in working with pre- some of the articles (1-4) revolve around the topic of calculated data from systems that cannot be influenced. refugees from Ukraine, while other articles (5-10) discuss The participants are provided with multiple tables with the notion of neutrality. Using named entities or, in some the output of two different NER systems, fastText doc- cases, a predefined list of keywords can be useful to de- ument, and word vectors (see Section 6.3). In addition, fine broad topic pages (see nzz.ch/themen), but keywords they also have a table at their disposal to verify whether offer concise yet semantically insights into the content a keyword for a given document is present in the abstract of a document. Therefore, they can be potentially used and whether it was discovered by any of the NER models to automatically identify possible subtopics with a news (with 100% string matches). The intuition of System 3 story or discover emerging topics from newly published is that given the resources (cost, time, hardware), one articles. needs to come up with the best possible strategies to detect meaningful keywords. 6.3. Data Preparation 6.6. Evaluation: Instance-based The FLAIR framework [60] was chosen as it contains many out-of-the-box NER models for generic and biomed- Performance ical texts. Furthermore, the framework is also useful In addition to the pre-calculated data, the participants for integrating pre-trained embeddings and models. As were also given evaluation functions to compare differ- many of the texts are from the biomedical domain, the ences between their system NER model output and the ScispaCy library was used for word and sentence tok- keyword list that came with the documents. There are enization [61]. 
The results of the NER models were given cases where an item from the curated keyword list does to the participants. The ner-english model is a 4-class not contain the keyword in the abstract, or contains a NER model for English, which comes with FLAIR [62]. partial or inflected form of the keyword. The evaluation This model has the following categories: locations (LOC), function contains a partial string matching sequence, persons (PER), organizations (ORG), and miscellaneous where one can choose the amount of character similarity (MISC) [63]. We also provided participants with NER between two strings. For example, a document has the results from HunFlair [38], which is an NER tagger for label “radio frequency”, but the string “radio frequen- biomedical texts. This biomedical NER tagger is based on cies” is present in the abstract and the inflected form the HUNER tagger, and has the follwing named-entity cat- was also found by one of the NER models. For this case, egories: Chemicals, Diseases, Species, Genes or Proteins, participants can set a string similarity value (e.g., 80% and Cell lines [64]. As an additional hint to participants, similarity) to circumvent the issues caused by inflected document embeddings for each item in the train and test forms, or partially mentioned forms (“radio frequency” sets, as well as word embeddings for the entire corpus, vs. “radio frequency scanner”). Using the resources at were generated from a fastText model2 trained on the their disposal, participants must develop the best possible English Common Crawl dataset (cc.en.300.bin )3 . strategies to build a system that can detect the maximum number of relevant keywords. 2 https://fasttext.cc/ (last accessed: June 20, 2022). 3 https://fasttext.cc/docs/en/crawl-vectors.html (last accessed: June 20, 2022). 7. 
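Such a partial match can be sketched with Python's standard difflib. This is an illustrative stand-in for the evaluation function handed to participants, using the 80% threshold from the example above; the function names are ours, not the workshop's:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def keyword_found(keyword, candidates, threshold=0.8):
    """True if any candidate string reaches the similarity threshold."""
    return any(similarity(keyword, c) >= threshold for c in candidates)

# The inflected form "radio frequencies" still counts as a hit
# for the curated label "radio frequency" at an 80% threshold.
candidates = ["radio frequencies", "scanner"]
print(keyword_found("radio frequency", candidates))  # → True
print(keyword_found("keyphrase", candidates))        # → False
```

Raising the threshold toward 100% recovers exact string matching; lowering it tolerates more inflection and truncation at the risk of spurious matches.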
7. Participant Contributions

Our participants have further investigated keyphrase extraction in System 1 and provided valuable contributions to our proceedings. Their original theses can be found in the following Google Drive folder.

The basic TextRank keyword extractor in System 1 has been extended with the following data preprocessing steps: (1) remove numbers; (2) restrict valid keywords to nouns only; (3) restrict valid keywords by imposing a minimum string length. The contribution can be found in the Google Drive folder.

Additionally, the evaluation system has been generalized to output numerical performance scores, allowing simpler comparisons of different keyword extractors. The contribution can be found in the Google Drive folder.
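The three preprocessing filters can be sketched in a few lines of pure Python, assuming each candidate token already carries a part-of-speech tag (the "NOUN" tag name and the sample candidates below are illustrative, not the contribution's actual tagger output):

```python
def filter_candidates(candidates, min_length=3):
    """Apply the three preprocessing filters to (token, pos_tag) pairs:
    (1) drop tokens containing digits, (2) keep only nouns,
    (3) enforce a minimum string length."""
    return [
        token
        for token, pos in candidates
        if not any(ch.isdigit() for ch in token)  # (1) remove numbers
        and pos == "NOUN"                         # (2) nouns only
        and len(token) >= min_length              # (3) minimum length
    ]

# Illustrative candidates, tagged by hand.
candidates = [
    ("keyphrase", "NOUN"),
    ("extract", "VERB"),
    ("2022", "NUM"),
    ("ai", "NOUN"),    # too short under min_length=3
    ("graph", "NOUN"),
]
print(filter_candidates(candidates))  # → ['keyphrase', 'graph']
```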
Finally, a comparison between the TextRank algorithm and further unsupervised keyphrase extraction methods has been provided. The limitation of TextRank is that it only considers the co-occurrences of word pairs and not their semantics, which may cause certain extracted "frequent" word pairs to be either irrelevant or under-represented. Therefore, an experiment has been performed using the pke library to compare the performance of the TextRank algorithm and several other unsupervised keyphrase extraction algorithms on the benchmark test dataset. The contribution can be found in the Google Drive folder.
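To make the co-occurrence point concrete, here is a simplified pure-Python sketch of TextRank's ranking step (following Mihalcea and Tarau [35]; the pke implementation differs in tokenization, POS filtering, and phrase formation): words become nodes, co-occurrence within a sliding window adds edges, and a few PageRank iterations score the nodes. Because the scores depend only on co-occurrence structure, semantically central but rare words can end up ranked low:

```python
from collections import defaultdict

def textrank_scores(tokens, window=2, damping=0.85, iterations=30):
    """Score words by PageRank over an undirected co-occurrence graph."""
    # Build co-occurrence edges within a sliding window.
    neighbors = defaultdict(set)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[i] != tokens[j]:
                neighbors[tokens[i]].add(tokens[j])
                neighbors[tokens[j]].add(tokens[i])
    neighbors = dict(neighbors)
    # Standard PageRank iteration on the unweighted graph.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[w]
            )
            for w in neighbors
        }
    return scores

tokens = "keyword extraction ranks keyword candidates by graph centrality".split()
ranked = sorted(textrank_scores(tokens).items(), key=lambda kv: -kv[1])
print([word for word, _ in ranked[:3]])
```

A repeated word such as "keyword" accumulates edges and score regardless of its semantic importance, which is precisely the limitation the contribution examines.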
Beyond the academic setting, the use of keyword extraction is demonstrated in an industry setting: Wyona AG utilizes keyword extractors in the working pipeline of the Q&A chatbot "Katie". The contribution can be found in the Google Drive folder.

8. Conclusion

In this workshop, we provided the background and baseline systems for keyword extraction, shared a benchmark dataset on scientific keyword extraction, and invited contributions from participants from industry and academia. The methodologies discussed can be extended to keyword extraction in other domains (e.g., legal and news).

Acknowledgements

The authors would like to thank the organizers from SwissText2022 for hosting our workshop. Peter Egger and the Chair of Applied Economics acknowledge the support of the Department of Management, Technology, and Economics at ETH Zurich. Ce Zhang and the DS3Lab gratefully acknowledge the support from the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number MB22.00036 (for European Research Council (ERC) Starting Grant TRIDENT 101042665), the Swiss National Science Foundation (Project Numbers 200021_184628 and 197485), Innosuisse/SNF BRIDGE Discovery (Project Number 40B2-0_187132), the European Union Horizon 2020 Research and Innovation Programme (DAPHNE, 957407), the Botnar Research Centre for Child Health, the Swiss Data Science Center, Alibaba, Cisco, eBay, Google Focused Research Awards, Kuaishou Inc., Oracle Labs, Zurich Insurance, and the Department of Computer Science at ETH Zurich. We would like to thank Neue Zürcher Zeitung for collaborating on this project.

References

[1] T. D. Nguyen, M.-Y. Kan, Keyphrase extraction in scientific publications, in: D. H.-L. Goh, T. H. Cao, I. T. Sølvberg, E. Rasmussen (Eds.), Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers, Springer Berlin Heidelberg, Berlin, Heidelberg, 2007, pp. 317–326.
[2] S. N. Kim, M.-Y. Kan, Re-examining automatic keyphrase extraction approaches in scientific articles, in: Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009), Association for Computational Linguistics, Singapore, 2009, pp. 9–16.
[3] E. Frank, G. W. Paynter, I. H. Witten, C. Gutwin, C. G. Nevill-Manning, Domain-specific keyphrase extraction, in: T. Dean (Ed.), Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, IJCAI 99, Stockholm, Sweden, July 31 – August 6, 1999, 2 Volumes, Morgan Kaufmann, 1999, pp. 668–673.
[4] C. Gutwin, G. Paynter, I. Witten, C. Nevill-Manning, E. Frank, Improving browsing in digital libraries with keyphrase indexes, Decision Support Systems 27 (1999) 81–104. doi:10.1016/S0167-9236(99)00038-X.
[5] O. Medelyan, I. H. Witten, Domain-independent automatic keyphrase indexing with small training sets, J. Am. Soc. Inf. Sci. Technol. 59 (2008) 1026–1040.
[6] O. Borisov, M. Aliannejadi, F. Crestani, Keyword extraction for improved document retrieval in conversational search, CoRR abs/2109.05979 (2021).
[7] J. Han, T. Kim, J. Choi, Web document clustering by using automatic keyphrase extraction, in: 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology – Workshops, 2007, pp. 56–59. doi:10.1109/WI-IATW.2007.46.
[8] K. M. Hammouda, D. N. Matute, M. S. Kamel, Corephrase: Keyphrase extraction for document clustering, in: P. Perner, A. Imiya (Eds.), Machine Learning and Data Mining in Pattern Recognition, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 265–274.
[9] M. Litvak, M. Last, Graph-based keyword extraction for single-document summarization, in: Proceedings of the Workshop on Multi-Source Multilingual Information Extraction and Summarization, MMIES ’08, Association for Computational Linguistics, USA, 2008, pp. 17–24.
[10] K. Sarkar, A keyphrase-based approach to text summarization for English and Bengali documents, Int. J. Technol. Diffus. 5 (2014) 28–38. doi:10.4018/ijtd.2014040103.
[11] J. R. Thomas, S. K. Bharti, K. S. Babu, Automatic keyword extraction for text summarization in e-newspapers, in: Proceedings of the International Conference on Informatics and Analytics, ICIA-16, Association for Computing Machinery, New York, NY, USA, 2016. doi:10.1145/2980258.2980442.
[12] S. N. Kim, O. Medelyan, M.-Y. Kan, T. Baldwin, SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles, in: Proceedings of the 5th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Uppsala, Sweden, 2010, pp. 21–26.
[13] J. Li, Y. Li, Z. Xue, Keywords extraction algorithm of financial review based on Dirichlet multinomial model, in: Y. Jia, W. Zhang, Y. Fu (Eds.), Proceedings of 2020 Chinese Intelligent Systems Conference, Springer Singapore, Singapore, 2021, pp. 107–116.
[14] M. Pejić Bach, Ž. Krstić, S. Seljan, L. Turulja, Text mining for big data analysis in financial sector: A literature review, Sustainability 11 (2019). doi:10.3390/su11051277.
[15] M. Jungiewicz, M. Łopuszyński, Unsupervised keyword extraction from Polish legal texts, in: A. Przepiórkowski, M. Ogrodniczuk (Eds.), Advances in Natural Language Processing, Springer International Publishing, Cham, 2014, pp. 65–70.
[16] D. Wu, W. Uddin Ahmad, S. Dev, K.-W. Chang, Representation learning for resource-constrained keyphrase generation, arXiv e-prints (2022). arXiv:2203.08118.
[17] J. Piskorski, N. Stefanovitch, G. Jacquet, A. Podavini, Exploring linguistically-lightweight keyword extraction techniques for indexing news articles in a multilingual set-up, in: Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Association for Computational Linguistics, Online, 2021, pp. 35–44.
[18] S. Suzuki, H. Takatsuka, Extraction of keywords of novelties from patent claims, in: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, The COLING 2016 Organizing Committee, Osaka, Japan, 2016, pp. 1192–1200.
[19] J. Hu, S. Li, Y. Yao, L. Yu, G. Yang, J. Hu, Patent keyword extraction algorithm based on distributed representation for patent classification, Entropy 20 (2018). doi:10.3390/e20020104.
[20] H. Ding, X. Luo, Attention-based unsupervised keyphrase extraction and phrase graph for COVID-19 medical literature retrieval, ACM Trans. Comput. Healthcare 3 (2021). doi:10.1145/3473939.
[21] M. Komenda, M. Karolyi, A. Pokorná, M. Víta, V. Kríž, Automatic keyword extraction from medical and healthcare curriculum, in: 2016 Federated Conference on Computer Science and Information Systems (FedCSIS), 2016, pp. 287–290.
[22] Q. Li, Y.-F. B. Wu, Identifying important concepts from medical documents, Journal of Biomedical Informatics 39 (2006) 668–679. doi:10.1016/j.jbi.2006.02.001.
[23] A. Zehtab-Salmasi, M.-R. Feizi-Derakhshi, M.-A. Balafar, Frake: Fusional real-time automatic keyword extraction, 2021. arXiv:2104.04830.
[24] C. Sun, L. Hu, S. Li, T. Li, H. Li, L. Chi, A review of unsupervised keyphrase extraction methods using within-collection resources, Symmetry 12 (2020). doi:10.3390/sym12111864.
[25] D. Mahata, R. R. Shah, J. Kuriakose, R. Zimmermann, J. R. Talburt, Theme-weighted ranking of keywords from text documents using phrase embeddings, in: 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2018, pp. 184–189. doi:10.1109/MIPR.2018.00041.
[26] S.-C. Kuai, W.-H. Liao, C.-Y. Chang, G.-J. Yu, Fbkea: A feature-based keyword extraction algorithm for improving hit performance, in: 2021 IEEE International Conference on Consumer Electronics – Taiwan (ICCE-TW), 2021, pp. 1–2. doi:10.1109/ICCE-TW52618.2021.9602870.
[27] R. Saga, H. Kobayashi, T. Miyamoto, H. Tsuji, Measurement evaluation of keyword extraction based on topic coverage, in: C. Stephanidis (Ed.), HCI International 2014 – Posters’ Extended Abstracts, Springer International Publishing, Cham, 2014, pp. 224–227.
[28] F. Liu, X. Huang, W. Huang, S. X. Duan, Performance evaluation of keyword extraction methods and visualization for student online comments, Symmetry 12 (2020). doi:10.3390/sym12111923.
[29] L. Bornmann, R. Mutz, Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references, Journal of the Association for Information Science and Technology (JASIST) 66 (2015) 2215–2222. doi:10.1002/asi.23329.
[30] B. Hua, Y. Shin, Extraction of sentences describing originality from conclusion in academic papers, in: Y. Zhang, C. Zhang, P. Mayr, A. Suominen (Eds.), Proceedings of the 1st Workshop on AI + Informetrics (AII2021) co-located with the iConference 2021, Virtual Event, March 17th, 2021, volume 2871 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 58–70.
[31] M. Krallinger, F. Leitner, O. Rabal, M. Vazquez, J. Oyarzábal, A. Valencia, Chemdner: The drugs and chemical names extraction challenge, Journal of Cheminformatics 7 (2015) S1.
[32] M. F. Porter, An algorithm for suffix stripping, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997, pp. 313–316.
[33] M. Grootendorst, KeyBERT: Minimal keyword extraction with BERT, 2020. doi:10.5281/zenodo.4461265.
[34] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, L. E. Barnes, HDLTex: Hierarchical deep learning for text classification, in: Machine Learning and Applications (ICMLA), 2017 16th IEEE International Conference on, IEEE, 2017.
[35] R. Mihalcea, P. Tarau, TextRank: Bringing order into text, in: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 404–411.
[36] S. Lloyd, Least squares quantization in PCM, IEEE Transactions on Information Theory 28 (1982) 129–137. doi:10.1109/TIT.1982.1056489.
[37] J. MacQueen, Classification and analysis of multivariate observations, in: 5th Berkeley Symp. Math. Statist. Probability, 1967, pp. 281–297.
[38] L. Weber, M. Sänger, J. Münchmeyer, M. Habibi, U. Leser, A. Akbik, HunFlair: An easy-to-use tool for state-of-the-art biomedical named entity recognition, Bioinformatics 37 (2021) 2792–2794. doi:10.1093/bioinformatics/btab042.
[39] J. Son, Y. Shin, Music lyrics summarization method using TextRank algorithm, Journal of Korea Multimedia Society 21 (2018) 45–50. doi:10.9717/kmms.2018.21.1.045.
[40] C. Wu, L. Liao, F. Afedzie Kwofie, F. Zou, Y. Wang, M. Zhang, TextRank keyword extraction method based on multi-feature fusion, in: X.-S. Yang, S. Sherratt, N. Dey, A. Joshi (Eds.), Proceedings of Sixth International Congress on Information and Communication Technology, Springer Singapore, Singapore, 2022, pp. 493–501.
[41] S. Pan, Z. Li, J. Dai, An improved TextRank keywords extraction algorithm, in: Proceedings of the ACM Turing Celebration Conference – China, ACM TURC ’19, Association for Computing Machinery, New York, NY, USA, 2019. doi:10.1145/3321408.3326659.
[42] I. Montani, M. Honnibal, S. V. Landeghem, A. Boyd, H. Peters, P. O. McCann, M. Samsonov, J. Geovedi, J. O’Regan, D. Altinok, G. Orosz, S. L. Kristiansen, D. de Kok, L. Miranda, Roman, E. Bot, L. Fiedler, G. Howard, Edward, W. Phatthiyaphaibun, R. Hudson, Y. Tamura, S. Bozek, murat, R. Daniels, P. Baumgartner, M. Amery, B. Böing, explosion/spaCy: New Span Ruler component, JSON (de)serialization of Doc, span analyzer and more, 2022. doi:10.5281/zenodo.6621076.
[43] F. Barrios, F. López, L. Argerich, R. Wachenchauzer, Variations of the similarity function of TextRank for automated summarization, CoRR abs/1602.03606 (2016). arXiv:1602.03606.
[44] F. Boudin, pke: An open source Python-based keyphrase extraction toolkit, in: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, Osaka, Japan, 2016, pp. 69–73.
[45] M. Neumann, D. King, I. Beltagy, W. Ammar, ScispaCy: Fast and robust models for biomedical natural language processing, in: Proceedings of the 18th BioNLP Workshop and Shared Task, Association for Computational Linguistics, Florence, Italy, 2019, pp. 319–327. doi:10.18653/v1/W19-5034.
[46] A. Hagberg, P. Swart, D. S. Chult, Exploring network structure, dynamics, and function using NetworkX (2008).
[47] D. Xu, Y. Tian, A comprehensive survey of clustering algorithms, Annals of Data Science 2 (2015) 165–193. doi:10.1007/s40745-015-0040-1.
[48] A. E. Ezugwu, A. M. Ikotun, O. O. Oyelade, L. Abualigah, J. O. Agushaka, C. I. Eke, A. A. Akinyelu, A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects, Engineering Applications of Artificial Intelligence 110 (2022) 104743. doi:10.1016/j.engappai.2022.104743.
[49] J. Pennington, R. Socher, C. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1532–1543. doi:10.3115/v1/D14-1162.
[50] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.
[51] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3982–3992. doi:10.18653/v1/D19-1410.
[52] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[53] J. D. Hunter, Matplotlib: A 2D graphics environment, Computing in Science & Engineering 9 (2007) 90–95. doi:10.1109/MCSE.2007.55.
[54] A. Bougouin, F. Boudin, B. Daille, TopicRank: Graph-based topic ranking for keyphrase extraction, in: International Joint Conference on Natural Language Processing (IJCNLP), Nagoya, Japan, 2013, pp. 543–551. URL: https://hal.archives-ouvertes.fr/hal-00917969.
[55] F. Boudin, Unsupervised keyphrase extraction with multipartite graphs, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 667–672. doi:10.18653/v1/N18-2105.
[56] M. Grootendorst, BERTopic: Neural topic modeling with a class-based TF-IDF procedure, arXiv preprint arXiv:2203.05794 (2022).
[57] D. Jurafsky, J. H. Martin, Speech and language processing (draft), in preparation [cited 2020 June 1]. Available from: https://web.stanford.edu/~jurafsky/slp3 (2018).
[58] M. W. Berry, J. Kogan, Text mining: Applications and theory, John Wiley & Sons, 2010.
[59] A. Mansouri, L. S. Affendey, A. Mamat, Named entity recognition approaches, International Journal of Computer Science and Network Security 8 (2008) 339–344.
[60] A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, R. Vollgraf, FLAIR: An easy-to-use framework for state-of-the-art NLP, in: NAACL 2019, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 2019, pp. 54–59.
[61] M. Neumann, D. King, I. Beltagy, W. Ammar, ScispaCy: Fast and robust models for biomedical natural language processing, CoRR abs/1902.07669 (2019).
[62] A. Akbik, D. Blythe, R. Vollgraf, Contextual string embeddings for sequence labeling, in: COLING 2018, 27th International Conference on Computational Linguistics, 2018, pp. 1638–1649.
[63] E. F. Tjong Kim Sang, F. De Meulder, Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition, in: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003, pp. 142–147.
[64] L. Weber, J. Münchmeyer, T. Rocktäschel, M. Habibi, U. Leser, HUNER: Improving biomedical NER with pretraining, Bioinformatics 36 (2019) 295–302. doi:10.1093/bioinformatics/btz528.
[65] M. Marrero, J. Urbano, S. Sánchez-Cuadrado, J. Morato, J. M. Gómez-Berbís, Named entity recognition: Fallacies, challenges and opportunities, Computer Standards & Interfaces 35 (2013) 482–489. doi:10.1016/j.csi.2012.09.004.

A. List of participants to the workshop

We thank our workshop participants for valuable feedback, contributions, and suggestions.

Susie Xi Rao, ETH Zurich (Organizer)
Piriyakorn Piriyatamwong, ETH Zurich (Organizer)
Parijat Ghoshal, NZZ AG (Organizer)
Vanya Brucker, Wyona AG
Andrea Bussolan, SUPSI
Mercedes García Martínez, Pangeanic
Sandra Mitrović, IDSIA USI-SUPSI
Sara Nasirian, SUPSI
Emmanuel de Salis, HE-Arc
Natasa Sarafijanovic-Djukic, FFHS
Dietrich Trautmann, Thomson Reuters
Michael Wechner, Wyona AG
Peter Egger, ETH Zurich (Principal Investigator)
Ce Zhang, ETH Zurich (Principal Investigator)