
1613-0073

Contextual Embeddings to Extract Topic Genealogy from Scientific Literature

Alfio Ferrara, Stefano Montanelli (stefano.montanelli@unimi.it), Sergio Picascia (sergio.picascia@unimi.it), Davide Riva (davide.riva1@unimi.it)

Via Celoria, 18 - 20133 Milano, Italy

Keywords: Natural Language Processing, Scientometrics, Topic Genealogy

Modeling the evolution of topics and forecasting future trends are crucial tasks when analyzing scientific papers. In this work we propose tASKE (temporal Automated System for Knowledge Extraction), a dynamic topic modeling approach which exploits zero-shot classification and contextual embeddings in order to track topic evolution through time. The approach is evaluated against a corpus of data science papers, assessing the ability of tASKE to correctly classify documents and to retrieve relevant derivation relationships between older and newer topics over time.

CEUR ceur-ws.org

1. Introduction

With the amount of published scientific literature increasing each year, keeping track of newly formulated topics and their derivation process becomes a challenge for researchers, scholars, and publishers. The problem lies in the fact that the total number of definitions, theorems, properties, tasks, and subdomains tends to grow exponentially, since several of them may be conceived starting from a single one or from the interaction of a few. For instance, in the domain of Machine Learning, the idea of neural networks gave rise to that of deep learning, which has then been applied to problems such as image reconstruction and partial differential equations, and was further deepened with topics such as attention, which in turn provided the intuition behind transformers and a basis for explainability.

Referring to definitions, theorems, properties, tasks, and subdomains as “topics”, abstract objects a text refers to, it is possible to study “topic genealogy” in a diachronic corpus, i.e. the descent of topics from older ones over time. The task of extracting topic genealogy falls within the scope of Knowledge Extraction (KE), and it consists of two main sub-tasks: i) topic extraction, by which we aim to retrieve topics that are important in a written document, possibly in a timely manner in order to discover topics when they actually appear, and ii) genealogy reconstruction, in which extracted topics are placed in a tree structure representing their lineage in the history of the discipline.

SDU2023: The Third AAAI Workshop on Scientific Document Understanding, February 14, 2023, Washington, DC. CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

In this paper, we present tASKE, a method to extract topics from a diachronic corpus of scientific papers and reconstruct their genealogy in a completely unsupervised way. Our method is developed upon our Automated System for Knowledge Extraction (ASKE) framework [1], which relies on pre-trained contextual embedding models to represent documents and topics in the same vector space and on a cyclical term extraction and clustering phase to extract new topics. Besides presenting tASKE as a time-aware extension of ASKE, we introduce an evaluation framework and a case study on a corpus of papers from the Data Science domain, with the goal of demonstrating the effectiveness of tASKE both for topic extraction and for extracting derivation relationships between topics.

The work is organized as follows: Section 2 (Related Work) reports on the literature about topic modeling as well as the technology underlying our method. Section 3 (Methodology) presents the methodology and techniques employed in tASKE. Section 4 (Case Study and Evaluation) presents the case study on a Data Science literature corpus, on which the evaluation was conducted. Section 5 (Concluding Remarks) draws some conclusions and sketches future work.

2. Related Work

The task of classifying large amounts of textual documents without relying on labeled data and presenting latent topics is addressed employing topic modeling techniques. Latent Semantic Analysis (LSA) [2] was one of the first proposed approaches, exploiting Singular Value Decomposition (SVD) in order to reduce the number of dimensions of a document-term matrix and to easily compute similarity between document vector representations. LSA was soon followed by Latent Dirichlet Allocation (LDA) [3], which employs Bayesian analysis in order to optimize the distributions of documents belonging to topics, and of words defining these topics. The majority of recent works in topic modeling take their inspiration from the original LDA, with several variations proposed, such as Correlated Topic Modeling [4] and Hierarchical Topic Modeling [5].

Common topic modeling methods are not able to capture the changes of topics over time. For this reason, techniques of Dynamic Topic Modeling (DTM) are employed when dealing with diachronic corpora. Since the first approach (Dynamic LDA [6]) was proposed, the field has been attracting attention among researchers. Among the possible applications of the designed methods, the study of scientific papers, also known as “Scientometrics”, was addressed with the aim to assess past and present trends in a specific discipline [7] or to forecast possible future subareas of research interest [8].

Later studies have been taking into consideration the integration between DTM and word embeddings [9] so as to further capture the semantic aspect of the analyzed documents [10]. Embedding techniques are widely employed in the field of Natural Language Processing (NLP), in order to represent textual data in a vector space. Several models capable of computing contextual token embeddings have been released since the presentation of BERT [11], each of them being tailored to specific tasks, such as semantic similarity [12] and zero-shot learning [13]. Zero-Shot Learning (ZSL) is a problem setup in the field of machine learning, where a classifier is required to predict labels of examples extracted from classes that were never observed in the training phase. It was first referred to as dataless classification in 2008 [14] and has quickly become a subject of interest, particularly in the field of NLP. The great advantage of this approach consists in the resulting classifier being able to operate efficiently in a partially or totally unlabeled environment.

tASKE aims at dynamically modeling the presence and evolution of latent topics in a diachronic corpus of documents. It exploits zero-shot learning and contextual embeddings not only to perform the classification task, but also to extract relevant knowledge from textual data.

3. Methodology

The objective of tASKE is to extract a genealogy of topics from a diachronic corpus of documents. Every piece of information is stored in a graph-based data structure called tASKE Conceptual Graph (ACG), whose architecture is illustrated in Figure 1.

[Figure 1: architecture of the ACG; edge labels: classification, derivation, belonging.]

The nodes in the ACG model belong to three different categories:

• document chunks: the object of the analysis, they are small portions of the original documents extracted through the application of tokenization techniques. They are tuples of the form $(k, \mathbf{k})$, where $k$ is the text of the document chunk and $\mathbf{k}$ is its vector representation;
• topics: they represent the abstract objects to which document chunks are assigned and, in practice, they are clusters of related terms. They are tuples of the form $(c, \mathbf{c})$, where $c$ is the label given to the topic and $\mathbf{c}$ is its vector representation;
• terms: they are extracted from document chunks and clustered together in order to form topics. They are triplets of the form $(w, \delta, \mathbf{w})$, where $w$ is the label of the term, $\delta$ is a short sentence giving the term definition, and $\mathbf{w}$ is the term vector representation.

The vector representations $\mathbf{k}$ of document chunks and $\mathbf{w}$ of terms are computed by an embedding model which maps a text into a vector space: for $\mathbf{k}$, the embedding model is applied over the document chunk text $k$, while for $\mathbf{w}$, it is applied over the term definition $\delta$. The vector representation $\mathbf{c}$ of topics is computed as the mean of the vectors $\mathbf{w}$ of all the terms belonging to $c$. The label of each topic corresponds to the label of the term that is the closest to $\mathbf{c}$.

At the beginning of the analysis (i.e., time 0) the user is required to define a set of initial topics $C^{(0)}$ of interest. Each topic $c^{(0)} \in C^{(0)}$ is associated with a set of corresponding terms $W^{(0)}$, whose definitions are also provided by the user. At each subsequent time $t$, tASKE performs one or more iterations of the cycle depicted in Figure 2. As a first step, tASKE extracts the set of document chunks $K^{(t)}$ from the subset $D^{(t)} \subseteq D$ of the corpus belonging to that period. Such document chunks are classified with respect to the topics discovered up to the previous time period $\overleftarrow{t}$, $C^{(\overleftarrow{t})}$. Moreover, tASKE extracts new terms $W^{(t)}$ from the document chunks and assigns them to the topics $C^{(\overleftarrow{t})}$, finally updating the set of current topics $C^{(t)}$.

[Figure 2: the tASKE cycle; labels: corpus, embedding model, knowledge base, data preprocessing, terminology enrichment, topic formation, ACG.]

As a consequence of this process, in ACG a topic can have multiple relations with the other components of ACG. In particular, for topic $c^{(\overleftarrow{t})}$, we have i) a relation classification with document chunks $K^{(t)}$; ii) a relation derivation with terms $W^{(t)}$ discovered from document chunks associated with $c^{(\overleftarrow{t})}$; iii) a relation belonging with the terms in its cluster; iv) a relation derivation with a new topic $c^{(t)}$ formed by some of the terms in $W^{(t)}$. It can happen that $c^{(\overleftarrow{t})}$ is not associated with any document chunk $k^{(t)}$ at time $t$. This means $c^{(\overleftarrow{t})}$ is no longer a useful topic with respect to the documents of time $t$. In this case, the topic $c^{(\overleftarrow{t})}$ becomes inactive, together with the set of terms belonging to it, and it will not be able to form new topics.

This can be interpreted as the disappearance of interest towards a certain topic, which emerged in past periods $\overleftarrow{t}$ but has lost its relevance in the current corpus, $D^{(t)}$.

In the remaining part of this section, we will discuss each phase in Figure 2, explaining in deeper detail how each of the aforementioned relations is discovered.
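The term and topic nodes described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and function names are hypothetical, closeness to the topic vector is measured by cosine similarity (an assumption), and the two-dimensional vectors and placeholder definitions are hand-made stand-ins for what the embedding model and the knowledge base would provide.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


@dataclass
class Term:
    label: str        # label of the term
    definition: str   # short sentence defining the term
    vector: Vector    # embedding of the definition (toy values here)


@dataclass
class Topic:
    terms: List[Term]

    @property
    def vector(self) -> Vector:
        # Topic vector: mean of the vectors of the terms belonging to the topic.
        return mean_vector([t.vector for t in self.terms])

    @property
    def label(self) -> str:
        # Topic label: label of the term closest to the topic vector
        # (closeness measured by cosine similarity, an assumption).
        centre = self.vector
        return max(self.terms, key=lambda t: cosine(t.vector, centre)).label


# Hand-made 2-D vectors and placeholder definitions, for illustration only.
topic = Topic([
    Term("causality", "placeholder definition", [1.0, 1.0]),
    Term("etiologic", "placeholder definition", [1.0, 0.0]),
    Term("noncausal", "placeholder definition", [0.0, 1.0]),
])
print(topic.label)
```

Computing the vector and the label on demand keeps them consistent when terms are added to or removed from a topic, at the cost of recomputing the mean on each access.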

3.1. Data Preprocessing

Preprocessing is the starting point of the tASKE cycle. At each time period $t \in \{0, \dots, T\}$, the model retrieves documents from the period-specific subcorpus $D^{(t)}$.

Documents are first split into document chunks $K^{(t)}$, each of which can fit into the maximum input length of a contextual embedding model. In this case, we employed Sentence-BERT [12], a modification of the original BERT model, which exploits siamese and triplet networks, being able to derive semantically meaningful sentence embeddings in the form of numeric vectors. Such a model is employed in order to extract the semantic features of term definitions and document chunks and map them into the same vector space.

3.2. Zero-Shot Classification

In the zero-shot classification phase, document-topic classification relationships are defined. Given the coexistence of topics and document chunk embeddings in the same vector space, it is possible to perform a zero-shot classification, $f : K^{(t)} \to C^{(\overleftarrow{t})}$, without having the model exposed to training examples. A similarity measure (e.g., cosine similarity) between the embedding vector $\mathbf{k}^{(t)}$ of each document chunk $k^{(t)}$ in $K^{(t)}$ and the embedding vector $\mathbf{c}^{(\overleftarrow{t})}$ of each topic $c^{(\overleftarrow{t})}$ in $C^{(\overleftarrow{t})}$ is computed and, eventually, the two are associated if their similarity reaches a predefined threshold $\alpha$:

$$C^{(\overleftarrow{t})}(k^{(t)}) = \{ c^{(\overleftarrow{t})} \in C^{(\overleftarrow{t})} : \mathrm{sim}(\mathbf{k}^{(t)}, \mathbf{c}^{(\overleftarrow{t})}) \geq \alpha \}$$

Tuning hyperparameter $\alpha$ is crucial since it may remarkably affect the classification output: for example, choosing a high value of $\alpha$ could result in a highly precise classification, despite potentially finding only a small set of document chunks for each topic (low recall).

Finally, classification relationships are stored in ACG by considering documents as the simple concatenation of their chunks, so that a document is labelled with all topics its chunks are labelled with.

For example, the document chunk “[...] graphical representations of causation have been used for at least seventy years, and the modern development of directed acyclic graphs to portray causal systems continues the trend. It is sometimes difficult to understand, however, what it is about these diagrams that is causal [...]” is classified by tASKE with the topic ‘causality’ with a similarity score of 0.652.

3.3. Terminology Enrichment

For each topic $c^{(\overleftarrow{t})}$ in the ACG, tASKE retrieves the set of lemmatized terms $W^{(t)}$ appearing in the subset of document chunks $K^{(t)}$ associated with $c^{(\overleftarrow{t})}$ by a classification relation. The vectors of these terms are placed in the same semantic space, together with $\mathbf{k}$ and $\mathbf{c}$, by retrieving their definition from an external knowledge base, such as WordNet [15], and computing its vector representation $\mathbf{w}$ by the aforementioned embedding model. This approach addresses the problem of sense disambiguation, since it maps distinct senses of polysemic words to different embedding vectors.

For each retrieved term sense, the same similarity measure used for classification is exploited in order to compute the similarity between $\mathbf{w}$ and the vectors representing topics and document chunks. The terms whose sum of similarities is greater than the hyperparameter $\beta$ become candidates for enriching the terminology of the topic $c^{(\overleftarrow{t})}$:

$$W^{(t)}(c^{(\overleftarrow{t})}) = \{ w \in W^{(t)} : \mathrm{sim}(\mathbf{w}, \mathbf{c}^{(\overleftarrow{t})}) + \mathrm{sim}(\mathbf{w}, \bar{\mathbf{k}}^{(t)}) > \beta \}$$

where $\bar{\mathbf{k}}^{(t)}$ is the centre of the embeddings of chunks in $K^{(t)}$.

The set of candidate terms is sorted in descending order according to the similarity score. In addition, one can also define a learning rate $\gamma$, which represents the maximum number of terms that can be associated to a certain topic at each iteration. Applying the bounds $\beta$ and $\gamma$ ensures that, at each iteration, the process of terminology enrichment will include only a small set of terms that are supposed to be meaningful with respect to the topic at hand.

Taking as example the topic mentioned in the previous section, ‘causality’, it has been associated, among others, with the following terms and similarity scores: ‘causality’ (0.773), ‘etiologic’ (0.741), ‘noncausal’ (0.737).

3.4. Topic Formation

Finally, tASKE may generate new topics in a topic formation phase. In this phase a clustering algorithm, such as Affinity Propagation [16], is applied over the embedding vectors $\mathbf{w}$ of the terms $W^{(t)}$ related to each topic $c^{(\overleftarrow{t})}$. According to the results, a different operation is enforced:

• derivation: if new clusters, different from $c^{(\overleftarrow{t})}$, are formed, each of them becomes a new topic, derived from $c^{(\overleftarrow{t})}$, whose label is set equal to the term closer to the cluster center;
• conservation: if no new cluster is formed, the original topic $c^{(\overleftarrow{t})}$ is preserved, represented by the cluster in which the term corresponding to the concept label of $c^{(\overleftarrow{t})}$ is present;
• pruning: if a new cluster $c^{(t)}$ is formed but all its member terms belong also to topic $c^{(\overleftarrow{t})}$, the newer is absorbed by the older one.

In the end, term-topic belonging relationships and topic-topic derivation relationships are stored in the ACG together with document-topic classification relationships defined in the zero-shot classification phase, building up the topic genealogy. Topics $C^{(t)}$ defined in this phase will serve as input for the next iteration.
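To make the stored relations concrete, the following minimal Python sketch keeps the three relation types in plain dictionaries and reads a topic's genealogy by following derivation edges backwards. It is only an illustration under stated assumptions: the `ConceptualGraph` class, its method names, the chunk identifier, and the time index `1` are hypothetical and are not taken from the tASKE implementation.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class ConceptualGraph:
    """Toy store for the classification, belonging, and derivation relations."""

    def __init__(self) -> None:
        self.classification: Dict[str, Set[str]] = defaultdict(set)  # topic -> chunk ids
        self.belonging: Dict[str, Set[str]] = defaultdict(set)       # topic -> term labels
        self.derivation: Dict[str, Tuple[str, int]] = {}             # new topic -> (parent, time)

    def classify(self, chunk_id: str, topic: str) -> None:
        self.classification[topic].add(chunk_id)

    def assign_term(self, term: str, topic: str) -> None:
        self.belonging[topic].add(term)

    def derive(self, new_topic: str, parent: str, time: int) -> None:
        self.derivation[new_topic] = (parent, time)

    def lineage(self, topic: str) -> List[str]:
        """Follow derivation edges back to an initial topic (oldest first).

        Assumes derivations always point to an older topic, so no cycles occur.
        """
        path = [topic]
        while path[-1] in self.derivation:
            path.append(self.derivation[path[-1]][0])
        return path[::-1]


acg = ConceptualGraph()
acg.classify("chunk-1", "causality")  # hypothetical chunk identifier
for term in ("event", "issue", "circumstance"):
    acg.assign_term(term, "event")
# Time index 1 is made up for the example.
acg.derive("event", "causality", 1)
acg.derive("interpretation", "causality", 1)
print(acg.lineage("event"))
```

Storing a single parent per derived topic mirrors the tree-shaped lineage described in the introduction; a topic with several parents would need a different structure.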

Considering the topic ‘causality’, consisting of the following set of terms {‘causality’, ‘etiologic’, ‘noncausal’, ‘event’, ‘issue’, ‘circumstance’, ‘interpretation’, ‘explanandum’}, the tASKE model has formed three topics with the corresponding sets of terms: ‘causality’ = {‘causality’, ‘etiologic’, ‘noncausal’}, ‘event’ = {‘event’, ‘issue’, ‘circumstance’}, ‘interpretation’ = {‘interpretation’, ‘explanandum’}.

4. Case Study and Evaluation

tASKE is here evaluated on a case study on Data Science literature. The evaluation framework has to account for three targets:

1. correctness of extracted topics,
2. correctness of the time of extraction,
3. correctness of topic-topic derivation relationships.

First, a “Data Science in Scopus” corpus (hereafter ScopusDS corpus), made of abstracts of journal papers ranging from January 2000 to December 2021, is constructed. Then the keywords defined by the authors of each paper are exploited to generate a ground truth for all three targets, and our method is evaluated against the ground truth. Finally, we perform a brief qualitative analysis of results, which is complementary to the quantitative evaluation.

4.1. Corpus Construction

The ScopusDS corpus has been retrieved from Elsevier Scopus by downloading publications in the time interval from January 2000 to December 2021 according to selected subject areas that are concerned with the “data science” subject. For each publication, eid, year, title, abstract, document type, and author-assigned keywords have been downloaded. Furthermore, additional metadata are retrieved (e.g., author name and affiliation, journal/conference name, ISSN, publication type). The corpus content is described in Table 1 in terms of considered subject areas and corresponding number of retrieved publications.

Besides the paper abstract, two pieces of metadata were taken into account in the analysis: the publication date and the list of keywords provided by the author(s). We selected only documents of type “article” that are accompanied by at least 3 keywords and are at least 30 words long, finally amounting to 766,867 documents. Figure 3 shows the number of documents and keywords per year.

4.2. Definition of a Ground Truth

Keywords provided by the authors of each paper are natural candidates to form a ground truth for topic modelling of scientific papers. Exact matching between keywords and extracted topics, however, would yield no significant result, because topics are defined as sets of terms whereas keywords are strings, and author-assigned keywords may not be linked to terms in the external knowledge base employed in tASKE. Hence we define an alternative evaluation methodology which makes use of a non-contextual word embedding model to compute the similarity between keywords and extracted topics.

For target (1), we compare clusters extracted by tASKE with the set of keywords at each time $t$.

For target (2), we are interested in knowing whether the topics were extracted at the correct time, so we compare clusters extracted at each time $t$ with the entire set of keywords. A comparison of the resulting metrics with the ones obtained for target (1) provides an indicator of the timeliness of tASKE extraction: if a topic $c$, extracted by tASKE at time $t$, is more similar to keywords from time $t' \neq t$ than to the ones from $t$, then $c$ can be deemed more appropriate to describe the subcorpus at time $t'$ and was extracted either “too soon” or “too late”.

Defining target (3) is more complicated, since no genealogical structure is inherently defined on paper keywords. We must first define a set of heuristics to derive a ground truth from the keyword lists assigned to documents. Specifically, we say that a subsequent keyword $y'$ is derived from an antecedent keyword $y$ at time $t$ if:

• $y$ was associated to any document at any time $\overleftarrow{t} < t$;
• $y'$ has never been associated to any document at any $\overleftarrow{t} < t$;
• the number of keyword co-occurrences at $t$, $n_t$, is such that $n_t(y, y') \geq 1$.
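The three conditions above can be sketched in a few lines of Python. This is an illustrative reading of the heuristics rather than the authors' code: the function name `derive_ground_truth`, the `(year, keywords)` input format, and the toy documents are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def derive_ground_truth(
    documents: Iterable[Tuple[int, List[str]]]
) -> Set[Tuple[str, str, int]]:
    """Return (antecedent, subsequent, year) triples from keyword lists."""
    by_year: Dict[int, List[Set[str]]] = defaultdict(list)
    for year, keywords in documents:
        by_year[year].append(set(keywords))

    seen: Set[str] = set()  # keywords used in any strictly earlier year
    derivations: Set[Tuple[str, str, int]] = set()
    for year in sorted(by_year):
        for keywords in by_year[year]:
            antecedents = keywords & seen   # already used before this year
            newcomers = keywords - seen     # never used before this year
            # Sharing a document at this year means co-occurrence count >= 1.
            for old in antecedents:
                for new in newcomers:
                    derivations.add((old, new, year))
        # Update only after the whole year is processed, so that keywords
        # first appearing this year do not count as antecedents yet.
        for keywords in by_year[year]:
            seen |= keywords
    return derivations


# Toy keyword lists, invented for illustration.
toy_documents = [
    (2000, ["internet"]),
    (2001, ["internet", "search engine"]),
    (2001, ["search engine", "ontology"]),
    (2002, ["ontology", "semantic web"]),
]
ground_truth = derive_ground_truth(toy_documents)
print(sorted(ground_truth))
```

In the toy data, ‘ontology’ is not derived from ‘search engine’ because both first appear in 2001, so the first condition fails for the would-be antecedent.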

4.3. Quantitative Evaluation

We run tASKE on the ScopusDS corpus by selecting years as time units in which the corpus is split. Since tASKE requires to be initialized with a set of input topics $C^{(0)} = \{c_1^{(0)}, \dots, c_n^{(0)}\}$, we exclude papers of year 2000 from the evaluation and use the set of keywords assigned to them to derive $C^{(0)}$. This set of terms $W^{(0)}$ is first filtered to retain only terms that appear in WordNet, i.e. the knowledge base used for this evaluation. To avoid the injection of spurious topics into the system, $W^{(0)}$ is further filtered in order to keep only monosemic terms, i.e. terms that are linked to a single WordNet synset, and the 100 with the highest frequency are sampled. In order to retrieve initial topics from this set of terms, we apply Affinity Propagation [16], eventually obtaining $n = 20$ topic clusters, mostly related to mathematics (e.g. regression analysis = {regression analysis, linear regression, multiple regression}) and computer science (internet = {internet, information system, bandwidth, world wide web, electronic mail}), but also to domains of application (air pollution = {air pollution, air transport}).

As for hyperparameters, we set thresholds $\alpha$ and $\beta$ equal to one another so as to have a single learning rate, and since we found the system to be effective for $\alpha \leq 0.35$, the experiments were conducted with $\alpha = \beta = 0.35$ to achieve efficiency in terms of computation time.

To assess the closeness of topics retrieved by tASKE to the ground truth, we train a Word2Vec model on a pseudo-corpus whose documents are a concatenation of document chunks and their ground truth keywords. By exploiting this model, as was done for instance in [17], it is possible to embed keywords and extracted terms in the same vector space. For each year, we define topic embeddings again as the centroids of the embeddings of topic-related terms, which may change from year to year even for the same topic, and we compute cosine similarity between the resulting vectors and the set of keywords. This is done by single linkage, i.e. finding the closest keyword for each topic embedding. Figure 4 reports the mean and the standard deviation of the results for each year.

Outcomes displayed in Figure 4 are promising, with a mean similarity going from 86.99% in 2001 ($\sigma = 9.98\%$) to 80.20% in 2021 ($\sigma = 13.47\%$), touching a minimum equal to 77.12% ($\sigma = 10.98\%$) for year 2006. The figure proves the effectiveness of tASKE not only for target (1), i.e. to discover topics in a corpus, but also for target (2), i.e. to discover them at the proper time. Indeed, at each year, matching with keywords from other years yields better similarities only for few topics per year, as is proven by the overlapping of the similarity distributions.

In the same way as we did for each topic, we can measure the maximal similarity between each keyword and the set of topics in each year, which may be considered a proxy for recall. Resulting similarity distributions, going from 34.77% mean ($\sigma = 11.28\%$) in 2001 to 65.81% mean ($\sigma = 12.76\%$) in 2021, are displayed in Figure 5. Although maximising recall was not our main interest, we found that the system gets closer and closer to finding at least a topic for each keyword.

Figure 5: Distributions of similarities between keywords from each year and the closest topic from the same year (blue bars), or the closest topic from all years (orange bars).

As for target (3), we experimented with the same evaluation [...] these and the ones that accounted only for direct relationships is small enough to assume tASKE can find short-term derivations, but has more difficulty in managing long-term ones, likely due to cumulative errors.

4.4. Qualitative Analysis

[...] the topics that marked recent developments or applications in the Data Science domain, such as: ‘face recognition’ (2014, from ‘biometric identification’) (as shown in Figure 6), ‘speech production’ (2004, from ‘wavelet’), ‘search engine’ (2004, from ‘internet’), ‘ontology’ (2006, from ‘knowledge’), ‘clustering’ (2006, from ‘class’), ‘natural language processor’ (2008, from ‘internet’), ‘graphical user interface’ (2008, from ‘internet’), ‘cryptanalytic’ (2010, from ‘cryptography’), ‘flight control’ (2012, from ‘flight simulator’), ‘machine readable’ (2017, from ‘internet’), ‘automatic face recognition’ (2016, from ‘face recognition’ through ‘identity verification’). Another category of topics is the one that includes topics of interest but provides a spurious derivation, e.g. ‘neural network’ (2006), here derived from ‘internet’, or ‘cryptography’ (2008), that descends from ‘air pollution’. An even clearer example of the boundaries the external knowledge base imposes on tASKE is given by the topic ‘percolation’, which may refer to ‘clique percolation technique’ in the documents but is here linked to ‘air pollution’ due to the absence of any non-physical sense of the term ‘percolation’ from WordNet.

We also observed that most extracted topics are related to domains of application, e.g. medicine, physics, chemistry, social sciences, etc. Including these topics in a hierarchical class structure may prove beneficial to simplify visualization of the topic genealogy.

5. Concluding Remarks

Starting from the increasingly pressing need to understand the evolution of ideas and research themes in scientific literature, in this work we have presented tASKE, a method for identifying topics in a diachronic corpus of scientific articles. Time in tASKE is a crucial aspect, as the goal is not only to identify the topics in their right temporal collocation, but also to understand how a topic can derive from previous topics, in order to reconstruct the genealogy of the topics in time. tASKE makes it possible to achieve these objectives with an unsupervised approach, i.e., without the need to resort to large and complex pre-annotated datasets. The experimental results, obtained on a corpus of real scientific publications covering a period of 21 years, show how tASKE is able to identify the topics deemed relevant by the authors of the papers and expressed by means of thematic keywords. In particular, the topics identified by tASKE are not only adequate, but also placed in the correct time period and related to each other in a genealogy that describes their evolution. Our current and future work on tASKE is aimed at three main goals: i) introduce an adaptive learning rate, with the aim of controlling the number of new topics discovered by tASKE for each time period according not only to the topic relevance but also to the capability of each topic to potentially induce the discovery of new topics in future iterations; ii) make tASKE independent from external knowledge bases, exploiting contextual embeddings, so as to avoid restricting a priori the vocabulary of terms that can be extracted; iii) perform further evaluations both by comparing tASKE with other temporal topic modeling methods and by assessing the quality of topics and their genealogy through the evaluation of domain experts.

References

[1] A. Ferrara, S. Picascia, D. Riva, Context-aware knowledge extraction from legal documents through zero-shot classification, in: R. Guizzardi, B. Neumayr (Eds.), Advances in Conceptual Modeling, Springer International Publishing, Cham, 2022, pp. 81-90.

[2] T. K. Landauer, P. W. Foltz, D. Laham, An introduction to latent semantic analysis, Discourse Processes 25 (1998) 259-284. doi:10.1080/01638539809545028.

[3] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993-1022.

[4] M. Rabinovich, D. Blei, The inverse regression topic model, in: E. P. Xing, T. Jebara (Eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, PMLR, Beijing, China, 2014, pp. 199-207.

[5] A. Gruber, Y. Weiss, M. Rosen-Zvi, Hidden topic markov models, in: M. Meila, X. Shen (Eds.), Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, volume 2 of Proceedings of Machine Learning Research, PMLR, San Juan, Puerto Rico, 2007, pp. 163-170.

[6] D. M. Blei, J. D. Lafferty, Dynamic topic models, in: Proceedings of the 23rd International Conference on Machine Learning, ICML '06, Association for Computing Machinery, New York, NY, USA, 2006, pp. 113-120. doi:10.1145/1143844.1143859.

[7] L. Sun, Y. Yin, Discovering themes and trends in transportation research using topic modeling, Transportation Research Part C: Emerging Technologies 77 (2017) 49-66. doi:10.1016/j.trc.2017.01.013.

[8] T. M. Abuhay, Y. G. Nigatie, S. V. Kovalchuk, Towards predicting trend of scientific research topics using topic modeling, Procedia Computer Science 136 (2018) 304-310. doi:10.1016/j.procs.2018.08.284. 7th International Young Scientists Conference on Computational Science, YSC2018, 02-06 July 2018, Heraklion, Greece.

[9] A. B. Dieng, F. J. R. Ruiz, D. M. Blei, The dynamic embedded topic model, CoRR abs/1907.05545 (2019). URL: http://arxiv.org/abs/1907.05545. arXiv:1907.05545.

[10] Gao, Huang, Dong, Liang, J. Wu, Semantic-enhanced topic evolution analysis: a combination of the dynamic topic model and word2vec, Scientometrics 127 (2022) 1543-1563. URL: https://doi.org/10.1007/s11192-022-04275-z. doi:10.1007/s11192-022-04275-z.

[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).

[12] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, CoRR abs/1908.10084 (2019). URL: http://arxiv.org/abs/1908.10084. arXiv:1908.10084.

[13] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, CoRR abs/1910.13461 (2019). URL: http://arxiv.org/abs/1910.13461. arXiv:1910.13461.

[14] M.-W. Chang, L.-A. Ratinov, D. Roth, V. Srikumar, Importance of semantic representation: Dataless classification, in: AAAI, volume 2, 2008, pp. 830-835.

[15] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Language, Speech, and Communication, MIT Press, Cambridge, MA, 1998.

[16] D. Dueck, Affinity propagation: clustering data by passing messages, University of Toronto, Toronto, ON, Canada, 2009.

[17] F. Role, S. Morbieu, M. Nadif, Unsupervised evaluation of text co-clustering algorithms using neural word embeddings, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1827-1830. URL: https://doi.org/10.1145/3269206.3269282. doi:10.1145/3269206.3269282.