LLM-Resilient Bibliometrics: Factual Consistency Through Entity Triplet Extraction

Alexander Sternfeld¹, Andrei Kucharavy², Dimitri Percia David², Alain Mermoud¹ and Julian Jang-Jaccard¹

¹ Cyber-Defence Campus, armasuisse Science and Technology, Thun, Switzerland
² Institute of Entrepreneurship & Management, HES-SO Valais-Wallis

Joint Workshop of the 5th Extraction and Evaluation of Knowledge Entities from Scientific Documents and the 4th AI + Informetrics (EEKE-AII2024), April 23-24, 2024, Changchun, China and Online. Contact: alex1.sternfeld@gmail.com (A. Sternfeld). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The increase in power and availability of Large Language Models (LLMs) since late 2022 has led to increased concerns about their usage to automate academic paper mills. This, in turn, poses a threat to bibliometrics-based technology monitoring and forecasting in rapidly moving fields. We propose to address this issue by leveraging semantic entity triplets. Specifically, we extract factual statements from scientific papers and represent them as (subject, predicate, object) triplets before validating the factual consistency of statements within and between scientific papers. This approach heavily penalizes blind usage of stochastic text generators such as LLMs while not penalizing authors who used LLMs solely to improve the readability of their paper. Here, we present a pipeline to extract such triplets and compare them. While our pipeline is promising and sensitive enough to detect inconsistencies between papers from different domains, the intra-paper entity reference resolution needs to be improved to ensure that triplets are more specific. We believe that our pipeline will be useful to the general research community working on the factual consistency of scientific texts.

Keywords: Bibliometrics, Entity Extraction, Machine Learning, Technological Forecasting, Quantum Computing

1. Introduction

For firms to make informed investment decisions, sound forecasts on the development of technologies are necessary. One prominent method for technology forecasting is bibliometrics, which uses the information in scholarly books and journals [1]. Modern bibliometric methods leverage the increase in available data by applying machine-learning methods. For example, Percia David et al. (2023) analyse arXiv pre-prints to evaluate the security development of information technologies [2].

While scientific publications are thus increasingly more important for technology forecasting, the quality of the papers must be evaluated critically. The publish-or-perish pressure has led to a record growth in the number of scientific publications per author, often with minimal peer review [3, 4]. In such a setting, if LLMs can generate text that sufficiently resembles a scientific article to pass for one on a cursory reading, they are likely to be used to generate scores of articles. Unfortunately, this eventuality is already likely to be a reality, given that Majovsky et al. (2023) showed that ChatGPT can create an authentic-looking neurosurgery scientific article [5].

Recently, there has been a growing interest in identifying text generated by LLMs. As early as 2019, Zellers et al. showed that a GPT-2-like LLM, Grover, could detect its own output. However, recent research suggests that, in general, LLM detectors either do not work or are easy to evade [6, 7]. Overall, for a minimally competent attacker who wants to evade detection, LLM detectors cannot be relied upon.

Unfortunately, the situation is serious enough for some of the most reputable providers of proxies of the impact of scientific articles to have modified their algorithms to only consider publications adhering to stringent criteria [8]. Due to the velocity of innovation and the reliance on preprint repositories, such an approach is not adapted to technology monitoring in the domains adjacent to cyber-security and machine learning. Because of these factors, we investigate whether factual consistency could be used for LLM-resilient bibliometrics instead.

Specifically, we represent facts as entity triplets of the form (subject, predicate, object) that are extracted from the claims of the paper. The entity triplet plays a crucial role, as it serves as a proxy to understand the primary claims of the paper and subsequently validates factual consistency compared to other works in the domain. Our paper describes the workflow involved in entity triplet extraction and provides an overview of our initial findings regarding the effectiveness of the entity triplets and their relation to the number of clusters generated around the subject. The code of this project is available at https://github.com/technometrics-lab/0-Factual_Consistency_Through_Entity_Triplets, at commit c7b01e4.
2. Related work

Previous approaches for claim extraction can be categorized into heuristic and machine learning methods. The advantage of an approach based on heuristics is that no training data is required and the computational cost tends to be low. However, machine learning approaches can capture more complex patterns, leading to the extraction of triplets of higher quality. Such methods have been developed most prominently in the biomedical domain. For example, Li et al. (2021) use BiLSTMs to extract the factual statements presented in papers [9]. Although less labeled data is available, there has been work focusing on claim extraction from papers in other domains. For instance, Binder et al. (2022) use BiLSTMs for argumentative discourse unit recognition and argumentative relation extraction [10].

The majority of existing triplet extraction models use supervised training. Two notable examples are RECON and SpERT, which require labeled training data [11, 12]. The disadvantage of supervised methods is the need for training data and the dependency on the relations that are present in the dataset. In contrast, unsupervised methods do not need training data and use either heuristics or machine learning methods to extract triplets. One example of such a model is Stanford OpenIE, which extracts relational tuples without the need to specify a schema in advance. However, it has been shown that OpenIE tends to extract too aggressively, resulting in the presence of non-useful relations [13]. We contribute by providing a method that extracts triplets from scientific papers with high precision, while the user only needs to specify the desired research categories.

3. Methodology

The process of extracting informative triplets from raw PDFs consists of four main stages. First, PDFs are converted to text files, after which they are preprocessed to remove word breaks and citations, expand abbreviations and lemmatize words. Then, we extract the sentences from the paper that convey its core ideas, which we refer to as claims. From these claims, we extract subject-predicate-object triplets. The last step is to process these triplets further so that they can be used in a comparative analysis. The entire pipeline is displayed in Figure 1. In the following subsections, we elaborate on each of the steps. While we focus on arXiv, the approach is generalizable to all scientific PDFs.

Figure 1: The complete pipeline for extracting entity triplets from the raw articles from the arXiv archive.

3.1. Preprocessing

To convert the PDFs to text we use the PyMuPDF library [14]. We then further clean the text by removing bracketed citations and merging words that were split due to line breaks. We then expand abbreviations using a rule-based algorithm introduced by Schwartz and Hearst (2003) [15]. In Appendix 6.1 we show that the Schwartz-Hearst algorithm outperforms the scispaCy [16] and NLPre [17] abbreviation detection methods, which are built on spaCy. Moreover, due to its rule-based nature, the algorithm is relatively fast. Table 1 shows the transition from a raw sentence to a preprocessed result.

Table 1: Illustration of the preprocessing steps.

Uncleaned    | Society has been affected by artificial intelligence (AI) and has become more rel- iant on AI products.
Preprocessed | Society has been affected by artificial intelligence and has become more reliant on artificial intelligent products.
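To make these first cleaning steps concrete, below is a minimal sketch using PyMuPDF [14] and regular expressions. The file name and the exact regex patterns are illustrative assumptions rather than the patterns of our pipeline; abbreviation expansion with the Schwartz-Hearst algorithm [15] and lemmatization would follow as separate steps.

```python
import re

import fitz  # PyMuPDF [14]

def pdf_to_clean_text(path: str) -> str:
    """Convert a PDF to text, merge hyphenated line breaks, drop citations."""
    doc = fitz.open(path)
    text = "\n".join(page.get_text() for page in doc)
    # Merge words split across line breaks, e.g. "rel-\niant" -> "reliant"
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Remove bracketed citations such as [1] or [3, 4]
    text = re.sub(r"\[\d+(?:\s*,\s*\d+)*\]", "", text)
    # Collapse the remaining whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

if __name__ == "__main__":
    print(pdf_to_clean_text("paper.pdf")[:300])  # hypothetical input file
```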
3.2. Claim and triplet extraction

After preprocessing the text, we identify the sentences that convey the authors' claims. Specifically, we use the ClaimDistiller framework developed by Wei et al. (2023) [18]. In their work, both CNNs and BiLSTMs are trained for claim extraction on the PubMed-RCT and SciARK datasets [19, 20]. Although the usage of supervised contrastive training improves the performance of the model, it causes a computational overhead. We therefore use the BiLSTM without supervised contrastive learning to strike a balance between performance and computational efficiency. We extract the claims from the papers so that the subsequent triplet extraction has to be performed on fewer sentences.

Next, we reduce the claims to (subject, predicate, object) triplets, analogous to the Resource Description Framework (RDF) format commonly used in the representation of OWL ontologies. We choose this representation as it facilitates the comparison of claims across papers. We use the Python library textacy, which is built on spaCy [21] and has a built-in extraction method that does not require the specification of relations in advance.
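As a sketch of this step, textacy's schema-free SVO extraction can be applied to a claim sentence as follows. The claim sentence is a constructed example; in the pipeline, only sentences retained by ClaimDistiller would be processed.

```python
import spacy
import textacy.extract

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# A constructed example of a sentence surviving claim extraction
claim = "Reinforcement learning inherits training instability from the sampling algorithm."
doc = nlp(claim)

# textacy's built-in SVO extraction requires no pre-specified relation schema
for triple in textacy.extract.subject_verb_object_triples(doc):
    subject = " ".join(t.text for t in triple.subject)
    predicate = " ".join(t.text for t in triple.verb)
    obj = " ".join(t.text for t in triple.object)
    print((subject, predicate, obj))
```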
3.3. Post-processing of triplets

Our goal is to compare triplets across papers. Therefore, we further process the triplets so that we can pair triplets from different papers that refer to the same subject. The following steps are applied:

1. Lowercase all words in the triplet.
2. Remove triplets where either the subject or object contains more than 6 words.
3. Remove stopwords from the triplets, based on the list included in NLTK.
4. Remove any character that is not text.
5. Lemmatize verbs and nouns in the triplet.
6. Remove words containing less than 3 characters.
7. Filter the triplets non-specific to scientific work by comparison with a general book corpus.
8. Filter the triplets characteristic of scientific works in general by comparison with arXiv articles from different categories.

In the second step, we choose this cutoff as we expect that phrases of over 6 words may contain nuances that cannot be captured in a simple subject-predicate-object relation. Words with less than 3 characters are removed, as we observed that such words were often noise. Moreover, as abbreviations are expanded, we expect all informative terms to be at least of length 3.

We use the general-purpose Gutenberg book corpus to filter the triplets that carry little information. We define $f_{b,i}$ and $f_{p,i}$ as the number of documents in the book corpus and the paper corpus, respectively, in which term $i$ appears at least 5 times, and $N_b$ and $N_p$ as the total number of documents in each corpus. We then assign a score $s_i$ to each term $i$:

$$
s_i =
\begin{cases}
-\infty & \text{if } f_{p,i} < 10,\\
\log\left(\frac{f_{p,i}}{N_p}\right) - \log\left(\frac{f_{b,i}}{N_b}\right) & \text{if } f_{p,i} \geq 10 \text{ and } f_{b,i} > 0,\\
\infty & \text{if } f_{b,i} = 0 \text{ and } f_{p,i} \geq 10.
\end{cases}
$$

Terms that are not present in at least 10 papers thus get a score of $-\infty$. If the term is present in at least 10 papers, the score increases as the frequency of the term in the book corpus decreases. We keep the triplets with subjects in the top 10% of the term scores.

In the last step, we aim to keep only the triplets that carry domain-specific information. Therefore, we sample a random subset of 1000 arXiv papers from December 2023 from different categories than our target papers. We then only keep the triplets whose subjects are present in a maximum of 15 papers.
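The book-corpus term score above can be implemented directly. In the sketch below, n_papers and n_books stand for $N_p$ and $N_b$, and the example counts are invented for illustration.

```python
import math

def term_score(f_p: int, f_b: int, n_papers: int, n_books: int) -> float:
    """Score s_i for term i, where f_p and f_b count the documents in the
    paper and book corpora in which the term appears at least 5 times."""
    if f_p < 10:
        return -math.inf  # too rare in the paper corpus to be kept
    if f_b == 0:
        return math.inf   # absent from the book corpus: maximally specific
    return math.log(f_p / n_papers) - math.log(f_b / n_books)

# Invented counts: a term in 120 of 4225 papers and in 3 of 3000 books
print(term_score(f_p=120, f_b=3, n_papers=4225, n_books=3000))
```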
3.4. Clustering

As we use the extracted triplets to compare papers, it is necessary to cluster them based on the subject and object. Both SciBERT [22] encodings and spaCy [21] embeddings were considered. Based on visual inspection, the resulting clusters are most coherent when using SciBERT, a language model based on BERT [23] that is pretrained on a large multi-domain corpus of scientific publications. After encoding both the subjects and the objects, we apply an agglomerative hierarchical clustering algorithm from SciPy [24], which compares the average distance between clusters. By visually inspecting the dendrograms, we set a cutoff to obtain the final clusters. Figure 5 in the appendix shows an example of the dendrograms for a subset of the subjects and objects. The threshold is chosen at the height where the distance between clusters begins to noticeably increase.
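A minimal sketch of this clustering step follows, assuming the allenai/scibert_scivocab_uncased checkpoint from Hugging Face and mean pooling over the last hidden states (the paper does not fix a pooling strategy); the subject list is invented and the 0.10 cutoff follows Table 5 in the appendix.

```python
import numpy as np
import torch
from scipy.cluster.hierarchy import fcluster, linkage
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

def encode(phrases: list[str]) -> np.ndarray:
    """Mean-pooled SciBERT encodings for triplet subjects or objects."""
    batch = tokenizer(phrases, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

subjects = ["reinforcement learning", "reward model", "alice protocol"]
# Average-linkage agglomerative clustering with a dendrogram cutoff of 0.10
Z = linkage(encode(subjects), method="average", metric="cosine")
labels = fcluster(Z, t=0.10, criterion="distance")
print(dict(zip(subjects, labels)))
```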
3.5. Triplet comparison

After clustering, we compare the triplets within the same cluster based on their predicates. We take a first step in this direction by analysing embedding inversions, as simple vector arithmetic can provide valuable insights into word relationships, such as negation or gender variants [25]. Specifically, we subtract the spaCy embeddings of the predicates and study the tokens closest to the resulting vector. Although SciBERT encodings likely contain more semantic information, the input sequence is embedded along with its context, hence a SciBERT encoding cannot easily be inverted back to a token.
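A sketch of this inversion with spaCy static vectors is shown below; the en_core_web_md model is our assumption, since the paper does not name the specific spaCy model used.

```python
import numpy as np
import spacy

# A spaCy model with static word vectors is required (assumed: en_core_web_md)
nlp = spacy.load("en_core_web_md")

def invert_difference(verb_a: str, verb_b: str, n: int = 10) -> list[str]:
    """Return the n vocabulary tokens closest to the difference of the
    two predicate embeddings (cf. Table 6 in the appendix)."""
    diff = nlp.vocab[verb_a].vector - nlp.vocab[verb_b].vector
    keys, _, _ = nlp.vocab.vectors.most_similar(np.asarray([diff]), n=n)
    return [nlp.vocab.strings[key] for key in keys[0]]

print(invert_difference("illustrates", "mimic"))
```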
4. Results

4.1. Data

We consider two different datasets to evaluate our method. First, to leverage in-house expertise in the domain of computer science and natural language processing (NLP), we focus on publications relevant to that domain and retrieve data from the arXiv categories cs.AI, cs.CL and cs.LG. Specifically, we retrieve all papers from December 2023, which amounts to a total of 4225 research articles.

Second, for a quantitative analysis of the usefulness of triplets for factual consistency evaluation, we consider two surveys. To validate our approach in an independent domain, we considered both a survey on LLMs and a survey on quantum computing. Specifically, we analyse a survey on LLMs by Zhao et al. (2023) [27] and a survey on quantum computing technologies by Gyongyosi and Imre (2019) [26]. We construct a dataset comprising these two surveys and the arXiv preprints cited by them. We limit ourselves to the papers for which an arXiv ID was provided in the references of the survey, leading to a total of 188 papers. In the subsequent sections, we refer to the papers related to the LLM survey as the LLM data and to the papers related to the quantum computing survey as the quantum data.

4.2. Cluster analysis

4.2.1. CS articles December 2023

After preprocessing the articles, we extract the triplets. For the triplet extraction, we used the hyperparameters displayed in Table 5 in the appendix. In total, 79,986 triplets were extracted from the research articles. We cluster these triplets based on both the subject and object embeddings, which resulted in 37,076 clusters. Figure 3 in the appendix shows the distribution of the number of triplets per cluster. It shows that most clusters contain less than 25 triplets, but that there are outliers that contain over 200 triplets.

4.2.2. LLM and quantum computing surveys

For a more in-depth analysis, we consider the triplets extracted from the LLM and quantum computing surveys and their cited papers. In total, 1895 triplets were extracted from the 188 papers. We cluster these triplets based on the subjects so we can evaluate the differences between the objects in a cluster. Figure 4 in the appendix shows that larger clusters often have triplets from multiple categories, whereas small clusters tend to have triplets from only one category.

Next, we make pairwise comparisons between the object embeddings within a cluster. Figure 2 shows the pairwise distances (L2 norm) between objects for each cluster size. We find that for clusters below size 8, the distance between objects from the LLM data and the quantum data is larger than the distances between objects within a category. For larger cluster sizes, this effect disappears. This indicates that for smaller clusters with triplets from both categories, the objects are more diverse. Furthermore, we see that for clusters with triplets from one category, the distance between objects increases for larger sizes. This confirms that larger clusters are more domain-agnostic and contain more varied objects.

Figure 2: The average distance (L2 norm) between objects of triplets for different cluster sizes.

Table 2 shows manually selected clusters of sizes 2, 5 and 8. The column with mixed data clusters shows clusters that contain triplets from both the LLM data and the quantum data. For smaller clusters, the triplets within a dataset tend to be similar and vary only slightly. In contrast, the objects differ more for the mixed data clusters, as they are domain-specific. On the other hand, the results suggest that larger clusters contain more domain-agnostic subjects and objects, such as lab. Consequently, the distance between objects from different datasets differs less from the distance between objects from the same data. Further manual inspection supports this hypothesis, with large clusters containing subjects such as appendix and conclusion.

Table 2: Manually selected clusters of sizes 2, 5 and 8. The mixed data clusters contain triplets from both the quantum data and the LLM data.

Cluster size 2:
- Pure LLM cluster: (expert knowledge, enhance, data utility); (expert knowledge, employed, data utility)
- Pure quantum cluster: (bundle map, determine, representation); (tangent map, analyse, map)
- Mixed data cluster: (minimizer, satisfies, argmin); (minimizer, given, tetrahedron)

Cluster size 5:
- Pure LLM cluster: (reinforcement learning, requires, language model); (reinforcement learning, based, sampling algorithm); (reinforcement learning, learns, reward model); (reinforcement learning, offer, evaluation); (reinforcement learning, inherits, drawback training instability)
- Pure quantum cluster: (complement, express, failure); (complement, contains, function); (complement, must contain, part); (complement, not capture, complement); (complement, not represented, set)
- Mixed data cluster: (time duration, cover, feature); (spn value, shown, symbol); (duration value, have, spacing); (dene value, introduce, reduction relation); (dene value, dene, progress relation)

Cluster size 8:
- Pure LLM cluster: (neuron, receives, impulse); (neuron, displayed, difference); (neuron, reaching, average rate); (neuron, not, impact); (neuron, coupled, weight wji); (neuron, changed, activity); (neuron, described, pair); (neuron, sends, impulse); (neuron, make, decision)
- Pure quantum cluster: (alice protocol, avoids, computing requirement); (alice protocol, ha, difference); (alice protocol, requires, bob); (alice protocol, consumes, network bandwith); (alice protocol, reduce, quantum computation); (alice protocol, ha, advantage); (alice protocol, provides, fault tolerance); (alice protocol, preserve, tolerance ability)
- Mixed data cluster: (lab, improved, learning); (lab, offered, value); (lab, had, benefit); (lab, offer, introduction); (lab, present, environment); (lab, improve, performance); (lab, demonstrates, concept); (lab, outline, qml solution)

Overall, we argue that this means that triplets as extracted by our pipeline can be used as proxies for factual consistency, but that additional refinement is needed to avoid extracting overly generic statements.

4.3. Factual consistency

4.3.1. Predicate comparisons

As a first step in evaluating the factual consistency between papers, triplets in the same cluster are compared based on their predicates. Specifically, two triplets are considered consistent when the predicates are synonyms, hypernyms or hyponyms. If the predicates are antonyms, the triplets are considered inconsistent. The WordNet and VerbOcean databases are used to label the pairs of predicates [28, 29].

Table 3: Number of consistent and inconsistent triplet pairs.

              | Inconsistent triplet pairs | Consistent triplet pairs
Across papers | 2044                       | 434,362
Within papers | 175                        | 4549

Table 3 shows that the majority of triplet pairs within a cluster are consistent. Both within and across papers, inconsistent triplet pairs are present. However, manual inspection shows that, in some cases, an inconsistent pair of triplets can be caused by differing contexts. For example, the pair (initialization, impede, optimization process) is marked as inconsistent with (initialization, accelerate, optimization process). This again shows that refinement is needed to extract more specific triplets.
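The WordNet side of this labeling can be sketched with NLTK as follows. Restricting the search to direct hypernyms, hyponyms and antonyms is our simplification, and the complementary VerbOcean lookup [29] is not shown.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def label_predicate_pair(verb_a: str, verb_b: str) -> str:
    """Label two predicates as consistent (synonym/hypernym/hyponym),
    inconsistent (antonym), or unknown, following Section 4.3.1."""
    synsets_b = set(wn.synsets(verb_b, pos=wn.VERB))
    for syn in wn.synsets(verb_a, pos=wn.VERB):
        if syn in synsets_b:
            return "consistent"  # shared synset, i.e. synonyms
        if synsets_b & set(syn.hypernyms()) or synsets_b & set(syn.hyponyms()):
            return "consistent"  # direct hypernym/hyponym relation
        for lemma in syn.lemmas():
            if any(ant.synset() in synsets_b for ant in lemma.antonyms()):
                return "inconsistent"  # direct antonym relation
    return "unknown"

print(label_predicate_pair("increase", "decrease"))  # expected: inconsistent
```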
4.3.2. Embedding inversion

For a more qualitative assessment, we invert the differences of the embeddings of predicates from the same cluster. Table 6 in the appendix shows a manual selection of 8 of these embedding inversions. The results show that embedding inversion does not provide informative results in this context. In general, we do not find a noticeable difference between embedding inversions of predicates that are consistent and those that are inconsistent.

5. Conclusion

This paper presents an unsupervised method for the extraction of triplets from scientific work. Whereas previous methods either require labeled training data or the pre-specification of entity relations, we allow for entity triplet extraction through domain specification alone.

The results show that the extracted triplets accurately reflect the domain of the corresponding scientific work. When we cluster triplets based on the subject, we find that smaller clusters tend to be domain-specific. In contrast, larger clusters are more generic and often contain triplets from different domains. We interpret this as extracted triplets being suitable for evaluating factual consistency, but requiring further refinement for a more specific extraction. We believe this is due to insufficient resolution of excessively general nouns (e.g. lab, conclusion). To compare the triplets, an embedding inversion was implemented on the difference of the verb embeddings for similar triplets. Our findings show that embedding inversion does not allow us to discriminate between consistent and inconsistent triplets.

Our results suggest that the next steps towards using the extracted triplets for the development of LLM-resilient proxies should focus on better filtering of domain-agnostic subjects, so that the triplets are informative about factual consistency. Then, a semantic network can be built based on the similarities between the triplets for the entirety of the scientific publications in a domain of interest. By leveraging this network, we can identify papers that are factually inconsistent or excessively consistent and use the remainder of the corpus for a bibliometric analysis.
References

[1] Y. Zhang, A. L. Porter, S. Cunningham, D. Chiavetta, N. Newman, Parallel or intersecting lines? Intelligent bibliometrics for investigating the involvement of data science in policy analysis, IEEE Transactions on Engineering Management 68 (2020) 1259–1271.
[2] D. Percia David, L. Maréchal, W. Lacube, S. Gillard, M. Tsesmelis, T. Maillart, A. Mermoud, Measuring security development in information technologies: A scientometric framework using arXiv e-prints, Technological Forecasting and Social Change 188 (2023) 122316. URL: https://www.sciencedirect.com/science/article/pii/S004016252300001X. doi:10.1016/j.techfore.2023.122316.
[3] M. A. Hanson, P. G. Barreiro, P. Crosetto, D. Brockington, The strain on scientific publishing, CoRR abs/2309.15884 (2023). URL: https://doi.org/10.48550/arXiv.2309.15884. doi:10.48550/ARXIV.2309.15884. arXiv:2309.15884.
[4] J. P. A. Ioannidis, R. Klavans, K. W. Boyack, Thousands of scientists publish a paper every five days, Nature 561 (2018) 167–169. URL: https://api.semanticscholar.org/CorpusID:52198631.
[5] M. Majovsky, M. Černý, M. Kasal, M. Komarc, D. Netuka, Artificial intelligence can generate fraudulent but authentic-looking scientific medical articles: Pandora's box has been opened, Journal of Medical Internet Research 25 (2023). doi:10.2196/46924.
[6] C. Chen, K. Shu, Can LLM-generated misinformation be detected?, 2023. arXiv:2309.13788.
[7] D. S. G. Henrique, A. Kucharavy, R. Guerraoui, Stochastic parrots looking for stochastic parrots: LLMs are easy to fine-tune and hard to detect with other LLMs, 2023. arXiv:2304.08968.
[8] Clarivate, 2024 journal citation reports, https://clarivate.com/blog/2024-journal-citation-reports-changes-in-journal-impact-factor-category-rankings-to-enhance-transparency-and-inclusivity/, 2024. Accessed: 2024-02-29.
[9] X. Li, G. A. Burns, N. Peng, Scientific discourse tagging for evidence extraction, in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19-23, 2021, Association for Computational Linguistics, 2021, pp. 2550–2562. URL: https://doi.org/10.18653/v1/2021.eacl-main.218. doi:10.18653/V1/2021.EACL-MAIN.218.
[10] A. Binder, B. Verma, L. Hennig, Full-text argumentation mining on scientific publications, CoRR abs/2210.13084 (2022). URL: https://doi.org/10.48550/arXiv.2210.13084. doi:10.48550/ARXIV.2210.13084. arXiv:2210.13084.
[11] A. Bastos, A. Nadgeri, K. Singh, I. O. Mulang, S. Shekarpour, J. Hoffart, M. Kaul, RECON: Relation extraction using knowledge graph context in a graph neural network, in: Proceedings of the Web Conference 2021, WWW '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 1673–1685. URL: https://doi.org/10.1145/3442381.3449917. doi:10.1145/3442381.3449917.
[12] M. Eberts, A. Ulges, Span-based joint entity and relation extraction with transformer pre-training, in: G. D. Giacomo, A. Catalá, B. Dilkina, M. Milano, S. Barro, A. Bugarín, J. Lang (Eds.), ECAI 2020 - 24th European Conference on Artificial Intelligence, Santiago de Compostela, Spain, August 29 - September 8, 2020 - Including 10th Conference on Prestigious Applications of Artificial Intelligence (PAIS 2020), volume 325 of Frontiers in Artificial Intelligence and Applications, IOS Press, 2020, pp. 2006–2013. URL: https://doi.org/10.3233/FAIA200321. doi:10.3233/FAIA200321.
[13] L. Liu, A. Omidvar, Z. Ma, A. Agrawal, A. An, Unsupervised knowledge graph generation using semantic similarity matching, in: C. Cherry, A. Fan, G. Foster, G. R. Haffari, S. Khadivi, N. V. Peng, X. Ren, E. Shareghi, S. Swayamdipta (Eds.), Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, Association for Computational Linguistics, Hybrid, 2022, pp. 169–179. URL: https://aclanthology.org/2022.deeplo-1.18. doi:10.18653/v1/2022.deeplo-1.18.
[14] Artifex, PyMuPDF, https://pypi.org/project/PyMuPDF/, 2024. Accessed: 2024-02-29.
[15] A. S. Schwartz, M. A. Hearst, A simple algorithm for identifying abbreviation definitions in biomedical text, in: R. B. Altman, A. K. Dunker, L. Hunter, T. E. Klein (Eds.), Proceedings of the 8th Pacific Symposium on Biocomputing, PSB 2003, Lihue, Hawaii, USA, January 3-7, 2003, 2003, pp. 451–462. URL: http://psb.stanford.edu/psb-online/proceedings/psb03/schwartz.pdf.
[16] M. Neumann, D. King, I. Beltagy, W. Ammar, ScispaCy: Fast and robust models for biomedical natural language processing, in: D. Demner-Fushman, K. B. Cohen, S. Ananiadou, J. Tsujii (Eds.), Proceedings of the 18th BioNLP Workshop and Shared Task, Association for Computational Linguistics, Florence, Italy, 2019, pp. 319–327. URL: https://aclanthology.org/W19-5034. doi:10.18653/v1/W19-5034.
[17] T. Hoppe, NLPre, https://github.com/NIHOPA/NLPre, 2024. Accessed: 2024-04-15.
[18] X. Wei, M. R. U. Hoque, J. Wu, J. Li, ClaimDistiller: Scientific claim extraction with supervised contrastive learning, in: C. Zhang, Y. Zhang, P. Mayr, W. Lu, A. Suominen, H. Chen, Y. Ding (Eds.), Proceedings of the Joint Workshop of the 4th Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE2023) and the 3rd AI + Informetrics (AII2023), co-located with JCDL 2023, Santa Fe, New Mexico, USA and Online, 26 June 2023, volume 3451 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 65–77. URL: https://ceur-ws.org/Vol-3451/paper11.pdf.
[19] A. Fergadis, D. Pappas, A. Karamolegkou, H. Papageorgiou, Argumentation mining in scientific literature for sustainable development, in: K. Al-Khatib, Y. Hou, M. Stede (Eds.), Proceedings of the 8th Workshop on Argument Mining, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 100–111. URL: https://aclanthology.org/2021.argmining-1.10. doi:10.18653/v1/2021.argmining-1.10.
[20] F. Dernoncourt, J. Y. Lee, PubMed 200k RCT: a dataset for sequential sentence classification in medical abstracts, in: G. Kondrak, T. Watanabe (Eds.), Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Asian Federation of Natural Language Processing, Taipei, Taiwan, 2017, pp. 308–313. URL: https://aclanthology.org/I17-2052.
[21] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength Natural Language Processing in Python (2020). doi:10.5281/zenodo.1212303.
[22] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3615–3620. URL: https://aclanthology.org/D19-1371. doi:10.18653/v1/D19-1371.
[23] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186. URL: https://doi.org/10.18653/v1/n19-1423. doi:10.18653/V1/N19-1423.
[24] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, SciPy 1.0 Contributors, SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python, Nature Methods 17 (2020) 261–272. URL: https://doi.org/10.1038/s41592-019-0686-2. doi:10.1038/s41592-019-0686-2.
[25] K. Ethayarajh, D. Duvenaud, G. Hirst, Towards understanding linear word analogies, in: A. Korhonen, D. Traum, L. Márquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 3253–3262. URL: https://aclanthology.org/P19-1315. doi:10.18653/v1/P19-1315.
[26] L. Gyongyosi, S. Imre, A survey on quantum computing technology, Computer Science Review 31 (2019) 51–71. URL: https://doi.org/10.1016/j.cosrev.2018.11.002. doi:10.1016/J.COSREV.2018.11.002.
[27] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J. Nie, J. Wen, A survey of large language models, CoRR abs/2303.18223 (2023). URL: https://doi.org/10.48550/arXiv.2303.18223. doi:10.48550/ARXIV.2303.18223. arXiv:2303.18223.
[28] G. A. Miller, WordNet: A lexical database for English, Communications of the ACM 38 (1995) 39–41.
[29] T. Chklovski, P. Pantel, VerbOcean: Mining the web for fine-grained semantic verb relations, in: D. Lin, D. Wu (Eds.), Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 33–40. URL: https://aclanthology.org/W04-3205.
[30] A. Kucharavy, Z. Schillaci, L. Maréchal, M. Würsch, L. Dolamic, R. Sabonnadiere, D. P. David, A. Mermoud, V. Lenders, Fundamentals of generative large language models and perspectives in cyber-defense, 2023. arXiv:2303.12132.
6. Appendix

6.1. Abbreviation detection algorithms

During the preprocessing of papers, we expand abbreviations and map them to their long form. To compare the performance of different abbreviation detection algorithms, we evaluate them on the paper Fundamentals of Generative Large Language Models and Perspectives in Cyber-Defense, of which we have a thorough understanding [30]. Table 4 shows the performance of the Schwartz-Hearst [15], scispaCy [16] and NLPre [17] abbreviation detection methods. The results show that the Schwartz-Hearst algorithm performs best, though the scispaCy implementation has a similar performance. However, the Schwartz-Hearst algorithm is much faster, so we chose this approach.

Table 4: Performance of three abbreviation detection algorithms on the paper Fundamentals of Generative Large Language Models and Perspectives in Cyber-Defense [30].

Method          | Correctly detected abbreviations | Falsely detected abbreviations | Processing time
Schwartz-Hearst | 11                               | 1                              | 0.04 s
scispaCy        | 11                               | 4                              | 8.73 s
NLPre           | 0                                | 0                              | 0.01 s
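For reference, a Schwartz-Hearst implementation is available in the third-party abbreviations Python package; that this is the implementation used here is our assumption, as the paper does not name one, and the sentence is the Table 1 example.

```python
from abbreviations import schwartz_hearst  # pip install abbreviations

text = ("Society has been affected by artificial intelligence (AI) "
        "and has become more reliant on AI products.")

# Map each detected short form to its long form, then expand occurrences
pairs = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text=text)
for short, long_form in pairs.items():
    text = text.replace(f"({short})", "").replace(short, long_form)

print(pairs)  # expected: {'AI': 'artificial intelligence'}
print(text)
```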
6.2. Triplet extraction

Figure 3 below illustrates the number of triplets extracted per paper. The results indicate that for most papers, less than 30 triplets are extracted. However, the right tail is long, which shows that there are outliers for which over 200 triplets are extracted. Table 5 shows the hyperparameters for the extraction of triplets for the arXiv papers from the categories cs.AI, cs.CL and cs.LG from December 2023.

Figure 3: Histogram of the number of triplets extracted per paper from the categories cs.AI, cs.CL and cs.LG in December 2023.

Table 5: Hyperparameter settings for the triplet extraction.

Parameter                            | Value
Maximum length triplet component     | 6
Threshold claim extraction           | 0.05
Threshold for book corpus filtering  | 0.10
Threshold for arXiv corpus filtering | 10
Threshold subject clustering         | 0.1
Threshold object clustering          | 0.1

6.3. Triplet clustering

Figure 4 shows, for each cluster size, the fraction of the clusters that only contain triplets from one category of data and the fraction that contain triplets from both categories. The results show that smaller clusters more often tend to have triplets from one category, whereas larger clusters are more often mixed. Figure 5 shows the dendrograms for the clustering of the subjects and objects. We have chosen a cutoff of 0.10 for both, as this maintained a high similarity between the subjects and objects within the same cluster.

Figure 4: Fraction of the clusters with all triplets belonging to one category and fraction of clusters with triplets from both categories.

Figure 5: Dendrogram of the clustering of a subset of 350 subjects and 350 objects from the arXiv papers from the categories cs.AI, cs.CL and cs.LG in December 2023.

6.4. Embedding inversion

Table 6 shows 8 examples of embedding inversions, where the 10 most similar tokens are presented. Furthermore, the distance between the verbs (1 minus the cosine similarity) is shown. We find that the embedding inversions are not clearly interpretable, as the top 10 embedding inversions do not reflect the differences between the predicates. The embedding inversions are either similar to one of the two predicates, or seemingly unrelated to both. Therefore, embedding inversion cannot be used to assess whether triplets are aligned or contradictory.

Table 6: Manually selected triplets from the arXiv categories cs.AI, cs.CL and cs.LG from December 2023. The distance is defined as 1 minus the cosine similarity between the verb embeddings.

- Triplets: (example, illustrates, behavior), (example, mimic, behavior). Verbs: illustrates, mimic. Distance: 0.75. Top 10 inversions: ILLUSTRATES, ILLUSTRATING, schematically, EXEMPLARY, SUMMARIZES, DEPICTS, EMBODIMENT, ILLUSTRATED, ILLUSTRATIVE, DESCRIBES.
- Triplets: (architecture, accomplishes, score), (architecture, achieves, score). Verbs: accomplishes, achieves. Distance: 0.22. Top 10 inversions: rigamarole, ERRAND, busywork, AFTERWORDS, canvasing, thigns, harrasing, forementioned, explaning, Busy-Work.
- Triplets: (type rnns, perform, baseline model), (type rnns, outperform, baseline model). Verbs: performs, outperform. Distance: 0.32. Top 10 inversions: PERFORMS, PERFORMING, PERFORMED, PERFORM, SINGS, CONCERT, ACTs, SONG, RENDITION, PLAYS.
- Triplets: (subgradient method, not ensure, convergence), (gradient algorithm, enjoy, convergence). Verbs: not ensure, enjoy. Distance: 0.68. Top 10 inversions: COMPLIANCE, COMPLY, INSUFFICIENT, IDENTIFIED, ENSURE, AUDIT, INDICATED, NON-COMPLIANCE, DETERMINES, IMPROPERLY.
- Triplets: (image representation, extract, concept), (image representation, capture, concept). Verbs: extract, capture. Distance: 0.49. Top 10 inversions: EXTRACT, EXTRACTS, DECOCTION, TINCTURE, GINSENG, TURMERIC, Comfrey, KOLA, ALOE, Stevia.
- Triplets: (language model, incurs, cost), (language model, slash, cost). Verbs: incurs, slash. Distance: 0.79. Top 10 inversions: INCURS, Accrues, ASCERTAINS, INCUR, INCURRING, incure, howsoever, INCURRED, internalizes, Indemnified.
- Triplets: (text, represents, knowledge), (text, requires, knowledge). Verbs: represents, requires. Distance: 0.45. Top 10 inversions: REPRESENTS, REPRESENTED, REPRESENTING, RepresENT, ABSCISSA, symbolises, PERSONIFIES, DEPICTS, symbolised, Respresents.
- Triplets: (knowledge transfer, demonstrates, improvement), (knowledge transfer, not maintain, improvement). Verbs: demonstrates, not maintain. Distance: 0.49. Top 10 inversions: DEMONSTRATES, demonstates, demostrates, EXEMPLIFIES, Dissects, Elucidates, DECONSTRUCTS, ILLUSTRATES, explicates, EXPLORES.