LLM-Resilient Bibliometrics: Factual Consistency Through Entity Triplet Extraction

Alexander Sternfeld¹, Andrei Kucharavy², Dimitri Percia David², Alain Mermoud¹ and Julian Jang-Jaccard¹

¹ Cyber-Defence Campus, armasuisse Science and Technology, Thun, Switzerland
² Institute of Entrepreneurship & Management, HES-SO Valais-Wallis

Joint Workshop of the 5th Extraction and Evaluation of Knowledge Entities from Scientific Documents and the 4th AI + Informetrics (EEKE-AII2024), April 23-24, 2024, Changchun, China and Online. Contact: alex1.sternfeld@gmail.com (A. Sternfeld). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The increase in power and availability of Large Language Models (LLMs) since late 2022 has led to increased concerns about their usage to automate academic paper mills. This, in turn, poses a threat to bibliometrics-based technology monitoring and forecasting in rapidly moving fields. We propose to address this issue by leveraging semantic entity triplets. Specifically, we extract factual statements from scientific papers and represent them as (subject, predicate, object) triplets before validating the factual consistency of statements within and between scientific papers. This approach heavily penalizes blind usage of stochastic text generators such as LLMs while not penalizing authors who used LLMs solely to improve the readability of their paper. Here, we present a pipeline to extract such triplets and compare them. While our pipeline is promising and sensitive enough to detect inconsistencies between papers from different domains, the intra-paper entity reference resolution needs to be improved to ensure that triplets are more specific. We believe that our pipeline will be useful to the general research community working on the factual consistency of scientific texts.

Keywords: Bibliometrics, Entity Extraction, Machine Learning, Technological Forecasting, Quantum Computing

1. Introduction

For firms to make informed investment decisions, sound forecasts on the development of technologies are necessary. One prominent method for technology forecasting is bibliometrics, which uses the information in scholarly books and journals [1]. Modern bibliometric methods leverage the increase in available data by applying machine-learning methods. For example, Percia David et al. (2023) analyse arXiv pre-prints to evaluate the security development of information technologies [2].

While scientific publications are thus increasingly more important for technology forecasting, the quality of the papers must be evaluated critically. The publish-or-perish pressure has led to a record growth in the number of scientific publications per author, often with minimal peer review [3, 4]. In such a setting, if LLMs can generate text that sufficiently resembles a scientific article to pass for one on a cursory reading, they are likely to be used to generate scores of articles. Unfortunately, this eventuality is already likely to be a reality, given that Majovsky et al. (2023) showed that ChatGPT can create an authentic-looking neurosurgery scientific article [5].

Recently, there has been a growing interest in identifying text generated by LLMs. As early as 2019, Zellers et al. showed that a GPT-2-like LLM, Grover, could detect its own output. However, recent research suggests that, in general, LLM detectors either do not work or are easy to evade [6, 7]. Overall, for a minimally competent attacker who wants to evade detection, LLM detectors cannot be relied upon.

Unfortunately, the situation is serious enough for some of the most reputable providers of proxies of the impact of scientific articles to have modified their algorithms to only consider publications adhering to stringent criteria [8]. Due to the velocity of innovation and the reliance on preprint repositories, such an approach is not adapted to technology monitoring in the domains adjacent to cyber-security and machine learning. Because of these factors, we investigate whether factual consistency could be used for LLM-resilient bibliometrics instead.

Specifically, we represent facts as entity triplets of the form (subject, predicate, object) that are extracted from the claims of the paper. The entity triplet plays a crucial role, as it serves as a proxy to understand the primary claims of the paper and subsequently validates factual consistency compared to other works in the domain. Our paper describes the workflow involved in entity triplet extraction and provides an overview of our initial findings regarding the effectiveness of the entity triplets and their relation to the number of clusters generated around the subject. The code of this project is available at https://github.com/technometrics-lab/0-Factual_Consistency_Through_Entity_Triplets, at commit c7b01e4.
2. Related work

Previous approaches for claim extraction can be categorized into heuristic and machine learning methods. The advantage of an approach based on heuristics is that no training data is required and the computational cost tends to be low. However, machine learning approaches can capture more complex patterns, leading to the extraction of triplets of higher quality. Such methods have been developed most prominently in the biomedical domain. For example, Li et al. (2021) use BiLSTMs to extract the factual statements presented in papers [9]. Although less labeled data is available, there has been work focusing on claim extraction from papers in other domains. For instance, Binder et al. (2022) use BiLSTMs for argumentative discourse unit recognition and argumentative relation extraction [10].

The majority of existing triplet extraction models use supervised training. Two notable examples are RECON and SpERT, which require labeled training data [11, 12]. The disadvantage of supervised methods is the need for training data and the dependency on the relations that are present in the dataset. In contrast, unsupervised methods do not need training data and use either heuristics or machine learning methods to extract triplets. One example of such a model is Stanford OpenIE, which extracts relational tuples without the need to specify a schema in advance. However, it has been shown that OpenIE tends to extract too aggressively, resulting in the presence of non-useful relations [13]. We contribute by providing a method that extracts triplets from scientific papers with high precision, while the user only needs to specify the desired research categories.

3. Methodology

The process of extracting informative triplets from raw PDFs consists of four main stages. First, PDFs are converted to text files, after which they are preprocessed to remove word breaks and citations, expand abbreviations and lemmatize words. Then, we extract the sentences from the paper that convey its core ideas, which we refer to as claims. From these claims, we extract subject-predicate-object triplets. The last step is to process these triplets further so that they can be used in a comparative analysis. The entire pipeline is displayed in Figure 1. In the following subsections, we elaborate on each of the steps. While we focus on arXiv, the approach is generalizable to all scientific PDFs.

Figure 1: The complete pipeline for extracting entity triplets from the raw articles from the arXiv archive.

3.1. Preprocessing

To convert the PDFs to text we use the PyMuPDF library [14]. We then further clean the text by removing bracketed citations and merging words that were split due to line breaks. We then expand abbreviations using a rule-based algorithm introduced by Schwartz and Hearst (2003) [15]. In Appendix 6.1 we show that the Schwartz-Hearst algorithm outperforms the scispaCy [16] and NLPre [17] abbreviation detection methods, which are built on spaCy. Moreover, due to its rule-based nature, the algorithm is relatively fast. Table 1 shows the transition from a raw sentence to a preprocessed result.

Table 1: Illustration of the preprocessing steps.

Uncleaned    | Society has been affected by artificial intelligence (AI) and has become more rel- iant on AI products.
Preprocessed | Society has been affected by artificial intelligence and has become more reliant on artificial intelligent products.
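To make these first cleaning steps concrete, below is a minimal sketch using PyMuPDF [14] and regular expressions. The file name and the exact regex patterns are illustrative assumptions rather than the patterns of our pipeline; abbreviation expansion with the Schwartz-Hearst algorithm [15] and lemmatization would follow as separate steps.

```python
import re

import fitz  # PyMuPDF [14]

def pdf_to_clean_text(path: str) -> str:
    """Convert a PDF to text, merge hyphenated line breaks, drop citations."""
    doc = fitz.open(path)
    text = "\n".join(page.get_text() for page in doc)
    # Merge words split across line breaks, e.g. "rel-\niant" -> "reliant"
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Remove bracketed citations such as [1] or [3, 4]
    text = re.sub(r"\[\d+(?:\s*,\s*\d+)*\]", "", text)
    # Collapse the remaining whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

if __name__ == "__main__":
    print(pdf_to_clean_text("paper.pdf")[:300])  # hypothetical input file
```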
3.2. Claim and triplet extraction

After preprocessing the text, we identify the sentences that convey the authors' claims. Specifically, we use the ClaimDistiller framework developed by Wei et al. (2023) [18]. In their work, both CNNs and BiLSTMs are trained for claim extraction on the PubMed-RCT and SciARK datasets [19, 20]. Although the usage of supervised contrastive training improves the performance of the model, it causes a computational overhead. We therefore use the BiLSTM without supervised contrastive learning to strike a balance between performance and computational efficiency. We extract the claims from the papers so that the subsequent triplet extraction has to be performed on fewer sentences.

Next, we reduce the claims to (subject, predicate, object) triplets, analogous to the Resource Description Framework (RDF) format commonly used in the representation of OWL ontologies. We choose this representation as it facilitates the comparison of claims across papers. We use the Python library textacy, which is built on spaCy [21] and has a built-in extraction method that does not require the specification of relations in advance.
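As a sketch of this step, textacy's schema-free SVO extraction can be applied to a claim sentence as follows. The claim sentence is a constructed example; in the pipeline, only sentences retained by ClaimDistiller would be processed.

```python
import spacy
import textacy.extract

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# A constructed example of a sentence surviving claim extraction
claim = "Reinforcement learning inherits training instability from the sampling algorithm."
doc = nlp(claim)

# textacy's built-in SVO extraction requires no pre-specified relation schema
for triple in textacy.extract.subject_verb_object_triples(doc):
    subject = " ".join(t.text for t in triple.subject)
    predicate = " ".join(t.text for t in triple.verb)
    obj = " ".join(t.text for t in triple.object)
    print((subject, predicate, obj))
```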
3.3. Post-processing of triplets

Our goal is to compare triplets across papers. Therefore, we further process the triplets so that we can pair triplets from different papers that refer to the same subject. The following steps are applied:

1. Lowercase all words in the triplet.
2. Remove triplets where either the subject or object contains more than 6 words.
3. Remove stopwords from the triplets, based on the list included in NLTK.
4. Remove any character that is not text.
5. Lemmatize verbs and nouns in the triplet.
6. Remove words containing less than 3 characters.
7. Filter the triplets non-specific to scientific work by comparison with a general book corpus.
8. Filter the triplets characteristic of scientific works in general by comparison with arXiv articles from different categories.

In the second step, we choose this cutoff as we expect that phrases of over 6 words may contain nuances that cannot be captured in a simple subject-predicate-object relation. Words with less than 3 characters are removed, as we observed that such words were often noise. Moreover, as abbreviations are expanded, we expect all informative terms to be at least of length 3.

We use the general-purpose Gutenberg book corpus to filter the triplets that carry little information. We define $f_{b,i}$ and $f_{p,i}$ as the number of documents in the book corpus and the paper corpus, respectively, in which term $i$ appears at least 5 times, and $N_b$ and $N_p$ as the total number of documents in each corpus. We then assign a score $s_i$ to each term $i$:

$$
s_i =
\begin{cases}
-\infty & \text{if } f_{p,i} < 10,\\
\log\left(\frac{f_{p,i}}{N_p}\right) - \log\left(\frac{f_{b,i}}{N_b}\right) & \text{if } f_{p,i} \geq 10 \text{ and } f_{b,i} > 0,\\
\infty & \text{if } f_{b,i} = 0 \text{ and } f_{p,i} \geq 10.
\end{cases}
$$

Terms that are not present in at least 10 papers thus get a score of $-\infty$. If the term is present in at least 10 papers, the score increases as the frequency of the term in the book corpus decreases. We keep the triplets with subjects in the top 10% of the term scores.

In the last step, we aim to keep only the triplets that carry domain-specific information. Therefore, we sample a random subset of 1000 arXiv papers from December 2023 from different categories than our target papers. We then only keep the triplets whose subjects are present in a maximum of 15 papers.
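The book-corpus term score above can be implemented directly. In the sketch below, n_papers and n_books stand for $N_p$ and $N_b$, and the example counts are invented for illustration.

```python
import math

def term_score(f_p: int, f_b: int, n_papers: int, n_books: int) -> float:
    """Score s_i for term i, where f_p and f_b count the documents in the
    paper and book corpora in which the term appears at least 5 times."""
    if f_p < 10:
        return -math.inf  # too rare in the paper corpus to be kept
    if f_b == 0:
        return math.inf   # absent from the book corpus: maximally specific
    return math.log(f_p / n_papers) - math.log(f_b / n_books)

# Invented counts: a term in 120 of 4225 papers and in 3 of 3000 books
print(term_score(f_p=120, f_b=3, n_papers=4225, n_books=3000))
```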
3.4. Clustering

As we use the extracted triplets to compare papers, it is necessary to cluster them based on the subject and object. Both SciBERT [22] encodings and spaCy [21] embeddings were considered. Based on visual inspection, the resulting clusters are most coherent when using SciBERT, a language model based on BERT [23] that is pretrained on a large multi-domain corpus of scientific publications. After encoding both the subjects and the objects, we apply an agglomerative hierarchical clustering algorithm from SciPy [24], which compares the average distance between clusters. By visually inspecting the dendrograms, we set a cutoff to obtain the final clusters. Figure 5 in the appendix shows an example of the dendrograms for a subset of the subjects and objects. The threshold is chosen at the height where the distance between clusters begins to noticeably increase.
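A minimal sketch of this clustering step follows, assuming the allenai/scibert_scivocab_uncased checkpoint from Hugging Face and mean pooling over the last hidden states (the paper does not fix a pooling strategy); the subject list is invented and the 0.10 cutoff follows Table 5 in the appendix.

```python
import numpy as np
import torch
from scipy.cluster.hierarchy import fcluster, linkage
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

def encode(phrases: list[str]) -> np.ndarray:
    """Mean-pooled SciBERT encodings for triplet subjects or objects."""
    batch = tokenizer(phrases, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

subjects = ["reinforcement learning", "reward model", "alice protocol"]
# Average-linkage agglomerative clustering with a dendrogram cutoff of 0.10
Z = linkage(encode(subjects), method="average", metric="cosine")
labels = fcluster(Z, t=0.10, criterion="distance")
print(dict(zip(subjects, labels)))
```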
3.5. Triplet comparison

After clustering, we compare the triplets within the same cluster based on their predicates. We take a first step in this direction by analysing embedding inversions, as simple vector arithmetic can provide valuable insights into word relationships, such as negation or gender variants [25]. Specifically, we subtract the spaCy embeddings of the predicates and study the tokens closest to the resulting vector. Although SciBERT encodings likely contain more semantic information, the input sequence is embedded along with its context, hence a SciBERT encoding cannot easily be inverted back to a token.
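A sketch of this inversion with spaCy static vectors is shown below; the en_core_web_md model is our assumption, since the paper does not name the specific spaCy model used.

```python
import numpy as np
import spacy

# A spaCy model with static word vectors is required (assumed: en_core_web_md)
nlp = spacy.load("en_core_web_md")

def invert_difference(verb_a: str, verb_b: str, n: int = 10) -> list[str]:
    """Return the n vocabulary tokens closest to the difference of the
    two predicate embeddings (cf. Table 6 in the appendix)."""
    diff = nlp.vocab[verb_a].vector - nlp.vocab[verb_b].vector
    keys, _, _ = nlp.vocab.vectors.most_similar(np.asarray([diff]), n=n)
    return [nlp.vocab.strings[key] for key in keys[0]]

print(invert_difference("illustrates", "mimic"))
```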
4. Results

4.1. Data

We consider two different datasets to evaluate our method. First, to leverage in-house expertise in the domain of computer science and natural language processing (NLP), we focus on publications relevant to that domain and retrieve data from the arXiv categories cs.AI, cs.CL and cs.LG. Specifically, we retrieve all papers from December 2023, which amounts to a total of 4225 research articles.

Second, for a quantitative analysis of the usefulness of triplets for factual consistency evaluation, we consider two surveys. To validate our approach in an independent domain, we considered both a survey on LLMs and a survey on quantum computing. Specifically, we analyse a survey on LLMs by Zhao et al. (2023) [27] and a survey on quantum computing technologies by Gyongyosi and Imre (2019) [26]. We construct a dataset comprising these two surveys and the arXiv preprints cited by them. We limit ourselves to the papers for which an arXiv ID was provided in the references of the survey, leading to a total of 188 papers. In the subsequent sections, we refer to the papers related to the LLM survey as the LLM data and to the papers related to the quantum computing survey as the quantum data.

4.2. Cluster analysis

4.2.1. CS articles December 2023

After preprocessing the articles, we extract the triplets. For the triplet extraction, we used the hyperparameters displayed in Table 5 in the appendix. In total, 79,986 triplets were extracted from the research articles. We cluster these triplets based on both the subject and object embeddings, which resulted in 37,076 clusters. Figure 3 in the appendix shows the distribution of the number of triplets per cluster. It shows that most clusters contain less than 25 triplets, but that there are outliers that contain over 200 triplets.

4.2.2. LLM and quantum computing surveys

For a more in-depth analysis, we consider the triplets extracted from the LLM and quantum computing surveys and their cited papers. In total, 1895 triplets were extracted from the 188 papers. We cluster these triplets based on the subjects so we can evaluate the differences between the objects in a cluster. Figure 4 in the appendix shows that larger clusters often have triplets from multiple categories, whereas small clusters tend to have triplets from only one category.

Next, we make pairwise comparisons between the object embeddings within a cluster. Figure 2 shows the pairwise distances (L2 norm) between objects for each cluster size. We find that for clusters below size 8, the distance between objects from the LLM data and the quantum data is larger than the distances between objects within a category. For larger cluster sizes, this effect disappears. This indicates that for smaller clusters with triplets from both categories, the objects are more diverse. Furthermore, we see that for clusters with triplets from one category, the distance between objects increases for larger sizes. This confirms that larger clusters are more domain-agnostic and contain more varied objects.

Figure 2: The average distance (L2 norm) between objects of triplets for different cluster sizes.

Table 2 shows manually selected clusters of sizes 2, 5 and 8. The column with mixed data clusters shows clusters that contain triplets from both the LLM data and the quantum data. For smaller clusters, the triplets within a dataset tend to be similar and vary only slightly. In contrast, the objects differ more for the mixed data clusters, as they are domain-specific. On the other hand, the results suggest that larger clusters contain more domain-agnostic subjects and objects, such as lab. Consequently, the distance between objects from different datasets differs less from the distance between objects from the same data. Further manual inspection supports this hypothesis, with large clusters containing subjects such as appendix and conclusion.

Table 2: Manually selected clusters of sizes 2, 5 and 8. The mixed data clusters contain triplets from both the quantum data and the LLM data.

Cluster size 2:
- Pure LLM cluster: (expert knowledge, enhance, data utility); (expert knowledge, employed, data utility)
- Pure quantum cluster: (bundle map, determine, representation); (tangent map, analyse, map)
- Mixed data cluster: (minimizer, satisfies, argmin); (minimizer, given, tetrahedron)

Cluster size 5:
- Pure LLM cluster: (reinforcement learning, requires, language model); (reinforcement learning, based, sampling algorithm); (reinforcement learning, learns, reward model); (reinforcement learning, offer, evaluation); (reinforcement learning, inherits, drawback training instability)
- Pure quantum cluster: (complement, express, failure); (complement, contains, function); (complement, must contain, part); (complement, not capture, complement); (complement, not represented, set)
- Mixed data cluster: (time duration, cover, feature); (spn value, shown, symbol); (duration value, have, spacing); (dene value, introduce, reduction relation); (dene value, dene, progress relation)

Cluster size 8:
- Pure LLM cluster: (neuron, receives, impulse); (neuron, displayed, difference); (neuron, reaching, average rate); (neuron, not, impact); (neuron, coupled, weight wji); (neuron, changed, activity); (neuron, described, pair); (neuron, sends, impulse); (neuron, make, decision)
- Pure quantum cluster: (alice protocol, avoids, computing requirement); (alice protocol, ha, difference); (alice protocol, requires, bob); (alice protocol, consumes, network bandwith); (alice protocol, reduce, quantum computation); (alice protocol, ha, advantage); (alice protocol, provides, fault tolerance); (alice protocol, preserve, tolerance ability)
- Mixed data cluster: (lab, improved, learning); (lab, offered, value); (lab, had, benefit); (lab, offer, introduction); (lab, present, environment); (lab, improve, performance); (lab, demonstrates, concept); (lab, outline, qml solution)

Overall, we argue that this means that triplets as extracted by our pipeline can be used as proxies for factual consistency, but that additional refinement is needed to avoid extracting overly generic statements.

4.3. Factual consistency

4.3.1. Predicate comparisons

As a first step in evaluating the factual consistency between papers, triplets in the same cluster are compared based on their predicates. Specifically, two triplets are considered consistent when the predicates are synonyms, hypernyms or hyponyms. If the predicates are antonyms, the triplets are considered inconsistent. The WordNet and VerbOcean databases are used to label the pairs of predicates [28, 29].

Table 3: Number of consistent and inconsistent triplet pairs.

              | Inconsistent triplet pairs | Consistent triplet pairs
Across papers | 2044                       | 434,362
Within papers | 175                        | 4549

Table 3 shows that the majority of triplet pairs within a cluster are consistent. Both within and across papers, inconsistent triplet pairs are present. However, manual inspection shows that, in some cases, an inconsistent pair of triplets can be caused by differing contexts. For example, the pair (initialization, impede, optimization process) is marked as inconsistent with (initialization, accelerate, optimization process). This again shows that refinement is needed to extract more specific triplets.
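The WordNet side of this labeling can be sketched with NLTK as follows. Restricting the search to direct hypernyms, hyponyms and antonyms is our simplification, and the complementary VerbOcean lookup [29] is not shown.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def label_predicate_pair(verb_a: str, verb_b: str) -> str:
    """Label two predicates as consistent (synonym/hypernym/hyponym),
    inconsistent (antonym), or unknown, following Section 4.3.1."""
    synsets_b = set(wn.synsets(verb_b, pos=wn.VERB))
    for syn in wn.synsets(verb_a, pos=wn.VERB):
        if syn in synsets_b:
            return "consistent"  # shared synset, i.e. synonyms
        if synsets_b & set(syn.hypernyms()) or synsets_b & set(syn.hyponyms()):
            return "consistent"  # direct hypernym/hyponym relation
        for lemma in syn.lemmas():
            if any(ant.synset() in synsets_b for ant in lemma.antonyms()):
                return "inconsistent"  # direct antonym relation
    return "unknown"

print(label_predicate_pair("increase", "decrease"))  # expected: inconsistent
```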
4.3.2. Embedding inversion

For a more qualitative assessment, we invert the differences of the embeddings of predicates from the same cluster. Table 6 in the appendix shows a manual selection of 8 of these embedding inversions. The results show that embedding inversion does not provide informative results in this context. In general, we do not find a noticeable difference between embedding inversions of predicates that are consistent and those that are inconsistent.

5. Conclusion

This paper presents an unsupervised method for the extraction of triplets from scientific work. Whereas previous methods either require labeled training data or the pre-specification of entity relations, we allow for entity triplet extraction through domain specification alone.

The results show that the extracted triplets accurately reflect the domain of the corresponding scientific work. When we cluster triplets based on the subject, we find that smaller clusters tend to be domain-specific. In contrast, larger clusters are more generic and often contain triplets from different domains. We interpret this as extracted triplets being suitable for evaluating factual consistency, but requiring further refinement for a more specific extraction. We believe this is due to insufficient resolution of excessively general nouns (e.g. lab, conclusion). To compare the triplets, an embedding inversion was implemented on the difference of the verb embeddings for similar triplets. Our findings show that embedding inversion does not allow us to discriminate between consistent and inconsistent triplets.

Our results suggest that the next steps towards using the extracted triplets for the development of LLM-resilient proxies should focus on better filtering of domain-agnostic subjects, so that the triplets are informative about factual consistency. Then, a semantic network can be built based on the similarities between the triplets for the entirety of the scientific publications in a domain of interest. By leveraging this network, we can identify papers that are factually inconsistent or excessively consistent and use the remainder of the corpus for a bibliometric analysis.
References

[1] Y. Zhang, A. L. Porter, S. Cunningham, D. Chiavetta, N. Newman, Parallel or intersecting lines? Intelligent bibliometrics for investigating the involvement of data science in policy analysis, IEEE Transactions on Engineering Management 68 (2020) 1259–1271.
[2] D. Percia David, L. Maréchal, W. Lacube, S. Gillard, M. Tsesmelis, T. Maillart, A. Mermoud, Measuring security development in information technologies: A scientometric framework using arXiv e-prints, Technological Forecasting and Social Change 188 (2023) 122316. URL: https://www.sciencedirect.com/science/article/pii/S004016252300001X. doi:10.1016/j.techfore.2023.122316.
[3] M. A. Hanson, P. G. Barreiro, P. Crosetto, D. Brockington, The strain on scientific publishing, CoRR abs/2309.15884 (2023). URL: https://doi.org/10.48550/arXiv.2309.15884. doi:10.48550/ARXIV.2309.15884. arXiv:2309.15884.
[4] J. P. A. Ioannidis, R. Klavans, K. W. Boyack, Thousands of scientists publish a paper every five days, Nature 561 (2018) 167–169. URL: https://api.semanticscholar.org/CorpusID:52198631.
[5] M. Majovsky, M. Černý, M. Kasal, M. Komarc, D. Netuka, Artificial intelligence can generate fraudulent but authentic-looking scientific medical articles: Pandora's box has been opened, Journal of Medical Internet Research 25 (2023). doi:10.2196/46924.
[6] C. Chen, K. Shu, Can LLM-generated misinformation be detected?, 2023. arXiv:2309.13788.
[7] D. S. G. Henrique, A. Kucharavy, R. Guerraoui, Stochastic parrots looking for stochastic parrots: LLMs are easy to fine-tune and hard to detect with other LLMs, 2023. arXiv:2304.08968.
[8] Clarivate, 2024 journal citation reports, https://clarivate.com/blog/2024-journal-citation-reports-changes-in-journal-impact-factor-category-rankings-to-enhance-transparency-and-inclusivity/, 2024. Accessed: 2024-02-29.
[9] X. Li, G. A. Burns, N. Peng, Scientific discourse tagging for evidence extraction, in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19-23, 2021, Association for Computational Linguistics, 2021, pp. 2550–2562. URL: https://doi.org/10.18653/v1/2021.eacl-main.218. doi:10.18653/V1/2021.EACL-MAIN.218.
[10] A. Binder, B. Verma, L. Hennig, Full-text argumentation mining on scientific publications, CoRR abs/2210.13084 (2022). URL: https://doi.org/10.48550/arXiv.2210.13084. doi:10.48550/ARXIV.2210.13084. arXiv:2210.13084.
[11] A. Bastos, A. Nadgeri, K. Singh, I. O. Mulang, S. Shekarpour, J. Hoffart, M. Kaul, RECON: Relation extraction using knowledge graph context in a graph neural network, in: Proceedings of the Web Conference 2021, WWW '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 1673–1685. URL: https://doi.org/10.1145/3442381.3449917. doi:10.1145/3442381.3449917.
[12] M. Eberts, A. Ulges, Span-based joint entity and relation extraction with transformer pre-training, in: G. D. Giacomo, A. Catalá, B. Dilkina, M. Milano, S. Barro, A. Bugarín, J. Lang (Eds.), ECAI 2020 - 24th European Conference on Artificial Intelligence, Santiago de Compostela, Spain, August 29 - September 8, 2020 - Including 10th Conference on Prestigious Applications of Artificial Intelligence (PAIS 2020), volume 325 of Frontiers in Artificial Intelligence and Applications, IOS Press, 2020, pp. 2006–2013. URL: https://doi.org/10.3233/FAIA200321. doi:10.3233/FAIA200321.
[13] L. Liu, A. Omidvar, Z. Ma, A. Agrawal, A. An, Unsupervised knowledge graph generation using semantic similarity matching, in: C. Cherry, A. Fan, G. Foster, G. R. Haffari, S. Khadivi, N. V. Peng, X. Ren, E. Shareghi, S. Swayamdipta (Eds.), Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, Association for Computational Linguistics, Hybrid, 2022, pp. 169–179. URL: https://aclanthology.org/2022.deeplo-1.18. doi:10.18653/v1/2022.deeplo-1.18.
[14] Artifex, PyMuPDF, https://pypi.org/project/PyMuPDF/, 2024. Accessed: 2024-02-29.
[15] A. S. Schwartz, M. A. Hearst, A simple algorithm for identifying abbreviation definitions in biomedical text, in: R. B. Altman, A. K. Dunker, L. Hunter, T. E. Klein (Eds.), Proceedings of the 8th Pacific Symposium on Biocomputing, PSB 2003, Lihue, Hawaii, USA, January 3-7, 2003, 2003, pp. 451–462. URL: http://psb.stanford.edu/psb-online/proceedings/psb03/schwartz.pdf.
[16] M. Neumann, D. King, I. Beltagy, W. Ammar, ScispaCy: Fast and robust models for biomedical natural language processing, in: D. Demner-Fushman, K. B. Cohen, S. Ananiadou, J. Tsujii (Eds.), Proceedings of the 18th BioNLP Workshop and Shared Task, Association for Computational Linguistics, Florence, Italy, 2019, pp. 319–327. URL: https://aclanthology.org/W19-5034. doi:10.18653/v1/W19-5034.
[17] T. Hoppe, NLPre, https://github.com/NIHOPA/NLPre, 2024. Accessed: 2024-04-15.
[18] X. Wei, M. R. U. Hoque, J. Wu, J. Li, ClaimDistiller: Scientific claim extraction with supervised contrastive learning, in: C. Zhang, Y. Zhang, P. Mayr, W. Lu, A. Suominen, H. Chen, Y. Ding (Eds.), Proceedings of the Joint Workshop of the 4th Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE2023) and the 3rd AI + Informetrics (AII2023), co-located with JCDL 2023, Santa Fe, New Mexico, USA and Online, 26 June 2023, volume 3451 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 65–77. URL: https://ceur-ws.org/Vol-3451/paper11.pdf.
[19] A. Fergadis, D. Pappas, A. Karamolegkou, H. Papageorgiou, Argumentation mining in scientific literature for sustainable development, in: K. Al-Khatib, Y. Hou, M. Stede (Eds.), Proceedings of the 8th Workshop on Argument Mining, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 100–111. URL: https://aclanthology.org/2021.argmining-1.10. doi:10.18653/v1/2021.argmining-1.10.
[20] F. Dernoncourt, J. Y. Lee, PubMed 200k RCT: a dataset for sequential sentence classification in medical abstracts, in: G. Kondrak, T. Watanabe (Eds.), Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Asian Federation of Natural Language Processing, Taipei, Taiwan, 2017, pp. 308–313. URL: https://aclanthology.org/I17-2052.
[21] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength Natural Language Processing in Python (2020). doi:10.5281/zenodo.1212303.
[22] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3615–3620. URL: https://aclanthology.org/D19-1371. doi:10.18653/v1/D19-1371.
[23] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186. URL: https://doi.org/10.18653/v1/n19-1423. doi:10.18653/V1/N19-1423.
[24] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, SciPy 1.0 Contributors, SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python, Nature Methods 17 (2020) 261–272. URL: https://doi.org/10.1038/s41592-019-0686-2. doi:10.1038/s41592-019-0686-2.
[25] K. Ethayarajh, D. Duvenaud, G. Hirst, Towards understanding linear word analogies, in: A. Korhonen, D. Traum, L. Márquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 3253–3262. URL: https://aclanthology.org/P19-1315. doi:10.18653/v1/P19-1315.
[26] L. Gyongyosi, S. Imre, A survey on quantum computing technology, Computer Science Review 31 (2019) 51–71. URL: https://doi.org/10.1016/j.cosrev.2018.11.002. doi:10.1016/J.COSREV.2018.11.002.
[27] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J. Nie, J. Wen, A survey of large language models, CoRR abs/2303.18223 (2023). URL: https://doi.org/10.48550/arXiv.2303.18223. doi:10.48550/ARXIV.2303.18223. arXiv:2303.18223.
[28] G. A. Miller, WordNet: A lexical database for English, Communications of the ACM 38 (1995) 39–41.
[29] T. Chklovski, P. Pantel, VerbOcean: Mining the web for fine-grained semantic verb relations, in: D. Lin, D. Wu (Eds.), Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 33–40. URL: https://aclanthology.org/W04-3205.
[30] A. Kucharavy, Z. Schillaci, L. Maréchal, M. Würsch, L. Dolamic, R. Sabonnadiere, D. P. David, A. Mermoud, V. Lenders, Fundamentals of generative large language models and perspectives in cyber-defense, 2023. arXiv:2303.12132.
6. Appendix

6.1. Abbreviation detection algorithms

During the preprocessing of papers, we expand abbreviations and map them to their long form. To compare the performance of different abbreviation detection algorithms, we evaluate them on the paper Fundamentals of Generative Large Language Models and Perspectives in Cyber-Defense, of which we have a thorough understanding [30]. Table 4 shows the performance of the Schwartz-Hearst [15], scispaCy [16] and NLPre [17] abbreviation detection methods. The results show that the Schwartz-Hearst algorithm performs best, though the scispaCy implementation has a similar performance. However, the Schwartz-Hearst algorithm is much faster, so we chose this approach.

Table 4: Performance of three abbreviation detection algorithms on the paper Fundamentals of Generative Large Language Models and Perspectives in Cyber-Defense [30].

Method          | Correctly detected abbreviations | Falsely detected abbreviations | Processing time
Schwartz-Hearst | 11                               | 1                              | 0.04 s
scispaCy        | 11                               | 4                              | 8.73 s
NLPre           | 0                                | 0                              | 0.01 s
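For reference, a Schwartz-Hearst implementation is available in the third-party abbreviations Python package; that this is the implementation used here is our assumption, as the paper does not name one, and the sentence is the Table 1 example.

```python
from abbreviations import schwartz_hearst  # pip install abbreviations

text = ("Society has been affected by artificial intelligence (AI) "
        "and has become more reliant on AI products.")

# Map each detected short form to its long form, then expand occurrences
pairs = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text=text)
for short, long_form in pairs.items():
    text = text.replace(f"({short})", "").replace(short, long_form)

print(pairs)  # expected: {'AI': 'artificial intelligence'}
print(text)
```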
6.2. Triplet extraction

Figure 3 below illustrates the number of triplets extracted per paper. The results indicate that for most papers, less than 30 triplets are extracted. However, the right tail is long, which shows that there are outliers for which over 200 triplets are extracted. Table 5 shows the hyperparameters for the extraction of triplets for the arXiv papers from the categories cs.AI, cs.CL and cs.LG from December 2023.

Figure 3: Histogram of the number of triplets extracted per paper from the categories cs.AI, cs.CL and cs.LG in December 2023.

Table 5: Hyperparameter settings for the triplet extraction.

Parameter                            | Value
Maximum length triplet component     | 6
Threshold claim extraction           | 0.05
Threshold for book corpus filtering  | 0.10
Threshold for arXiv corpus filtering | 10
Threshold subject clustering         | 0.1
Threshold object clustering          | 0.1

6.3. Triplet clustering

Figure 4 shows, for each cluster size, the fraction of the clusters that only contain triplets from one category of data and the fraction that contain triplets from both categories. The results show that smaller clusters more often tend to have triplets from one category, whereas larger clusters are more often mixed. Figure 5 shows the dendrograms for the clustering of the subjects and objects. We have chosen a cutoff of 0.10 for both, as this maintained a high similarity between the subjects and objects within the same cluster.

Figure 4: Fraction of the clusters with all triplets belonging to one category and fraction of clusters with triplets from both categories.

Figure 5: Dendrogram of the clustering of a subset of 350 subjects and 350 objects from the arXiv papers from the categories cs.AI, cs.CL and cs.LG in December 2023.

6.4. Embedding inversion

Table 6 shows 8 examples of embedding inversions, where the 10 most similar tokens are presented. Furthermore, the distance between the verbs (1 minus the cosine similarity) is shown. We find that the embedding inversions are not clearly interpretable, as the top 10 embedding inversions do not reflect the differences between the predicates. The embedding inversions are either similar to one of the two predicates, or seemingly unrelated to both. Therefore, embedding inversion cannot be used to assess whether triplets are aligned or contradictory.

Table 6: Manually selected triplets from the arXiv categories cs.AI, cs.CL and cs.LG from December 2023. The distance is defined as 1 minus the cosine similarity between the verb embeddings.

- Triplets: (example, illustrates, behavior), (example, mimic, behavior). Verbs: illustrates, mimic. Distance: 0.75. Top 10 inversions: ILLUSTRATES, ILLUSTRATING, schematically, EXEMPLARY, SUMMARIZES, DEPICTS, EMBODIMENT, ILLUSTRATED, ILLUSTRATIVE, DESCRIBES.
- Triplets: (architecture, accomplishes, score), (architecture, achieves, score). Verbs: accomplishes, achieves. Distance: 0.22. Top 10 inversions: rigamarole, ERRAND, busywork, AFTERWORDS, canvasing, thigns, harrasing, forementioned, explaning, Busy-Work.
- Triplets: (type rnns, perform, baseline model), (type rnns, outperform, baseline model). Verbs: performs, outperform. Distance: 0.32. Top 10 inversions: PERFORMS, PERFORMING, PERFORMED, PERFORM, SINGS, CONCERT, ACTs, SONG, RENDITION, PLAYS.
- Triplets: (subgradient method, not ensure, convergence), (gradient algorithm, enjoy, convergence). Verbs: not ensure, enjoy. Distance: 0.68. Top 10 inversions: COMPLIANCE, COMPLY, INSUFFICIENT, IDENTIFIED, ENSURE, AUDIT, INDICATED, NON-COMPLIANCE, DETERMINES, IMPROPERLY.
- Triplets: (image representation, extract, concept), (image representation, capture, concept). Verbs: extract, capture. Distance: 0.49. Top 10 inversions: EXTRACT, EXTRACTS, DECOCTION, TINCTURE, GINSENG, TURMERIC, Comfrey, KOLA, ALOE, Stevia.
- Triplets: (language model, incurs, cost), (language model, slash, cost). Verbs: incurs, slash. Distance: 0.79. Top 10 inversions: INCURS, Accrues, ASCERTAINS, INCUR, INCURRING, incure, howsoever, INCURRED, internalizes, Indemnified.
- Triplets: (text, represents, knowledge), (text, requires, knowledge). Verbs: represents, requires. Distance: 0.45. Top 10 inversions: REPRESENTS, REPRESENTED, REPRESENTING, RepresENT, ABSCISSA, symbolises, PERSONIFIES, DEPICTS, symbolised, Respresents.
- Triplets: (knowledge transfer, demonstrates, improvement), (knowledge transfer, not maintain, improvement). Verbs: demonstrates, not maintain. Distance: 0.49. Top 10 inversions: DEMONSTRATES, demonstates, demostrates, EXEMPLIFIES, Dissects, Elucidates, DECONSTRUCTS, ILLUSTRATES, explicates, EXPLORES.