
Dec

Harnessing Il Manifesto Newspaper Archive for Knowledge Base Creation: Techniques and Findings in the MeMa Project

Robert J. Alexander

Matteo Bartocci

Oriana Persico

Guido Vetere

0 Human Ecosystems Relazioni S.R.L., via Umberto Guarnieri 15, 00177 Roma, Italy
1 Il Manifesto Soc. Coop., Via Angelo Bargoni 8, 00153 Roma, Italy
2 Isagog S.R.L., Via Faà di Bruno 54, 00195 Roma, Italy
3 Università Guglielmo Marconi, Via Plinio 44, 00193 Roma, Italy

2023


Abstract. The historical archive of the newspaper “il Manifesto” is a valuable asset protected by the Italian Ministry of Cultural Heritage. The MeMa project aims to create an “intelligent archive” based on an artificial intelligence that fosters collaboration and transparency. The platform, built around Apache Jena and open linguistic technologies, addresses the specific needs of the newspaper's community. This paper presents the platform's architecture, the knowledge base construction process, and future directions, discussing how AI can enhance journalism while respecting the principles of “Il Manifesto”.

Keywords: AI in journalism, open linguistic technologies, knowledge graphs, newspaper community

1. Introduction

The historical archive of the newspaper “il Manifesto” is an asset protected by the Italian Ministry of Cultural Heritage as being of particular interest 1. The archive includes a paper collection starting from 1971 and a digitized collection starting from the 1990s. The resource is now entrusted to the “Nuovo Manifesto Società Cooperativa Editrice”, which has published the newspaper and its digital editions since 2013. The cooperative is committed to maintaining and improving the archive, as well as to guaranteeing free access and digital consultation facilities to anyone interested in it 2. The digital archive, produced in different phases over the years, reflects the historical and technological evolution of the publishing sector. The database initially included 10,013 digitized files containing about 160,000 articles, with a few gaps in the years 1985-1986 and 1994-2002. Il Manifesto considers an “intelligent archive” to be the cornerstone of its digital strategy, and for this reason seeks to align it with new technologies through appropriate investments in research and development. The MeMa (Memoria Manifesta) project started in 2020 as a partnership with Salvatore Iaconesi 3 and Oriana Persico, with the aim of developing a new archive infrastructure based on Artificial Intelligence. This would be a “Community AI” [1] based on the principles of openness, transparency, collaboration and non-extractiveness, and thus able to establish productive relationships between the archive, the editorial staff, the user communities and society in general [2].

3 Salvatore Iaconesi (Livorno 1973, Reggio Calabria 2022) was an engineer, artist, hacker and interaction designer.

When the project was resumed in 2023, the new board decided to continue the original plan, evolving it in the direction of Linked Open Data and taking advantage of the latest advances in language and knowledge technologies. The idea was to build a standards-based Knowledge Graph (KG) using editorial metadata and structured information extracted from article text. By itself, this idea is by no means new [3] [4] [5]. There are also commercial platforms that have been offering solutions for the newspaper industry for some years now, such as Neo4j [6] or Ontotext [7]. However, we realized that the success of the project depended significantly on how the platform would adapt to the way content is produced, extracted, organised, enriched and experienced by the professional and user communities gathered around the newspaper. Rather than forcing these habits onto an out-of-the-box commercial platform, we opted to tailor a specific solution. Moreover, as a sociotechnical platform, MeMa should be open to curation and contribution by users (e.g. readers, archivists, and journalists), who collaboratively contribute to the evolution of the AI, including by correcting the inevitable errors of current NLP technologies. Hence, we started designing a custom platform around a core open graph database, namely Apache Jena 4, and a selection of open linguistic technologies suitable for the Italian language. The solution falls into the broad area of Enterprise Knowledge Graphs [8], which are gaining momentum as “rational counterparts” of generative linguistic technologies based on neural models [9].

This work is a first account of what emerged in the first months of analysis, design and development of the solution, and a discussion of our plans to meet the socio-technical requirements we have analyzed so far. Our contribution is a “reality check” of the use of knowledge and language technologies applied to complex texts produced by an Italian publishing community over more than 40 years of work. In general, our research concerns the interaction between digital systems and human beings, with the goal of making their contents fully transparent and accessible to different user communities. From a linguistic point of view, relevant aspects include the specificity of texts produced over a long period of time, characterized by a specific idiolect but also by diachronic variations.

This paper is organized as follows. In Section 2, we present an architectural overview of the platform under development. Section 3 delves into the process of constructing the knowledge base, detailing the steps involved in gathering and organizing the relevant information. In Section 4, we discuss challenges and ideas about the future directions. Note that automatic content generation is not included in the journalism enhancements driven by AI, as intended by “Il Manifesto”.

2. System Overview

MeMa’s software architecture comprises several components that work together to handle a graph database with indexed attributes, enabling efficient ingestion, analysis, and semantic querying. The key components of this architecture include:

1. Knowledge Graph: The core of the system is a graph database of the RDF (Resource Description Framework) family with inference capabilities, based on Apache Jena, the Pellet OWL reasoner, the search engine Lucene, and custom components, where a number of KG attributes are indexed and embedded to optimize search and retrieval operations.
2. NLP Service: A REST service that provides an abstraction layer over various NLP functionalities to support the system’s operations. It wraps capabilities such as text analysis, entity recognition, topic analysis, semantic similarity, and other NLP tasks based on open source transformers [10]. This service collaborates with the ingestion process to extract valuable insights from the content being ingested.
3. Ingestion Processor: A batch process that is responsible for ingesting content into the KG. This process integrates different sources, analyzes texts to extract relevant information using the NLP service, and produces RDF sources to feed the KG according to the MeMa ontology.
4. Query and Update Service: A REST service that is responsible for handling queries and update operations on the KG. It integrates similarity searches and SPARQL queries to retrieve relevant graph entities. This service leverages the indexed attributes to optimize query performance and speed up retrieval operations, and the NLP Service to transform users’ queries and evaluate response ranking.

This software architecture employs a services and API-based approach, enabling functional evolution, flexible deployment, and seamless scalability. The service architecture is an abstraction of a general functionality that can be applied to a variety of scenarios. Based on this design, we have developed custom application services that can be used in a front-end designed for the editorial staff of the newspaper.

“Il Manifesto” has a print edition and an online edition, each managed by its own Content Management System (CMS). The two editions largely coincide; however, each one may contain articles not present in the other. As a result, the same article (with slight variations) may be available in two different repositories. When consolidating all editorial content into one Knowledge Base, we had to harmonize and integrate the contents from both CMSs.
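The Query and Update Service described above combines graph retrieval with similarity-based ranking. The following is a minimal sketch of such a re-ranking step, not the MeMa implementation: it assumes that candidate entities have already been retrieved (for instance by a SPARQL query) and that each one carries an embedding vector. The function names, the identifiers and the three-dimensional toy vectors are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank(query_vec, candidates):
    """Sort (entity_id, embedding) pairs by decreasing similarity to the query."""
    scored = [(entity_id, cosine(query_vec, emb)) for entity_id, emb in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings standing in for real headline or summary embeddings (assumption).
candidates = [
    ("article/1", [1.0, 0.0, 0.0]),
    ("article/2", [0.0, 1.0, 0.0]),
    ("article/3", [0.7, 0.7, 0.0]),
]
ranking = rerank([1.0, 0.1, 0.0], candidates)
for entity_id, score in ranking:
    print(entity_id, round(score, 3))
```

In a deployed service the vectors would come from the NLP Service and the candidate set from the graph store; only the ordering logic is shown here.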

3. The Knowledge Base

Modeling editorial content in a KG requires the adoption of a suitable ontology. Although editorial content modeling has already been studied and tested [11], we did not identify a simple, well-established model that suited our needs. In particular, we aimed to represent how agents interpret specific tokens as referring to entities based on established conventions or procedures. In other words, we were interested in semiotics. To the best of our knowledge, even comprehensive conceptualizations such as the CIDOC Conceptual Reference Model [12], which includes linguistic and symbolic objects, do not provide modeling primitives to represent interpretation processes. This is why we decided to develop our own conceptualization, which we illustrate in the following section. Mappings to existing conceptual frameworks, such as schema.org 5, are preserved as annotations.

3.1. The MeMa Ontology

The MeMa ontology focuses on the way entities are mentioned, rather than on the characterization of those entities, which is mostly left to external sources. As such, the MeMa ontology adopts a semiotic perspective [13] in the line of [14] and [15]. The structure of our ontology is sketched as follows:

• Class: Sign. An immaterial entity that stands to someone (or something) for some other entity as the outcome of an interpretation
  – Subclass: Category. A sign standing for a class of entities
  – Subclass: Reference. A sign standing for a single (even collective) entity
  – Subclass: Topic. A sign standing for a focus of interest in a larger context
• Class: Information. An immaterial thing that conveys interconnected signs
  – Subclass: Text. A textual information object
  – Subclass: Sentence. Part of a text
  – Subclass: Token. Part of a sentence
• Class: Entity. A spatio-temporal thing
  – Subclass: Agent. An entity that has the capacity to initiate or perform actions
  – Subclass: Location. An identified portion of space
  – Subclass: Event. An entity that unfolds in time
  – Subclass: Object. An entity that unfolds in space
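To make the class structure listed above concrete, here is a small sketch that encodes the hierarchy as a parent map, checks subclass relations, and serializes the classes as Turtle. The namespace `http://example.org/mema#` and the helper names are assumptions made for illustration; the actual MeMa ontology IRIs are not given in this paper.

```python
# Parent of each class in the sketched hierarchy; None marks a top-level class.
PARENT = {
    "Sign": None,
    "Category": "Sign",
    "Reference": "Sign",
    "Topic": "Sign",
    "Information": None,
    "Text": "Information",
    "Sentence": "Information",
    "Token": "Information",
    "Entity": None,
    "Agent": "Entity",
    "Location": "Entity",
    "Event": "Entity",
    "Object": "Entity",
}


def is_subclass_of(cls, ancestor):
    """Walk up the parent chain; a class counts as a subclass of itself."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT[cls]
    return False


def to_turtle(namespace="http://example.org/mema#"):  # hypothetical namespace
    """Serialize the hierarchy as OWL class declarations in Turtle syntax."""
    lines = [
        f"@prefix mema: <{namespace}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for cls, parent in PARENT.items():
        if parent is None:
            lines.append(f"mema:{cls} a owl:Class .")
        else:
            lines.append(f"mema:{cls} a owl:Class ; rdfs:subClassOf mema:{parent} .")
    return "\n".join(lines)


print(to_turtle())
```

The parent map mirrors the listing one-to-one, so any change to the listing would need only a change to the dictionary.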

A key feature of this ontology is the distinction between Reference and Token, where the latter instantiates the former 6. As a Sign, a Reference is based on an interpretation process, whether human or automated: for example, DBpedia Spotlight interpreting the string “Aristotle” as the name of the philosopher from Stagira. Sign instances support properties (interpretation records) that keep track of these processes. A Token, on the other hand, is a portion of Text, e.g. the string “Aristotle” that appears in a document at a given offset, which may trigger the processes mentioned above. In this way, the semantic qualification of the text comes with the means to trace the underlying interpretation, be it automatic or human. This is essential for ensuring the traceability and accountability of the knowledge base’s content.
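The relation between Token, Reference and interpretation records can be illustrated with a short sketch. The class and field names below (`InterpretationRecord`, `target_iri`, `find_token`) and the article identifier are hypothetical and chosen for readability; they are not taken from the MeMa ontology itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Token:
    """A portion of a text, identified by character offsets."""
    text_id: str
    start: int
    end: int
    surface: str


@dataclass(frozen=True)
class InterpretationRecord:
    """Who or what interpreted a token, and whether it was automatic."""
    token: Token
    interpreter: str
    automatic: bool


@dataclass
class Reference:
    """A sign standing for a single entity, with its interpretation history."""
    target_iri: str
    records: List[InterpretationRecord] = field(default_factory=list)

    def interpret(self, token, interpreter, automatic):
        record = InterpretationRecord(token, interpreter, automatic)
        self.records.append(record)
        return record


def find_token(text_id, text, surface):
    """Locate the first occurrence of a surface string and return it as a Token."""
    start = text.find(surface)
    if start < 0:
        raise ValueError(f"{surface!r} not found in {text_id}")
    return Token(text_id, start, start + len(surface), surface)


text = "Aristotle was born in Stagira."
token = find_token("article/42", text, "Aristotle")  # hypothetical article id
aristotle = Reference("http://dbpedia.org/resource/Aristotle")
aristotle.interpret(token, interpreter="DBpedia Spotlight", automatic=True)
print(aristotle.records[0])
```

Keeping the record separate from both the token and the reference is what allows a later human correction to be stored alongside the automatic interpretation rather than overwriting it.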

3.2. Handling Metadata

Extracting knowledge from newspaper articles essentially consists of working on both metadata and text in a consistent way. This process has so far generated about 650,000 stored articles, and the collection grows by roughly 1,000 new articles a month.

6 This aligns with Peirce’s distinction between type and token.

According to our ontology, assertions about articles are based on two types of properties, which we call editorial and semantic. The former include attributes such as publication date or author; the latter are generically intended to characterize the content, including standard categorization (sports, business, etc.), references to people, places and other named entities, and arbitrary classifiers which are typically encoded in freely invented wording. However, this distinction is neither fully aligned with the structure of the legacy metadata schemes, nor fully reflected in how metadata are actually produced. For historical and organizational reasons, in fact, the metadata of the online and print editions are produced separately, with different schemes and guidelines. Looking into it, we realized that integrating them could not be done by simply mapping schemes to our ontology, but instead required a thoughtful analysis of the actual data. We carried out qualitative and quantitative analyses which led us to devise an adequate treatment of the metadata content. Here is a summary of the historical archive scheme:

• ARGOMENTO (subject) is fed with labels with no semantic relationship amongst them. The raw count for these labels is 792,000, with 4,023 distinct values (0.51%), which comprise synonyms, typos, abbreviations, and other variants.
• CATEGORIA (category), on the other hand, is used with a prevalence of editorial tags (front page, editorial, etc.), but again we often encounter values that also belong to the ARGOMENTO field. The raw usage count for CATEGORIA is 828,805, with 1,358 different values (0.16%), which also comprise synonyms, typos, abbreviations, and other variants.
• LOCALITA (location) accommodates the editor's or archivist's description of which geopolitical entities are involved. They might not be mentioned literally in the article. We observed redundant tagging where many broader geopolitical concepts, which could be inferred, are explicitly stated somewhat arbitrarily (e.g., CUTRO, CR, Italia). Whenever we successfully link a geopolitical mention to GeoNames, this redundancy becomes unnecessary, as GeoNames allows for full hierarchical navigation.
• RIFERIMENTI (references) is used as a placeholder for a variety of annotations, which also overlap other fields. Most often, these are short summaries which should facilitate keyword-based retrieval. We currently count 949,248 occurrences of these annotations, 679,760 of which are unique (71.6%), thus qualifying by far as the most informative facet.

Overall, the frequency distribution of all these properties exhibits long tails with low frequencies, typical of a lack of annotation guidelines and tools. In particular, the RIFERIMENTI field appears to be very heterogeneous, as it mixes editorial tags (e.g. breve, cronaca), named entities and content summaries. As a result of this analysis, we decided to ignore the formal meaning (if any) of the legacy metadata schema and instead focus on the annotation content.

annotation
breve (short)
cronaca (news)
analisi (analysis)
programma (program)
scheda (form)
crisi (crisis)
scenario (scenario)
le lettere di oggi (today’s letters)
storia (history)
ritratto (portrait)
campagna elettorale (election campaign)
reazioni (reactions)
famiglia incertezza e preocupazioni (sic) (family uncertainty and worries)
oggi sciopero marcia globale per il clima (global climate march strike today)
giorgio forti, alessandro stoppoloni, christian picucci (proper names)

In particular, with respect to our ontology, we want to distinguish between classifiers (Sign) and descriptions (Information). To this end, we use:

• Two handcrafted tagsets, for editorial marks and standard topics respectively, obtained by cleaning and deduplicating the contents of ARGOMENTO, CATEGORIA and the most recurrent RIFERIMENTI
• A lemmatizer for out-of-tagset values
• A rule-based classifier for multi-word RIFERIMENTI values, which discriminates descriptions from multi-word topics

Classifiers are instantiated as either Category or Topic, and suitably linked to the article, while descriptive summaries are kept as data properties, whose content is indexed. We plan to add a vector representation of summaries to include them in semantic similarity searches and/or clustering.

3.3. Knowledge Extraction

Besides annotated metadata, MeMa analyzes the full article text. At the current stage, we only perform entity recognition and linking. There are no limits to the kinds of entities that can be mentioned in a newspaper article. However, there are limits to the kinds that can be efficiently retrieved by standard NLP pipelines. One of the richest known inventories [16] includes up to 18 categories, but in practice the available recognizers for the Italian language, e.g. spaCy [17] and Stanza [18], are limited to just a few of them, such as PER(son), LOC(ation), and ORG(anization). We currently use a combination of Stanford’s Stanza [18] (in particular the tokenize, mwt, pos, lemma, depparse, and ner processors), DBpedia Spotlight [19] and GeoNames 7, along with a number of custom processing functions. We chose Stanza because of its state-of-the-art performance on Italian benchmarks 8.

7 https://www.geonames.org/
8 Stanza’s performance on NER corpora: https://stanfordnlp.github.io/stanza/ner_models.html

We evaluated the NER performance on our sources by randomly choosing 30 articles, manually annotating their content, and matching the pipeline outcome against the annotations. The results presented in Table 2 align with the current state of the art [20].

For the PER class we also adopt a simple co-reference matching, based on the fact that within an article we mostly find a fully named instance of the person and subsequently only the first or last name. Along with the span, we therefore generate a Person co-reference ID. We then attempt to ground the mention against the DBpedia API, which we invoke via its Spotlight function. We have found no gain in precision or recall from giving it more textual context. For both the grounded and the ungrounded PERsons, we then store the span of the surface form, a fuzzy score of the match with the DBpedia entity (to accommodate typos and variations, which are especially common in the Italian rendition of foreign names), and the reference to the current article. We therefore have the spans where the surface form of the person was mentioned, and the grounded/ungrounded reference to the article in a separate collection.

A similar process is performed for the LOCation named entities against the GeoNames resource. Linking to GeoNames gives us a wealth of added information, including geolocalization and administrative and geographical data. For LOC, too, we store the spans within the article and the mentions in their dedicated collection. We also tried using DBpedia Spotlight for ORGanizations, but the results were not satisfactory. One of the causes may be the lack of precision at the NER stage. Also, there are often false positive groundings, given that several organizations share their names with other organizations or with places. We did not conduct a comprehensive analysis of the entity linking performance; however, an initial examination revealed that roughly 10% of the total links were incorrect.

Finally, the last stages of our pipeline transform the staging data into corresponding RDF data (Turtle format). We therefore generate article individuals with metadata from both the historical and the digital corpora, leveraging the reconciliation when possible, and we also generate individuals, topics and all of their cross-linked mentions. The resulting knowledge base currently comprises approximately 12.5 million triples and is loaded into Apache Jena Fuseki to be used as a SPARQL endpoint.
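The PER co-reference heuristic and the fuzzy match score described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the paper does not specify the matching rule or the fuzzy metric, so the subset-of-name-parts rule and the use of `difflib.SequenceMatcher` from the Python standard library are stand-ins, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def assign_coref_ids(mentions):
    """Assign a co-reference ID to each PER mention of one article, in order.

    A mention reuses the ID of an earlier mention when all of its name parts
    are contained in that earlier mention (e.g. a last name after a full name).
    """
    known = []  # list of (coref_id, set of lowercase name parts)
    ids = []
    for mention in mentions:
        parts = set(mention.lower().split())
        match = None
        for coref_id, known_parts in known:
            if parts <= known_parts:
                match = coref_id
                break
        if match is None:
            match = len(known)
            known.append((match, parts))
        ids.append(match)
    return ids


def fuzzy_score(surface, entity_label):
    """Similarity in [0, 1] between a surface form and a linked entity label."""
    return SequenceMatcher(None, surface.lower(), entity_label.lower()).ratio()


mentions = ["Luigi Di Maio", "Sergio Mattarella", "Di Maio", "Mattarella", "Luigi Di Maio"]
coref_ids = assign_coref_ids(mentions)
print(list(zip(mentions, coref_ids)))

# An Italian rendition of a foreign name compared with an English label.
print(fuzzy_score("Gorbaciov", "Gorbachev"))
```

A rule this simple will merge two people who share a last name within the same article; storing the score and the span, as the text describes, is what makes such cases reviewable afterwards.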

4. Challenges and Ideas

Newspaper articles pose several interpretative challenges [21]. The reporting of events, with their participants and their contextual characterization, is the most relevant part of their content. Metonymy, regular polysemy and presupposition, even combined, stand out as prominent linguistic phenomena. Take for instance the headline: “Di Maio al Colle, ma non da Mattarella” (≈ “Di Maio at the Colle, but not meeting with Mattarella”) 9. “Di Maio” and “Mattarella” can be plainly identified as person mentions and linked to their corresponding individuals (Italian politicians). But what about “Colle”? Even if it were identified as a place (the Quirinal hill in Rome), it is clear that, contextually, the token is meant to signify the institutional function of the presidency of the Italian Republic. Also, the people mentioned in the sentence represent their public roles at the time the article was written, rather than any identified human being. This kind of metonymic use of language makes the classification of named entities more difficult [22]. As for the news in question, note that apparently there is no mention of any event, but presumably something happened. Event mining is also a long-standing challenge of NLP, as is reasoning about implicature and presupposition [23]. These tasks are usually approached with ML methods [24]. In particular, supervised learning strategies have been implemented in recent years, but they are limited in that they require specific annotated corpora and training procedures [25].

9 https://ilmanifesto.it/di-maio-al-colle-ma-non-da-mattarella

The use of Large Language Models (LLMs) with generative transformers is now regarded as the way forward for the challenges mentioned above [26], so we plan to replace part of the existing NLP machinery with a suitable implementation of these techniques. At present, we are piloting publicly available services. For instance, if suitably prompted, GPT4 [27] produces the following analysis of the headline mentioned above:

1. Di Maio: This refers to a person, likely Luigi Di Maio, an Italian politician. He could be the subject of the discussion or the context in which the sentence is used.
2. at the Colle: Refers to “Colle” or “Quirinale”, which is the official residence of the President of the Italian Republic, located in Rome. Therefore, “Di Maio at the Colle” indicates that Luigi Di Maio is at or going to the Quirinale Palace.
3. Mattarella: Refers to Sergio Mattarella, who was the President of the Italian Republic as of September 2021. He might be mentioned because he is the incumbent president at the time the sentence is spoken.

Llama2 [28] gives similar results 10:

1. Di Maio: This refers to Luigi Di Maio, an Italian politician who served as the Minister of Foreign Affairs and International Cooperation from 2018 to 2019.
2. Colle: This is a reference to the Quirinal Hill, a hill in Rome, Italy, where the President of Italy resides.
3. Mattarella: This refers to Sergio Mattarella, the President of Italy from 2015 to 2022.

10 We are using the 13B parameters deployed on a virtual host

In both cases, entities are correctly identified and connected to relevant background knowledge, where their respective professional roles are also highlighted. When it comes to implicatures, GPT4 is pretty inventive:

So, the sentence could mean that Luigi Di Maio is going to or present at the Quirinale, but he is not receiving instructions or direction directly from Sergio Mattarella. It could be used in a political or governmental context to express a situation where Di Maio is acting independently of the President of the Republic.

Llama2 seems to be less imaginative:

Therefore, the entities mentioned in the phrase are two politicians (Luigi Di Maio and Sergio Mattarella) and a geographic location (Quirinal Hill)

These examples show how, using LLMs appropriately, events can also be found in nominal constructions (such as the headline in question), and their participants, along with some other contextual elements, can be reliably identified even with little superficial evidence. The generative ability of LLMs to “connect the dots” seems to be particularly effective when dealing with journalistic jargon, which is full of elliptical constructions. As for lexical units other than entities and events, framing complex notions such as not receiving instructions in a Knowledge Graph may raise ontological challenges, in this case that of representing negative facts. The “ontological cut-off” operated in the design phase, i.e. the way in which linguistic and logical (conceptual) expressiveness is arranged, plays a crucial role here. Our ontology is such that only basic patterns (e.g. participation in action) are ingested into the KG as logic assertions (i.e. triples), while blurry concepts (e.g. receiving instructions) are kept at the lexical level. Lexical concepts can be mapped to onto-lexical resources and interlinked by semantic relationships, as well as associated with distributional embeddings. In any case, the “ontological cut-off” requires dividing the KG’s reasoning into logical and linguistic inference procedures and integrating their results, which is at the core of our future developments. The current prototype does not include semantic relationships and deep linguistic inference, but we do evaluate semantic similarity based on embeddings of textual fragments (e.g. headlines and summaries), for instance when re-ranking the results of KG queries.

To improve knowledge extraction, we are in the process of experimenting with generative LLMs. It is already clear, however, that for giant models available only through remote services, such as those of the OpenAI family, the feasibility of these experiments could be problematic, since the stability of their behaviour seems to be questionable [29]. Also, the use of remote services would not comply with Il Manifesto’s digital strategy, due to unwanted bindings to external business entities. Therefore, we are focusing on the use of on-premise open LLMs, trading some functionality for dependability, freedom, control, and cost effectiveness. At the time of writing, although the use of open models such as Llama2 seems promising, we have identified some hallucinations, for example the person “Matteo Meloni”, erroneously identified as the reference for “Meloni” in the context of “governo Meloni”, who looks like a disturbing hybridization of the current Italian Prime Minister and her Deputy. How to deal with invented entities and fanciful judgments is a general concern for the productive use of these new NLP methods. Our approach will be to involve editors, archivists and readers in reviewing and amending AI results.
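One possible way to route LLM output to human reviewers, in the spirit of the review process outlined in this section, is to compare each proposed entity label with the labels already present in the knowledge base and to queue anything unknown. The sketch below is only an assumption about how such a triage could look; the `Proposal` class, the `triage` function and the toy label set are hypothetical and do not describe the MeMa implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """An entity label proposed by a language model for a surface form."""
    surface: str
    proposed_label: str


def triage(proposals, known_labels):
    """Split proposals into those matching a known label and those needing review."""
    known = {label.casefold() for label in known_labels}
    matched, needs_review = [], []
    for proposal in proposals:
        if proposal.proposed_label.casefold() in known:
            matched.append(proposal)
        else:
            needs_review.append(proposal)
    return matched, needs_review


# Toy label set standing in for labels already stored in the KG (assumption).
known_labels = {"Giorgia Meloni", "Matteo Salvini", "Luigi Di Maio"}
proposals = [
    Proposal("Meloni", "Matteo Meloni"),
    Proposal("Di Maio", "Luigi Di Maio"),
]
matched, needs_review = triage(proposals, known_labels)
print("matched:", matched)
print("needs review:", needs_review)
```

A label that matches the knowledge base can still be the wrong person for a given context, so this check only narrows what reviewers look at first; it does not replace their judgment.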

5. Conclusion

The construction of MeMa’s KG is an opportunity to discuss the state-of-the-art perspective of NLP in the context of a real Italian content production environment. The KG will be made available later this year through a SPARQL endpoint and a dataset collection. At the current stage, our experience shows the potential, but also the limits, of NLP technologies applied to a large corpus of newspaper articles spanning a long time interval and characterized by a sophisticated use of the Italian language. In general, structured knowledge extraction can be achieved with various levels of granularity by integrating NLP processors such as named entity recognizers, event recognizers and role labelers, and keyword and topic extractors. Pre-trained multilingual LLM-based generative transformers will probably replace the supervised methods that have dominated the technology of these processors over the last decade, considerably easing the task of extracting qualified semantic information. However, the new neural technologies do not seem free from errors, mainly due to the kind of inventive linguistic generation they may produce. Giving the user community the ability to “educate” AI, i.e. to monitor and correct its results, remains the main route for us. Transparent logical structures such as Knowledge Graphs offer the best support for this type of activity. How information automatically extracted from text can be conceptualized and critically scrutinized by user communities will have a profound impact on the harmonization of AI in human ecosystems.

References

Linguistics: System Demonstrations, Association for Computational Linguistics, 2020, pp. 272–277. https://www.aclweb.org/anthology/2020.acldemos.34.
[19] P. N. Mendes, M. Jakob, A. García-Silva, C. Bizer, DBpedia Spotlight: Shedding Light on the Web of Documents, in: Proceedings of the 7th International Conference on Semantic Systems, ACM, 2011, pp. 101–108. URL: https://dbpedia.org/spotlight.
[20] S. Vajjala, R. Balasubramaniam, What do we really know about state of the art NER?, in: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), European Language Resources Association (ELRA), Marseille, 2022, pp. 5983–5993. Conference held on 20-25 June 2022.
[21] T. A. van Dijk, News as Discourse, Lawrence Erlbaum Associates, 1988.
[22] K. Markert, M. Nissim, SemEval-2007 task 08: Metonymy resolution at SemEval-2007, in: Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), Association for Computational Linguistics, 2007, pp. 36–41.
[23] P. Jeretic, A. Warstadt, S. Bhooshan, A. Williams, Are natural language inference models IMPPRESsive? Learning IMPlicature and PRESupposition, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8690–8705. doi:10.18653/v1/2020.acl-main.768, https://aclanthology.org/2020.acl-main.768.
[24] Q. Li, J. Li, J. Sheng, S. Cui, J. Wu, Y. Hei, H. Peng, S. Guo, L. Wang, A. Beheshti, P. S. Yu, A survey on deep learning event extraction: Approaches and applications, IEEE Transactions on Neural Networks and Learning Systems 14 (2022) November 2022. doi:10.1109/TNNLS.2022.xxxxxxx.
[25] K. A. Mathews, M. Strube, A large harvested corpus of location metonymy, in: International Conference on Language Resources and Evaluation, 2020.
[26] S. Wang, X. Sun, X. Li, R. Ouyang, F. Wu, T. Zhang, J. Li, G. Wang, GPT-NER: Named entity recognition via large language models, 2023. arXiv:2304.10428.
[27] OpenAI, GPT-4 technical report, 2023. arXiv:2303.08774.
[28] H. Touvron, et al., Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[29] L. Chen, M. Zaharia, J. Zou, How is ChatGPT’s behavior changing over time?, 2023. arXiv:2307.09009.