<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journalistic Metamorph</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3447772</article-id>
      <title-group>
        <article-title>Developing a Software Reference Architecture for Journalistic Knowledge Platforms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marc Gallofré Ocaña</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas L. Opdahl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bergen</institution>
          ,
          <addr-line>Bergen</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>2699</volume>
      <fpage>13</fpage>
      <lpage>17</lpage>
      <abstract>
<p>For news organizations to survive and thrive in today's media landscape, they must utilize big data and artificial intelligence technologies effectively. News organizations that want to exploit techniques like machine learning and knowledge graphs for big data may manage to use them independently, but struggle to get everything to work together. A software reference architecture would help by providing a generic blueprint and capturing the tried-and-tested best practices for designing and implementing concrete solutions but, to the best of our knowledge, no suitable architecture has been proposed. This paper therefore outlines a software reference architecture for digitalization of newsrooms, along with a proof-of-concept of the architecture.</p>
      </abstract>
      <kwd-group>
<kwd>Software Reference Architecture</kwd>
        <kwd>Integrated Neural-Symbolic AI</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Newsroom</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>News organizations must constantly adapt their business models to digital media innovations to improve information quality, competitiveness and growth. This changes how journalists and readers interact with news content and background information [1]. News agencies use big data and artificial intelligence (AI) for different journalistic purposes [2] that include: identifying and contextualizing newsworthy events for investigative journalism; facilitating data visualization in digital journalism; automating news writing in robot journalism (a.k.a. algorithmic or automated journalism); and providing real-time fact-checking tools for political journalism. Journalistic Knowledge Platforms (JKPs) integrate these and related techniques into knowledge-centric big-data platforms. They are an emerging type of complex information system that integrates state-of-the-art Neural-Symbolic AI techniques [3] such as machine learning (ML), semantic knowledge representations and natural language processing (NLP) to support daily processes in newsroom workflows [4, 5]. In our research, we are focusing on JKPs that employ semantic knowledge graphs [6] for representing knowledge.</p>
      <p>It is challenging for news organizations to evolve the many independent and task-specific systems they run today into cohesive and comprehensive JKPs [5]. On the management level, central challenges are that JKPs (a) are complex systems that must balance many concerns [7] and are thus challenging to adopt without architectural guidance; (b) must interoperate with a wide variety of in-house legacy systems and external services, including other JKPs [8]; and (c) are long-term investments that must be able to evolve to incorporate future best-of-breed components that replace or come in addition to existing ones [7]. On the technical level, JKPs also need to support (a) the ingestion of real-time news items from multiple sources of unstructured and semi-structured data, which must be semantically annotated and represented (viz., lifted) and enriched in the knowledge base [9]; (b) the production of potentially newsworthy events which are continuously pushed to journalists [10]; (c) the different services for pulling information from the knowledge base [11]; and (d) mechanisms for continuously evolving and adapting machine learning models and ontologies for curating and enriching the knowledge base [12].</p>
      <p>A software reference architecture (SRA) “is a generic architecture for a class of systems that is used as a foundation for the design of concrete architectures from this class” [13]. It defines the basic software elements and data flows that implement the functionalities of, and captures the best practices for designing and implementing, complex information systems like JKPs. We can distinguish two types of SRA: practice-driven and research-driven [14]. Practice-driven SRAs are based on practical experience developing concrete architectures in a domain; they describe best practices and address legacy problems. Research-driven SRAs are designed for a class of systems for which there is little development experience yet but which are expected to become relevant in the future; they are based on related research. To a news organization, an SRA would bring a blueprint along with advice for how to evolve its current news production systems into an integrated JKP.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Related Work</title>
      <sec id="sec-2-1">
        <title>3.1. Journalistic Knowledge Platforms</title>
        <p>Researchers have proposed several software
architectures to deal with big data [15]. But most of them did not
consider mechanisms for semantically representing and
enriching heterogeneous data, continuously pushing live
streams of data to the users, curating knowledge bases,
and maintaining machine learning models. Hence, none
of them are suitable for JKPs, or even for being adapted
to the JKP domain. Therefore, in this paper, we address
the question: “What would be a good software reference
architecture for journalistic knowledge platforms?”. To
address this research question, we followed a design
science approach [16] to develop a research-driven SRA for
JKPs along with a proof-of-concept prototype of a JKP
platform. To guide development we identified
opportunities and challenges for JKPs from the literature [5], relied
on our experience from developing previous JKP
prototypes [17, 18], and collaborated with a software developer
for the international newsroom market [17].</p>
<p>The rest of the paper is organized as follows: Section 2 summarises the research method; Section 3 analyses the related literature; Section 4 outlines the concerns for an SRA; Section 5 introduces the SRA for JKPs; Section 6 describes the implementation of the proposed SRA; and Section 7 states our conclusions and further work.</p>
        <p>Previous work has adapted the Lambda processing architecture [28] to RDF1 for gathering, storing and serving big data in real time. Bolster [15] extended the Lambda architecture by adding a new semantic layer to represent machine-readable metadata, in contrast to JKPs, which represent the data itself semantically. However, none of them cover mechanisms for enriching semantic data, continuously pushing live streams of data to users, and keeping machine learning models up-to-date. Pääkkönen and Pakkala [29] proposed an SRA for machine learning in edge computing environments that considered updating and maintaining machine learning models, but did not consider semantic technologies.</p>
        <p>The Lambda architecture has been criticised for its design: data and code are duplicated in two layers, the speed layer and the batch layer, and data requests need to be coordinated between both. This increases development, implementation and maintenance efforts, as well as hardware demands [30, 31]. The Kappa processing architecture was therefore proposed [30]: it removes the batch layer, deals only with real-time computation, and provides a single data view that changes only when the code is updated and the old view is recomputed. However, batch jobs are not clearly defined and, if needed, they have to reprocess the current data view [31]. To overcome this challenge, Cerezo et al. [31] proposed the Phi processing architecture, which is inspired by Lambda but delays the data stream replication until after the real-time computation is done and provides a single data view.</p>
        <p>JKPs also raise further concerns: C5: they must provide different services for pulling information from the knowledge base [11]; C6: must continuously evolve machine learning models for enriching the knowledge base; C7: are knowledge-centric systems that provide knowledge representations of news, events and background information from the real world, which is constantly changing [12]; C8: contain different databases for specific purposes like multimedia files, historical archives and real-time news feeds [7]; C9: must facilitate schema evolution [12]; C10: need to consider privacy, provenance, terms-of-use and data quality [8]; C11: support news production, where time is a critical factor and delays can lower the value of information [8]; and C12: must consider big data properties such as data heterogeneity, volume and velocity [10]. In addition, a good SRA must satisfy the usual requirements of being feasible, representative, essential, easy to grasp, long lasting, and technology independent [13, 33]. We return to these concerns in Section 5.</p>
        <p>To the best of our knowledge, no suitable SRA for JKPs has been proposed. Our proposed SRA for JKPs is inspired by the Phi processing architecture to overcome the challenges of the previous big-data architectures. It focuses on representing and enriching data semantically and on keeping machine learning models up-to-date, along with addressing the identified domain-specific concerns of JKPs.</p>
        <sec id="sec-2-1-1">
<title>5. Software Reference Architecture for JKPs</title>
<p>The proposed SRA for Journalistic Knowledge Platforms (Fig. 1) is organized as five core services: the Ingestor, Knowledge Base, Curator, Feeder and Retriever. Each service comprises several micro-services, each designed as an independent component with a clear API that facilitates its replacement and integration, making it easy to scale and distribute [34] (C1, C2). Solutions like Docker2 can be used to improve the availability and replaceability of micro-services.</p>
          <p>These decisions respond to the main concerns we identified from previous studies [5, 32] and the analysis of similar systems [8, 7, 9, 12, 11, 10]: C1: JKPs must interoperate with heterogeneous in-house legacy systems, external services, and other JKPs [8]; C2: must be able to incorporate future components that replace existing ones [7]; C3: must ingest real-time news items from multiple heterogeneous sources that must be semantically lifted [9]; and C4: must produce potentially newsworthy events which are continuously pushed to journalists [10].</p>
          <p>The SRA data flow and processing steps are inspired by the Phi architecture [31], which is designed for big data and delays the downstream processing as much as possible until after the real-time processing (C11, C12). The real-time processing step applies the data transformations once and for all near the sources, providing and combining both knowledge representations and unstructured data. Hence, the SRA facilitates the integration of neuro-symbolic AI by provisioning both types of data from the beginning.</p>
          <p>To efficiently process data streams, JKPs must exploit concurrent and parallel processing.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>1www.w3.org/TR/rdf-schema</title>
        </sec>
        <sec id="sec-2-1-3">
          <title>2www.docker.com</title>
<p>For example, JKPs can integrate solutions like Apache Spark3 for batch and Apache Storm4 for real-time processing. To guarantee message distribution across the different services, JKPs can employ messaging systems like Apache Kafka5, and serialize the messages using JSON-LD6, an extension of JSON for representing linked data. As a result, the Ingestor applies the real-time transformations once, near the source, before the results are stored in the Knowledge Base and further processed by the Feeder and Curator.</p>
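<p>As an illustration of this choice, the following minimal sketch serializes a lifted news item as JSON-LD before publishing it to a broker. The @context, topic name and item fields here are our own illustrative assumptions, not prescribed by the SRA:</p>

```python
import json

# A minimal JSON-LD representation of a lifted news item.
# The @context and property names are illustrative, not part of the SRA.
news_item = {
    "@context": {"schema": "http://schema.org/"},
    "@id": "http://example.org/news/42",
    "@type": "schema:NewsArticle",
    "schema:headline": "Flooding reported in Bergen",
    "schema:datePublished": "2021-03-30T12:00:00Z",
}

# Serialize once near the source; downstream services (Feeder, Curator)
# consume the same message from the broker.
message = json.dumps(news_item).encode("utf-8")

# With the kafka-python client (an assumed dependency), publishing to a
# hypothetical "news-items" topic would then look like:
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="localhost:9092")
# producer.send("news-items", message)
```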
        </sec>
      </sec>
      <sec id="sec-2-2">
<title>5.1. Ingestor</title>
        <p>The Ingestor is an Extraction-Transformation-Loading (ETL) service in which a Harvester continuously downloads and ingests scheduled and real-time news, social media messages and multimedia items from sources like RSS feeds, APIs and websites (C1, C2, C3). Additional services such as the Translator and Filterer pre-process and clean the downloaded items, for example translating text into a canonical language, normalizing data types, standardizing formats, and filtering advertisements. Lifters then annotate and transform the news-related items into knowledge-graph representations in real time (C7) before they are uploaded to the Knowledge Base. These knowledge graphs are represented in RDF following predefined ontologies like the Event Description Ontology [35] (C3), made general to facilitate schema evolution (C9) and data exchange (C1).</p>
        <p>The Lifter is composed of AI modules, which we designed to be replaced or extended to follow the state of the art [36] (C2). To combine the results from the different AI modules, these can use the NIF [37] or NAF [38] vocabularies to standardize their annotations. Every annotation must provide quality information (e.g., accuracy and support values) and provenance to trace the annotations back to the process that generated them (C1).</p>
        <sec id="sec-2-2-0">
          <title>5.2. Knowledge Base</title>
          <p>The Knowledge Base provides persistent storage for source items and knowledge representations. It is composed of multiple instances of dedicated databases for multimedia and text files, such as the Source Texts service. It contains the Knowledge Graph for representing news-related knowledge (C8), which we designed to provide a single data view that constantly changes to capture real-world evolution and avoid unnecessary delays (C9, C11). The Knowledge Base also provides a middleware to interact with the different repositories and integrate legacy archives (C1).</p>
          <p>These storage services must be prepared to handle large volumes of data and intensive write operations in real time (C12). For example, distributed databases like Apache HBase7 and Cassandra8 can store large data volumes. Although many of the open-source graph databases and triple stores with support for RDF and SPARQL are not distributed, some of them can hold more than one billion (10⁹) triples [39] (e.g., Blazegraph9 and Jena TDB10). Strategies to manage graph database size include: (a) partitioning the graph database according to resource types/predicates, themes or geolocations, or a combination of these; and (b) reducing the data stored in the Knowledge Graph by keeping only the knowledge representations and storing the textual and multimedia information in dedicated databases.</p>
        </sec>
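<p>Strategy (b) can be sketched with plain Python stand-ins for the two stores; all names and predicates below are illustrative assumptions, not prescribed by the SRA:</p>

```python
# Illustrative stand-ins: a triple store holding only compact knowledge
# representations, and a dedicated text store keyed by the item's IRI.
triples = set()       # (subject, predicate, object) knowledge representations
source_texts = {}     # IRI -> full article text, kept out of the graph

def store_item(iri: str, headline: str, body: str) -> None:
    """Keep only compact representations in the graph; offload bulky text."""
    triples.add((iri, "rdf:type", "ex:NewsItem"))   # "ex:" is a made-up prefix
    triples.add((iri, "ex:headline", headline))
    source_texts[iri] = body  # stored in the dedicated database instead

store_item("http://example.org/news/42",
           "Flooding reported in Bergen",
           "Long article body that would otherwise bloat the Knowledge Graph...")
```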
        <p>3spark.apache.org
4storm.apache.org
5kafka.apache.org
6www.w3.org/TR/json-ld11</p>
        <sec id="sec-2-2-1">
          <title>7hbase.apache.org 8cassandra.apache.org 9www.blazegraph.com 10jena.apache.org</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>5.3. Curator</title>
<p>The Curator provides a diverse set of functionalities, comprising services with different behaviors. It improves the data in the Knowledge Graph, produces new representations for different journalistic purposes (C7), and updates AI/ML models (C6). For example, the Enricher enhances the Knowledge Graph representations using external information from the LOD cloud (e.g., Wikidata); the Model Updater keeps AI/ML models up-to-date and re-trains them; the Event Detector and the Network Analyzer use information from the Knowledge Base to identify events and networks of actors; and the Licensing and Privacy Managers propagate prohibitions, permissions, and obligations to flag and rectify violations (C10).</p>
        <p>Curator services can be analyzed from different perspectives: what triggers the service, what information is used, and the main information flow. Curator services such as the Enricher and the Event Detector continuously ingest lifted items from the Knowledge Base, and can use information from external and internal sources, respectively, to generate new knowledge. Other services, such as the Licensing and Privacy Managers, output alerts that need to be addressed by the user. Others, such as the Model Updater and the Network Analyzer, are triggered periodically. For example, the Model Updater can be triggered automatically hourly or even more frequently to adapt real-time models to current events using the most recent data, and it can also be triggered daily or even weekly to re-train models to follow current developments. The Model Updater can access any data in the Knowledge Base, having access to both semantic representations and raw sources. Since every semantic representation must reference the source of its annotations, this facilitates the generation of training data.</p>
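<p>The periodic triggering of the Model Updater described above can be sketched as follows; the intervals and the adapt/re-train stand-ins are hypothetical, not part of the SRA:</p>

```python
import time

# Hypothetical stand-ins for the Model Updater's two modes of operation.
def adapt_model():      # lightweight: adapt real-time models to current events
    return "adapted"

def retrain_model():    # heavyweight: full re-training on recent developments
    return "retrained"

ADAPT_EVERY = 60 * 60           # e.g. hourly adaptation
RETRAIN_EVERY = 24 * 60 * 60    # e.g. daily re-training

def due_jobs(last_adapt: float, last_retrain: float, now: float):
    """Return which Model Updater jobs are due at time `now`."""
    jobs = []
    if now - last_adapt >= ADAPT_EVERY:
        jobs.append(adapt_model)
    if now - last_retrain >= RETRAIN_EVERY:
        jobs.append(retrain_model)
    return jobs
```

<p>In a deployment, a scheduler (e.g., a cron-like service in its own container) would call such a check and run the due jobs against the Knowledge Base.</p>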
      </sec>
      <sec id="sec-2-4">
        <title>5.4. Feeder</title>
<p>The Feeder continuously monitors the streams of linked data fragments coming from the Ingestor to push real-time news information to the user (C4, C11). For example, it provides news feeds to users, suggests emerging stories, identifies trends, and allows journalists to follow current stories or developments.</p>
      </sec>
      <sec id="sec-2-5">
        <title>5.5. Retriever</title>
<p>The Retriever lets users pull information from the JKP on demand. It provides a front-end to access, visualize and explore the Knowledge Base and its services (C5). For example, the Retriever can provide an endpoint with pre-packaged SPARQL queries for particular purposes, like finding news stories related to a particular person or retrieving relevant background information for a given news event.</p>
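<p>As a sketch of such a pre-packaged query, the Retriever might keep one SPARQL template per purpose and fill it on demand. The prefixes and predicates below are illustrative assumptions; a concrete JKP would use its own ontology terms:</p>

```python
# A pre-packaged SPARQL template for "news stories mentioning a person".
# Double braces escape literal SPARQL braces in str.format().
PERSON_NEWS_QUERY = """
PREFIX schema: <http://schema.org/>
SELECT ?story ?headline WHERE {{
  ?story a schema:NewsArticle ;
         schema:headline ?headline ;
         schema:mentions <{person_iri}> .
}} LIMIT {limit}
"""

def build_person_query(person_iri: str, limit: int = 10) -> str:
    """Fill the template; the Retriever would send this to a SPARQL endpoint."""
    return PERSON_NEWS_QUERY.format(person_iri=person_iri, limit=limit)
```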
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Prototype</title>
      <sec id="sec-3-1">
<p>We are developing a prototype that instantiates the SRA for JKPs (Fig. 2). The current version implements
central parts of the Ingestor (Harvester and Lifter),
Knowledge Base (Source Text and Knowledge Graph),
Curator (Enricher and Event Detector) and Retriever (Pull
API) services. We used Docker Swarm for orchestrating
and replicating the services. The platform runs on 17
cloud instances (with a total of 38 vCPUs, 152 GB RAM and
20TB disk). The messages are serialized using JSON-LD
and passed between services using Kafka. Previous
prototypes running on a simpler infrastructure [17] have
already implemented additional parts of the Feeder,
Curator and Retriever, which we plan to adapt into the
current prototype.</p>
        <p>We designed the Ingestor with services for Harvesting
and Lifting. Our Harvester crawls news-related websites
and harvests RSS feeds, Twitter accounts, NewsAPI11
and GDELT12. The Twitter API provides real-time tweet
streams from specific accounts, geographical areas or
topics. NewsAPI aggregates and provides streams of
news articles from over 50000 news sources and blogs.</p>
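<p>A Harvester's RSS ingestion can be sketched with the standard library alone; the feed fragment and field choices below are illustrative, and a real Harvester would poll live feed URLs on a schedule:</p>

```python
import xml.etree.ElementTree as ET

# A toy RSS 2.0 fragment standing in for a live feed.
RSS = """<rss version="2.0"><channel>
  <item><title>Flooding reported in Bergen</title>
    <link>http://example.org/news/42</link></item>
  <item><title>Election results announced</title>
    <link>http://example.org/news/43</link></item>
</channel></rss>"""

def harvest(rss_text: str):
    """Extract (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]
```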
        <p>GDELT provides semi-structured information about
conflict events from news all around the world, that have
been automatically translated to English from 65 different
languages. Our Lifter [36] transforms the news and event
streams in real time into semantic knowledge
representations following the Event Description Ontology [35]. It
employs out-of-the-box NLP systems such as DBpedia
Spotlight13 and SpaCy14 for semantically annotating
textual items and linking them to Wikidata and DBpedia.</p>
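<p>A minimal sketch of an annotation record as the Lifter might emit it, loosely following NIF's offset-based model; the field names, example IRI and confidence value are our illustrative assumptions:</p>

```python
# An entity annotation: text offsets, the linked resource, plus the
# confidence and provenance information the SRA requires.
def annotate(text: str, surface: str, resource: str, tool: str, confidence: float):
    """Locate `surface` in `text` and build an offset-based annotation."""
    begin = text.index(surface)
    return {
        "beginIndex": begin,
        "endIndex": begin + len(surface),
        "taIdentRef": resource,       # linked entity, e.g. a Wikidata IRI
        "confidence": confidence,     # quality information
        "provenance": tool,           # which AI module produced the annotation
    }

ann = annotate("Flooding hits Bergen today", "Bergen",
               "http://www.wikidata.org/entity/Q26793", "DBpedia Spotlight", 0.92)
```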
<p>To integrate their outputs, we use NIF because it is the only vocabulary based on RDF and OWL for describing NLP annotations. The Knowledge Base includes a Source Texts service implemented with Apache Cassandra and a Knowledge Graph implemented with Blazegraph. Cassandra is used to store the textual information together with the IRIs of the news items represented in the Knowledge Graph. This decision allowed us to reduce the data stored in the Knowledge Graph; provide provenance by tracking news representations back to their sources; and facilitate new training material for ML models based on the current state of our system. The Curator implements an Enricher and an Event Detector. Our Enricher extends the lifted items with location-related background information extracted from DBpedia. Our Event Detector provides journalists with real-time events detected from GDELT streams. The Retriever exposes an API for pulling lifted news items from the Knowledge Base for a research challenge task15, allowing external users to interact with our system.</p>
        <p>11newsapi.org 12www.gdeltproject.org 13www.dbpedia-spotlight.org 14spacy.io</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Conclusion</title>
<p>We have proposed an SRA for JKPs to support the adoption of JKPs in newsrooms. Our architectural decisions were based on reported experiences with existing platforms, supported by our own experience developing a JKP in collaboration with industry. The SRA supports implementing, maintaining, and evolving JKPs. It represents those components and functionalities that are essential for JKPs and provides a vocabulary to compare and understand different JKP realisations. We designed the SRA to be technology independent, open-ended and long-lasting, with components and services that can evolve or be replaced and integrated with legacy systems. The SRA considers the integration of AI components and the evolution of ML models, and provides both knowledge representations and unstructured data together to facilitate the integration of neuro-symbolic AI. We have also presented a proof-of-concept implementation of a JKP that instantiates the proposed SRA.</p>
      <table-wrap id="table-1">
        <label>Table 1</label>
        <table>
          <thead><tr><th>Architecture pattern</th><th>Concern</th></tr></thead>
          <tbody>
            <tr><td>Micro-services</td><td>C1, C2</td></tr>
            <tr><td>Phi architecture</td><td>C11, C12</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="table-2">
        <label>Table 2</label>
        <caption><p>Mapping between core services and concerns</p></caption>
        <table>
          <thead><tr><th>Service</th><th>Concern</th></tr></thead>
          <tbody>
            <tr><td>Ingestor</td><td>C1, C2, C3, C7, C9</td></tr>
            <tr><td>Knowledge Base</td><td>C1, C8, C9, C11, C12</td></tr>
            <tr><td>Curator</td><td>C6, C7, C10</td></tr>
            <tr><td>Feeder</td><td>C4, C11</td></tr>
            <tr><td>Retriever</td><td>C5</td></tr>
          </tbody>
        </table>
      </table-wrap>
<p>Because JKPs handle big data, they need a processing architecture that minimizes delays and maximizes information flow. Thus, the SRA is inspired by the Phi architecture and provides a single data view through the Knowledge Base. This reduces unnecessary data duplication, does not require coordination between different data views, and makes information available as soon as it arrives in the Knowledge Base. To reduce delays further, the SRA also includes a component for pushing streams of newsworthy events to journalists.</p>
      <p>In Section 5 we have discussed how the proposed SRA implements the identified main concerns. Tables 1 and 2 summarize the mapping between the discussed concerns and the architecture patterns and core services.</p>
    </sec>
    <sec id="sec-5">
      <title>8. Further work</title>
      <p>We plan to evaluate the correctness and the utility of the proposed SRA, this being one of the main challenges of designing an SRA [40, 41]. Hence, we plan to map existing JKPs from the literature onto our proposed SRA to verify that it covers the essential components of JKPs and to evaluate its correctness. We also want to conduct a qualitative evaluation of the SRA with component developers to validate its feasibility, understandability and utility. This qualitative evaluation would be composed of questionnaires inspired by the ISO/IEC 25000 standard and a series of interviews. Besides, we plan to extend our prototype to facilitate empirical evaluation of the SRA instantiation. Possible empirical paths include:</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>