<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journalistic Metamorph</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3447772</article-id>
      <title-group>
        <article-title>Developing a Software Reference Architecture for Journalistic Knowledge Platforms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marc Gallofré Ocaña</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas L. Opdahl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bergen</institution>
          ,
          <addr-line>Bergen</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>2699</volume>
      <fpage>13</fpage>
      <lpage>17</lpage>
      <abstract>
<p>For news organizations to survive and thrive in today's media landscape, they must utilize big data and artificial intelligence technologies effectively. News organizations that want to exploit techniques like machine learning and knowledge graphs for big data may manage to use them independently, but struggle to get everything to work together. A software reference architecture would help by providing a generic blueprint and capturing the tried-and-tested best practices for designing and implementing concrete solutions but, to the best of our knowledge, no suitable architecture has been proposed. This paper therefore outlines a software reference architecture for digitalization of newsrooms, along with a proof-of-concept of the architecture.</p>
      </abstract>
      <kwd-group>
<kwd>Software Reference Architecture</kwd>
        <kwd>Integrated Neural-Symbolic AI</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Newsroom</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>News organizations must constantly adapt their business models to digital media innovations to improve information quality, competitiveness and growth. This changes how journalists and readers interact with news content and background information [1]. News agencies use big data and artificial intelligence (AI) for different journalistic purposes [2] that include: identifying and contextualizing newsworthy events for investigative journalism; facilitating data visualization in digital journalism; automating news writing in robot journalism (a.k.a. algorithmic or automated journalism); and providing real-time fact-checking tools for political journalism. Journalistic Knowledge Platforms (JKPs) integrate these and related techniques into knowledge-centric big-data platforms. They are an emerging type of complex information system that integrates state-of-the-art Neural-Symbolic AI techniques [3] such as machine learning (ML), semantic knowledge representations and natural language processing (NLP) to support daily processes in newsroom workflows [4, 5]. In our research, we are focusing on JKPs that employ semantic knowledge graphs [6] for representing knowledge.</p>
      <p>It is challenging for news organizations to evolve the many independent and task-specific systems they run today into cohesive and comprehensive JKPs [5]. On the management level, central challenges are that JKPs (a) are complex systems that must balance many concerns [7] and are thus challenging to adopt without architectural guidance; (b) must interoperate with a wide variety of in-house legacy systems and external services, including other JKPs [8]; and (c) are long-term investments that must be able to evolve to incorporate future best-of-breed components that replace or come in addition to existing ones [7]. On the technical level, JKPs also need to support (a) the ingestion of real-time news items from multiple sources of unstructured and semi-structured data, which must be semantically annotated and represented (viz., lifted) and enriched in the knowledge base [9]; (b) the production of potentially newsworthy events which are continuously pushed to journalists [10]; (c) the different services for pulling information from the knowledge base [11]; and (d) mechanisms for continuously evolving and adapting machine learning models and ontologies for curating and enriching the knowledge base [12].</p>
      <p>A software reference architecture (SRA) “is a generic architecture for a class of systems that is used as a foundation for the design of concrete architectures from this class” [13]. It defines the basic software elements and data flows that implement the functionalities of, and captures the best practices for designing and implementing, complex information systems like JKPs. We can distinguish two types of SRA: practice-driven and research-driven [14]. Practice-driven SRAs are based on practical experience developing concrete architectures in a domain; they describe best practices and address legacy problems. Research-driven SRAs are designed for a class of systems for which there is little development experience yet but which are expected to become relevant in the future; they are based on related research. To a news organization, an SRA would bring a blueprint along with advice for how to evolve its current news production systems into an integrated JKP.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Related Work</title>
      <sec id="sec-2-1">
        <title>3.1. Journalistic Knowledge Platforms</title>
        <p>Researchers have proposed several software
architectures to deal with big data [15]. But most of them did not
consider mechanisms for semantically representing and
enriching heterogeneous data, continuously pushing live
streams of data to the users, curating knowledge bases,
and maintaining machine learning models. Hence, none
of them are suitable for JKPs, or even for being adapted
to the JKP domain. Therefore, in this paper, we address
the question: “What would be a good software reference
architecture for journalistic knowledge platforms?”. To
address this research question, we followed a design
science approach [16] to develop a research-driven SRA for
JKPs along with a proof-of-concept prototype of a JKP
platform. To guide development we identified
opportunities and challenges for JKPs from the literature [5], relied
on our experience from developing previous JKP
prototypes [17, 18], and collaborated with a software developer
for the international newsroom market [17].</p>
<p>The rest of the paper is organized as follows: Section 2 summarises the research method; Section 3 analyses the related literature; Section 4 outlines the concerns for an SRA; Section 5 introduces the SRA for JKPs; Section 6 describes the implementation of the proposed SRA; and Section 7 states our conclusions and further work.</p>
        <p>Previous work has adapted the Lambda processing architecture [28] to RDF1 for gathering, storing and serving big data in real time. Bolster [15] extended the Lambda architecture by adding a new semantic layer to represent machine-readable metadata, in contrast to JKPs, which represent the data itself semantically. However, none of them cover mechanisms for enriching semantic data, continuously pushing live streams of data to users, and keeping machine learning models up-to-date. Pääkkönen and Pakkala [29] proposed an SRA for machine learning in edge computing environments that considered updating and maintaining machine learning models, but did not consider semantic technologies.</p>
        <p>The Lambda architecture has been criticised for its design: data and code are duplicated in two layers, the speed layer and the batch layer, and data requests need to be coordinated between both. This increases development, implementation and maintenance efforts, as well as hardware demands [30, 31]. The Kappa processing architecture was therefore proposed [30]: it removes the batch layer, deals only with real-time computation, and provides a single data view that changes only when the code is updated and the old view is recomputed. However, batch jobs are not clearly defined and, if needed, they have to reprocess the current data view [31]. To overcome this challenge, Cerezo et al. [31] proposed the Phi processing architecture, which is inspired by Lambda but delays the data stream replication until after the real-time computation is done and provides a single data view.</p>
        <p>JKPs also raise further concerns: C5: they must provide different services for pulling information from the knowledge base [11]; C6: must continuously evolve machine learning models for enriching the knowledge base; C7: are knowledge-centric systems that provide knowledge representations of news, events and background information from the real world, which is constantly changing [12]; C8: contain different databases for specific purposes like multimedia files, historical archives and real-time news feeds [7]; C9: must facilitate schema evolution [12]; C10: need to consider privacy, provenance, terms-of-use and data quality [8]; C11: support news production, where time is a critical factor and delays can lower the value of information [8]; and C12: must consider big data properties such as data heterogeneity, volume and velocity [10]. In addition, a good SRA must satisfy the usual requirements of being feasible, representative, essential, easy to grasp, long lasting, and technology independent [13, 33]. We return to these concerns in Section 5.</p>
        <p>To the best of our knowledge, no suitable SRA for JKPs has been proposed. Our proposed SRA for JKPs is inspired by the Phi processing architecture to overcome the challenges of the previous big-data architectures. It focuses on representing and enriching data semantically and on keeping machine learning models up-to-date, along with addressing the identified domain-specific concerns of JKPs.</p>
        <sec id="sec-2-1-1">
<title>5. Software Reference Architecture for JKPs</title>
<p>The proposed SRA for Journalistic Knowledge Platforms (Fig. 1) is organized as five core services: the Ingestor, Knowledge Base, Curator, Feeder and Retriever. Each service comprises several micro-services, each designed as an independent component with a clear API that facilitates its replacement and integration, making it easy to scale and distribute [34] (C1, C2). Solutions like Docker2 can be used to improve the availability and replaceability of micro-services.</p>
          <p>These decisions respond to the main concerns we identified from previous studies [5, 32] and the analysis of similar systems [8, 7, 9, 12, 11, 10]: C1: JKPs must interoperate with heterogeneous in-house legacy systems, external services, and other JKPs [8]; C2: must be able to incorporate future components that replace existing ones [7]; C3: must ingest real-time news items from multiple heterogeneous sources that must be semantically lifted [9]; and C4: must produce potentially newsworthy events which are continuously pushed to journalists [10].</p>
          <p>The SRA data flow and processing steps are inspired by the Phi architecture [31], which is designed for big data and delays the downstream processing as much as possible until after the real-time processing (C11, C12). The real-time processing step applies the data transformations once and for all near the sources, providing and combining both knowledge representations and unstructured data. Hence, the SRA facilitates the integration of neuro-symbolic AI by provisioning both types of data from the beginning.</p>
          <p>To efficiently process data streams, JKPs must exploit concurrent and parallel processing.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>1www.w3.org/TR/rdf-schema</title>
        </sec>
        <sec id="sec-2-1-3">
          <title>2www.docker.com</title>
<p>For example, JKPs can integrate solutions like Apache Spark3 for batch and Apache Storm4 for real-time processing. To guarantee message distribution across the different services, JKPs can employ messaging systems like Apache Kafka5, and serialize the messages using JSON-LD6, an extension of JSON for representing linked data. As a result, the Ingestor applies the real-time transformations once, near the source, before the results are stored in the Knowledge Base and further processed by the Feeder and Curator.</p>
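<p>As an illustration of this choice, the following minimal sketch serializes a lifted news item as JSON-LD before publishing it to a broker. The @context, topic name and item fields here are our own illustrative assumptions, not prescribed by the SRA:</p>

```python
import json

# A minimal JSON-LD representation of a lifted news item.
# The @context and property names are illustrative, not part of the SRA.
news_item = {
    "@context": {"schema": "http://schema.org/"},
    "@id": "http://example.org/news/42",
    "@type": "schema:NewsArticle",
    "schema:headline": "Flooding reported in Bergen",
    "schema:datePublished": "2021-03-30T12:00:00Z",
}

# Serialize once near the source; downstream services (Feeder, Curator)
# consume the same message from the broker.
message = json.dumps(news_item).encode("utf-8")

# With the kafka-python client (an assumed dependency), publishing to a
# hypothetical "news-items" topic would then look like:
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="localhost:9092")
# producer.send("news-items", message)
```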
        </sec>
      </sec>
      <sec id="sec-2-2">
<title>5.1. Ingestor</title>
        <p>The Ingestor is an Extraction-Transformation-Loading (ETL) service in which a Harvester continuously downloads and ingests scheduled and real-time news, social media messages and multimedia items from sources like RSS feeds, APIs and websites (C1, C2, C3). Additional services such as the Translator and Filterer pre-process and clean the downloaded items, for example translating text into a canonical language, normalizing data types, standardizing formats, and filtering advertisements. Lifters then annotate and transform the news-related items into knowledge-graph representations in real time (C7) before they are uploaded to the Knowledge Base. These knowledge graphs are represented in RDF following predefined ontologies like the Event Description Ontology [35] (C3), made general to facilitate schema evolution (C9) and data exchange (C1).</p>
        <p>The Lifter is composed of AI modules, which we designed to be replaced or extended to follow the state of the art [36] (C2). To combine the results from the different AI modules, these can use the NIF [37] or NAF [38] vocabularies to standardize their annotations. Every annotation must provide quality information (e.g., accuracy and support values) and provenance to trace the annotations back to the process that generated them (C1).</p>
        <sec id="sec-2-2-0">
          <title>5.2. Knowledge Base</title>
          <p>The Knowledge Base provides persistent storage for source items and knowledge representations. It is composed of multiple instances of dedicated databases for multimedia and text files, such as the Source Texts service. It contains the Knowledge Graph for representing news-related knowledge (C8), which we designed to provide a single data view that constantly changes to capture real-world evolution and avoid unnecessary delays (C9, C11). The Knowledge Base also provides a middleware to interact with the different repositories and integrate legacy archives (C1).</p>
          <p>These storage services must be prepared to handle large volumes of data and intensive write operations in real time (C12). For example, distributed databases like Apache HBase7 and Cassandra8 can store large data volumes. Although many of the open-source graph databases and triple stores with support for RDF and SPARQL are not distributed, some of them can hold more than one billion (10⁹) triples [39] (e.g., Blazegraph9 and Jena TDB10). Strategies to manage graph database size include: (a) partitioning the graph database according to resource types/predicates, themes or geolocations, or a combination of these; and (b) reducing the data stored in the Knowledge Graph by keeping only the knowledge representations and storing the textual and multimedia information in dedicated databases.</p>
        </sec>
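<p>Strategy (b) can be sketched with plain Python stand-ins for the two stores; all names and predicates below are illustrative assumptions, not prescribed by the SRA:</p>

```python
# Illustrative stand-ins: a triple store holding only compact knowledge
# representations, and a dedicated text store keyed by the item's IRI.
triples = set()       # (subject, predicate, object) knowledge representations
source_texts = {}     # IRI -> full article text, kept out of the graph

def store_item(iri: str, headline: str, body: str) -> None:
    """Keep only compact representations in the graph; offload bulky text."""
    triples.add((iri, "rdf:type", "ex:NewsItem"))   # "ex:" is a made-up prefix
    triples.add((iri, "ex:headline", headline))
    source_texts[iri] = body  # stored in the dedicated database instead

store_item("http://example.org/news/42",
           "Flooding reported in Bergen",
           "Long article body that would otherwise bloat the Knowledge Graph...")
```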
        <p>3spark.apache.org
4storm.apache.org
5kafka.apache.org
6www.w3.org/TR/json-ld11</p>
        <sec id="sec-2-2-1">
          <title>7hbase.apache.org 8cassandra.apache.org 9www.blazegraph.com 10jena.apache.org</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>5.3. Curator</title>
<p>The Curator provides a diverse set of functionalities, comprising services with different behaviors. It improves the data in the Knowledge Graph, produces new representations for different journalistic purposes (C7), and updates AI/ML models (C6). For example, the Enricher enhances the Knowledge Graph representations using external information from the LOD cloud (e.g., Wikidata); the Model Updater keeps AI/ML models up-to-date and re-trains them; the Event Detector and the Network Analyzer use information from the Knowledge Base to identify events and networks of actors; and the Licensing and Privacy Managers propagate prohibitions, permissions, and obligations to flag and rectify violations (C10).</p>
        <p>Curator services can be analyzed from different perspectives: what triggers the service, what information is used, and the main information flow. Curator services such as the Enricher and the Event Detector continuously ingest lifted items from the Knowledge Base, and can use information from external and internal sources, respectively, to generate new knowledge. Other services, such as the Licensing and Privacy Managers, output alerts that need to be addressed by the user. Others, such as the Model Updater and the Network Analyzer, are triggered periodically. For example, the Model Updater can be triggered automatically hourly or even more frequently to adapt real-time models to current events using the most recent data, and it can also be triggered daily or even weekly to re-train models to follow current developments. The Model Updater can access any data in the Knowledge Base, having access to both semantic representations and raw sources. Since every semantic representation must reference the source of its annotations, this facilitates the generation of training data.</p>
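<p>The periodic triggering of the Model Updater described above can be sketched as follows; the intervals and the adapt/re-train stand-ins are hypothetical, not part of the SRA:</p>

```python
import time

# Hypothetical stand-ins for the Model Updater's two modes of operation.
def adapt_model():      # lightweight: adapt real-time models to current events
    return "adapted"

def retrain_model():    # heavyweight: full re-training on recent developments
    return "retrained"

ADAPT_EVERY = 60 * 60           # e.g. hourly adaptation
RETRAIN_EVERY = 24 * 60 * 60    # e.g. daily re-training

def due_jobs(last_adapt: float, last_retrain: float, now: float):
    """Return which Model Updater jobs are due at time `now`."""
    jobs = []
    if now - last_adapt >= ADAPT_EVERY:
        jobs.append(adapt_model)
    if now - last_retrain >= RETRAIN_EVERY:
        jobs.append(retrain_model)
    return jobs
```

<p>In a deployment, a scheduler (e.g., a cron-like service in its own container) would call such a check and run the due jobs against the Knowledge Base.</p>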
      </sec>
      <sec id="sec-2-4">
        <title>5.4. Feeder</title>
<p>The Feeder continuously monitors the streams of linked data fragments coming from the Ingestor to push real-time news information to the user (C4, C11). For example, it provides news feeds to users, suggests emerging stories, identifies trends, and allows journalists to follow current stories or developments.</p>
      </sec>
      <sec id="sec-2-5">
        <title>5.5. Retriever</title>
<p>The Retriever lets users pull information from the JKP on demand. It provides a front-end to access, visualize and explore the Knowledge Base and its services (C5). For example, the Retriever can provide an endpoint with pre-packaged SPARQL queries for particular purposes, like finding news stories related to a particular person or retrieving relevant background information for a given news event.</p>
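<p>As a sketch of such a pre-packaged query, the Retriever might keep one SPARQL template per purpose and fill it on demand. The prefixes and predicates below are illustrative assumptions; a concrete JKP would use its own ontology terms:</p>

```python
# A pre-packaged SPARQL template for "news stories mentioning a person".
# Double braces escape literal SPARQL braces in str.format().
PERSON_NEWS_QUERY = """
PREFIX schema: <http://schema.org/>
SELECT ?story ?headline WHERE {{
  ?story a schema:NewsArticle ;
         schema:headline ?headline ;
         schema:mentions <{person_iri}> .
}} LIMIT {limit}
"""

def build_person_query(person_iri: str, limit: int = 10) -> str:
    """Fill the template; the Retriever would send this to a SPARQL endpoint."""
    return PERSON_NEWS_QUERY.format(person_iri=person_iri, limit=limit)
```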
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Prototype</title>
      <sec id="sec-3-1">
<p>We are developing a prototype that instantiates the SRA for JKPs (Fig. 2). The current version implements
central parts of the Ingestor (Harvester and Lifter),
Knowledge Base (Source Text and Knowledge Graph),
Curator (Enricher and Event Detector) and Retriever (Pull
API) services. We used Docker Swarm for orchestrating
and replicating the services. The platform runs on 17
cloud instances (with a total of 38 vCPUs, 152 GB RAM and
20TB disk). The messages are serialized using JSON-LD
and passed between services using Kafka. Previous
prototypes running on a simpler infrastructure [17] have
already implemented additional parts of the Feeder,
Curator and Retriever, which we plan to adapt into the
current prototype.</p>
        <p>We designed the Ingestor with services for Harvesting
and Lifting. Our Harvester crawls news-related websites
and harvests RSS feeds, Twitter accounts, NewsAPI11
and GDELT12. The Twitter API provides real-time tweet
streams from specific accounts, geographical areas or
topics. NewsAPI aggregates and provides streams of
news articles from over 50000 news sources and blogs.</p>
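<p>A Harvester's RSS ingestion can be sketched with the standard library alone; the feed fragment and field choices below are illustrative, and a real Harvester would poll live feed URLs on a schedule:</p>

```python
import xml.etree.ElementTree as ET

# A toy RSS 2.0 fragment standing in for a live feed.
RSS = """<rss version="2.0"><channel>
  <item><title>Flooding reported in Bergen</title>
    <link>http://example.org/news/42</link></item>
  <item><title>Election results announced</title>
    <link>http://example.org/news/43</link></item>
</channel></rss>"""

def harvest(rss_text: str):
    """Extract (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]
```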
        <p>GDELT provides semi-structured information about
conflict events from news all around the world, that have
been automatically translated to English from 65 different
languages. Our Lifter [36] transforms the news and event
streams in real time into semantic knowledge
representations following the Event Description Ontology [35]. It
employs out-of-the-box NLP systems such as DBpedia
Spotlight13 and SpaCy14 for semantically annotating
textual items and linking them to Wikidata and DBpedia.</p>
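<p>A minimal sketch of an annotation record as the Lifter might emit it, loosely following NIF's offset-based model; the field names, example IRI and confidence value are our illustrative assumptions:</p>

```python
# An entity annotation: text offsets, the linked resource, plus the
# confidence and provenance information the SRA requires.
def annotate(text: str, surface: str, resource: str, tool: str, confidence: float):
    """Locate `surface` in `text` and build an offset-based annotation."""
    begin = text.index(surface)
    return {
        "beginIndex": begin,
        "endIndex": begin + len(surface),
        "taIdentRef": resource,       # linked entity, e.g. a Wikidata IRI
        "confidence": confidence,     # quality information
        "provenance": tool,           # which AI module produced the annotation
    }

ann = annotate("Flooding hits Bergen today", "Bergen",
               "http://www.wikidata.org/entity/Q26793", "DBpedia Spotlight", 0.92)
```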
<p>To integrate their outputs, we use NIF because it is the only vocabulary based on RDF and OWL for describing NLP annotations. The Knowledge Base includes a Source Texts service implemented with Apache Cassandra and a Knowledge Graph implemented with Blazegraph. Cassandra is used to store the textual information together with the IRIs of the news items represented in the Knowledge Graph. This decision allowed us to reduce the data stored in the Knowledge Graph; provide provenance by tracking news representations back to their sources; and facilitate new training material for ML models based on the current state of our system. The Curator implements an Enricher and an Event Detector. Our Enricher extends the lifted items with location-related background information extracted from DBpedia. Our Event Detector provides journalists with real-time events detected from GDELT streams. The Retriever exposes an API for pulling lifted news items from the Knowledge Base for a research challenge task15, allowing external users to interact with our system.</p>
        <p>11newsapi.org 12www.gdeltproject.org 13www.dbpedia-spotlight.org 14spacy.io</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Conclusion</title>
<p>We have proposed an SRA for JKPs to support the adoption of JKPs in newsrooms. Our architectural decisions were based on reported experiences with existing platforms, supported by our own experience developing a JKP in collaboration with industry. The SRA supports implementing, maintaining, and evolving JKPs. It represents those components and functionalities that are essential for JKPs and provides a vocabulary to compare and understand different JKP realisations. We designed the SRA to be technology independent, open-ended and long-lasting, with components and services that can evolve or be replaced and integrated with legacy systems. The SRA considers the integration of AI components and the evolution of ML models, and provides both knowledge representations and unstructured data together to facilitate the integration of neuro-symbolic AI. We have also presented a proof-of-concept implementation of a JKP that instantiates the proposed SRA.</p>
      <table-wrap id="table-1">
        <label>Table 1</label>
        <table>
          <thead><tr><th>Architecture pattern</th><th>Concern</th></tr></thead>
          <tbody>
            <tr><td>Micro-services</td><td>C1, C2</td></tr>
            <tr><td>Phi architecture</td><td>C11, C12</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="table-2">
        <label>Table 2</label>
        <caption><p>Mapping between core services and concerns</p></caption>
        <table>
          <thead><tr><th>Service</th><th>Concern</th></tr></thead>
          <tbody>
            <tr><td>Ingestor</td><td>C1, C2, C3, C7, C9</td></tr>
            <tr><td>Knowledge Base</td><td>C1, C8, C9, C11, C12</td></tr>
            <tr><td>Curator</td><td>C6, C7, C10</td></tr>
            <tr><td>Feeder</td><td>C4, C11</td></tr>
            <tr><td>Retriever</td><td>C5</td></tr>
          </tbody>
        </table>
      </table-wrap>
<p>Because JKPs handle big data, they need a processing architecture that minimizes delays and maximizes information flow. Thus, the SRA is inspired by the Phi architecture and provides a single data view through the Knowledge Base. This reduces unnecessary data duplication, does not require coordination between different data views, and makes information available as soon as it arrives in the Knowledge Base. To reduce delays further, the SRA also includes a component for pushing streams of newsworthy events to journalists.</p>
      <p>In Section 5 we have discussed how the proposed SRA implements the identified main concerns. Tables 1 and 2 summarize the mapping between the discussed concerns and the architecture patterns and core services.</p>
    </sec>
    <sec id="sec-5">
      <title>8. Further work</title>
      <p>We plan to evaluate the correctness and the utility of the proposed SRA, this being one of the main challenges of designing an SRA [40, 41]. Hence, we plan to map existing JKPs from the literature onto our proposed SRA to verify that it covers the essential components of JKPs and to evaluate its correctness. We also want to conduct a qualitative evaluation of the SRA with component developers to validate its feasibility, understandability and utility. This qualitative evaluation would be composed of questionnaires inspired by the ISO/IEC 25000 standard and a series of interviews. Besides, we plan to extend our prototype to facilitate empirical evaluation of the SRA instantiation. Possible empirical paths include:</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>