<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Doctoral Consortium, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Expressing Biological Problems with Logical Reasoning Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tommaso Alfonsi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luigi Bellomarini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Bernasconi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Ceri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Banca d'Italia</institution>
          ,
          <addr-line>00184, Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dip. di Elettronica, Informazione e Bioingegneria, Politecnico di Milano</institution>
          ,
          <addr-line>20133, Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>2</volume>
      <fpage>6</fpage>
      <lpage>28</lpage>
      <abstract>
        <p>Biology is a challenging domain that is typically tackled by experts in the field, with little or no interaction with the Web knowledge and rules interoperation community. However, data on biological aspects have grown considerably in recent decades. Moreover, the COVID-19 pandemic has marked an unprecedented point in history, where vast amounts of information have been collected in laboratories worldwide and deposited into open data banks. Inspired by current needs and backed by a solid knowledge base (our extensional knowledge source) called CoV2K, we propose to express and resolve a series of problems related to the SARS-CoV-2 virus and its interpretation. We formulate our queries as rules in Vadalog (our knowledge representation and reasoning language) and input them to its related logic-based reasoning system. Four cases are presented that allow exploring 1) variants' effects and how they are explained in the scientific literature; 2) the most typical mutations of a variant; 3) the most likely acquisition of a new mutation by a given variant and the associated reported effects; 4) the most relevant mutations of the virus according to the community. Expressing biological problems using a logic formalism is a major challenge, due to the intrinsic complexity of the domain. The four use cases show that a logical formalism is effective in expressing relevant problems for understanding the current evolution of SARS-CoV-2 variants, an essential aspect of the COVID-19 pandemic.</p>
      </abstract>
      <kwd-group>
        <kwd>Logical reasoning</kwd>
        <kwd>Vadalog</kwd>
        <kwd>Biology</kwd>
        <kwd>Virology</kwd>
        <kwd>SARS-CoV-2</kwd>
        <kwd>COVID-19</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Cancer treatment, personalized medicine, and vaccine development are only a few of the
application fields where computer science and artificial intelligence are being widely adopted to
achieve important discoveries. One apparent challenge is that of translating complex
domain-specific problems (typically conceived by biologists or clinicians) into a clear and actionable
formalization, which then enables sophisticated query answering and data analysis.</p>
      <p>The use of declarative languages for expressing and resolving biology-related problems has
thus far been sporadic. Attempts date back to 1996, when a declarative programming language
was used to describe nucleic acids’ secondary structures [1]; over the nearly thirty years since, we
can report a web-based declarative workflow language for Life Sciences [2], a layer-oriented
approach for biological modeling [3], and, more recently, Datalog extensions for bioinformatic
data analysis [4]. Declarative languages eliminate the need to program complex control flows
or to design algorithms. Instead, problems are expressed at a high level, in a readable, modular
syntax and reduced code footprint. These aspects are extremely convenient in the life sciences,
where high-level problem formalization enables immediate and simpler communication with
domain experts.</p>
      <p>Contributions are missing that fully exploit the power of declarative languages for supporting
the expression of interesting queries over biologically-relevant data, paving the way to their
resolution. To demonstrate our claim and attract interest towards this novel challenge, we draw
on a particularly current domain, that is, the data and knowledge related to SARS-CoV-2, the
virus responsible for the COVID-19 pandemic. In the past two years there has been a wealth
of available data and knowledge collected on open data repositories and literature archives;
moreover a multitude of open questions of broad research interest have been and still can be
formulated on this topic.</p>
      <p>As a starting point to this challenge, we chose a basic methodological framework that supports
our proposal with i) a Knowledge Representation and Reasoning (KRR) language (Vadalog [5]);
ii) a source of extensional knowledge (CoV2K [6]); and iii) a logic-based reasoning system (the
Vadalog System [7]). Vadalog is a KRR language of the Datalog± family [8]; it supports PTIME
ontological reasoning and high expressive power, capturing Datalog, and thus featuring full
recursion, while covering all SPARQL queries under the entailment regime of OWL2QL. The
extensional knowledge is extracted from CoV2K, a knowledge base that integrates data about
viral sequences and established information about the virus (e.g., variants, mutations, their
effects or protein structures). The Vadalog System then interprets the logic-based rules and
derives new knowledge in the form of new facts, via the reasoning process. The integration
of these three components (Vadalog, CoV2K, and the Vadalog System) allows us to express and
reason on a wide range of problems that could explain the complex mechanisms of the virus
that has spread and mutated worldwide in the last 2.5 years. We exemplify the application of
the proposed framework through four use cases. The first two introduce the reader to the use
of the knowledge base CoV2K through simple queries and aggregations. The third use case
explores a graph of frequently co-occurring genomic mutations in viral samples and then draws
conclusions on the variants that are prone to acquire novel biologically relevant features. The
last example provides two recursion examples while searching for the most relevant mutations
of the Spike protein in the scientific literature.</p>
      <p>The remainder of this paper is structured as follows. Section 2 provides a background on
declarative languages and presents the Vadalog language. Section 3 gives an overview of the
CoV2K knowledge model and how this resource is used within Vadalog. Section 4 presents a
set of interesting applications concerning the SARS-CoV-2 virus and the possibility to derive
knowledge in this domain. Finally, Section 5 discusses the implications of our approach and
draws the conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Using Declarative Languages and Their Applications</title>
      <p>At the core of our methodological approach, there is a Knowledge Graph Management System
(KGMS), a tool that enables reasoning on knowledge graphs, a data model for KRR. A knowledge
graph is essentially composed of three parts: (i) a ground extensional component, which
includes ground knowledge about a domain of interest; in logical approaches, this component
is represented as facts, i.e., ground atoms, defined on domain constants; (ii) an intensional
component, that encodes domain knowledge, for instance if a logical approach is adopted,
as a set of rules that define domain objects on the basis of others; (iii) a derived extensional
knowledge, which is the set of facts resulting from the application of the intensional logic rules
to the extensional knowledge, in the reasoning process.</p>
      <p>Formalizing complex intensional knowledge poses a number of requirements to the adopted
logical language, which must strike a favourable balance between expressive power and
computational complexity, so as to capture all the features of real-world scenarios while keeping
decidability and tractability.</p>
      <p>Specifically, in order to address a broad class of biological problems, we resort to Vadalog, a
language for ontological reasoning with support for recursion and existential quantification.
Vadalog is a language—technically, a fragment—of the broad Datalog± family of database
languages [8], whose core revolves around Warded Datalog± [9]. Warded captures Datalog (thus
incorporating full recursion) and allows the use of existential quantification; at the price of very
mild syntactic restrictions, it guarantees PTIME query answering. Moreover, the Piecewise [10]
restriction of Warded, applicable to a wide variety of settings, addresses extreme-scale settings,
thanks to the sub-polynomial complexity of the query answering task. Vadalog extends Warded
with features of practical utility such as monotonic aggregations [11], conditions, and algebraic
operations.</p>
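      <p>As in plain Datalog, Vadalog rules are first-order dependencies of the form
<inline-formula><tex-math>\forall \bar{x}\, \forall \bar{y}\, (\varphi(\bar{x}, \bar{y}) \rightarrow \exists \bar{z}\, \psi(\bar{x}, \bar{z}))</tex-math></inline-formula>,
where <inline-formula><tex-math>\varphi</tex-math></inline-formula> (the body) and <inline-formula><tex-math>\psi</tex-math></inline-formula> (the head) are conjunctions of atoms, and <inline-formula><tex-math>\bar{x}</tex-math></inline-formula>, <inline-formula><tex-math>\bar{y}</tex-math></inline-formula> and
<inline-formula><tex-math>\bar{z}</tex-math></inline-formula> are tuples of variables. The reasoning process incrementally satisfies the rules, where needed
introducing new values (namely, labelled nulls) for the existentially quantified variables <inline-formula><tex-math>\bar{z}</tex-math></inline-formula>.</p>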
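      <p>The role of labelled nulls can be illustrated with a minimal Python sketch. It is not Vadalog code and does not reflect how the Vadalog System works internally; the toy rule variant(v) → ∃c characterizationOfVariant(c, v), the predicate name variant, and the null names z1, z2 are assumptions made only for this illustration.</p>
      <preformat>
```python
from itertools import count

# Generator of fresh identifiers for labelled nulls (hypothetical naming scheme).
_null_ids = count(1)


def fresh_null():
    """Return a new labelled null, written here as the string 'z1', 'z2', ..."""
    return f"z{next(_null_ids)}"


def apply_existential_rule(facts):
    """Apply once the toy rule
        variant(v) -> exists c . characterizationOfVariant(c, v)
    A labelled null is invented for c only when v has no witness yet."""
    derived = set(facts)
    witnessed = {f[2] for f in facts if f[0] == "characterizationOfVariant"}
    for f in sorted(facts):
        if f[0] == "variant" and f[1] not in witnessed:
            derived.add(("characterizationOfVariant", fresh_null(), f[1]))
            witnessed.add(f[1])
    return derived


# Toy extensional facts: v1 already has a characterization, v2 does not.
facts = {
    ("variant", "v1"),
    ("variant", "v2"),
    ("characterizationOfVariant", "c7", "v1"),
}
result = apply_existential_rule(facts)
```
      </preformat>
      <p>In this sketch only v2 receives a labelled null, and applying the rule a second time adds nothing, since the head is already satisfied for every variant.</p>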
      <p>The Vadalog language is implemented by a fully engineered system, the Vadalog Engine [7],
currently used to compute the derived extensional component in multiple production scenarios,
so far in the financial setting.</p>
      <p>The well-known advantages of high-level declarative approaches that we have recalled,
combined with the high expressive power and scalability of Warded Datalog±, along with the
availability of a state-of-the-art execution runtime motivated our choice for Vadalog as the
reference KRR language to encode our knowledge about COVID-19, namely CoV2K [6].</p>
    </sec>
    <sec id="sec-3">
      <title>3. A Knowledge Model for SARS-CoV-2 Viral Sequences</title>
      <p>In [12, 6] we presented CoV2K, a model of concepts related to the SARS-CoV-2 virus data and
knowledge; this model drove the integration process to build a novel source of curated and
interlinked instances.</p>
      <p>[Figure 1: a portion of the CoV2K knowledge model. Labels recovered from the figure: areas VARIANTS and POSITIONS OF INTEREST; entities Effect, Evidence, Naming and Characterization; predicates effectProperties(e,type,level,method), evidenceOfEffect(ev,e), evidenceProperties(ev,cit,type,doi,publisher) and characterizationProperties(c,o,v_class).]</p>
      <p>Figure 1 illustrates a portion of the original CoV2K model, including the relevant concepts for
the use cases demonstrated in this paper. Information inside the knowledge model is represented
as a collection of entities and edges of a graph. Entities populate the areas of the knowledge
model and are named as the category of information they contain. Three distinct areas on the
left of the figure regard established knowledge. The VARIANTS area denotes the variants of
SARS-CoV-2 (i.e., viruses which have progressively evolved from the original Wuhan virus by
accumulating mutations). The same variant can be named differently and can be described by
a different set of characterizing mutations (i.e., most relevant or prevalent mutations); these
are described by the Naming and Characterization entities, which can both be assigned by
multiple organizations. Mutational processes concern specific positions of a variant sequence;
in this work, we consider the mutations that affect viral proteins, also called amino acid changes
(AA positional changes). The EFFECTS area describes effects that are relevant from the
points of view of epidemiology (e.g., viral transmission and disease severity) or immunology
(e.g., host receptor binding affinity and sensitivity to monoclonal antibodies). An Effect is
described as the consequence of a single amino acid change or of a variant. Each Effect
is associated with the source of information that reports it, i.e., Evidence from scientific
literature. The POSITIONS OF INTEREST area contains all the possible mutations (here called
AA Positional Changes), which fall within annotated regions of the genome, i.e., proteins,
and are present in some variant characterization.</p>
      <p>The two areas on the right part of the model contain instead data collected and deposited
within large databases. The SEQUENCE DATA area concerns viral samples (in the Sequence
entity) provided by laboratories with their describing metadata. Out of the many aspects that
describe each sequence, we hereby consider only the protein mutations (i.e., AA changes).
Depending on the characteristics of the viral host organism, given protein regions of the virus
may be antigenic and elicit an immune response; the EPITOPES area describes these regions by
means of the Epitope entity, whose attributes include the starting and ending position of the
region within a protein. Note that epitopes can be connected to the AA positional changes
in the POSITIONS OF INTEREST area (and the AA changes in the SEQUENCE DATA area)
when the position attribute pos of a change is included in the start-stop range of the epitope.</p>
      <p>Within and across the areas, entities are connected by relationships of various types, defining
the connection between their concepts: undirected edges for many-to-many cardinality, directed edges
for one-to-many cardinality, hollow-arrowhead edges for generalization hierarchies. Dashed
lines represent connections between parts of the model that are based on positional information.</p>
      <p>In building the CoV2K content, we employed a classical data integration process driven
by the model, with pipelines for the integration and harmonization of different data silos.
Knowledge areas are filled via the most up-to-date sources in the landscape of SARS-CoV-2-related
information, including CoVariants.org [13], Public Health England (PHE, [14]), the COG-UK
Mutation Explorer [15], ECDC [16], and several preprints or published papers deposited on
bioRxiv, medRxiv, or PubMed. As for sequence data, CoV2K includes two large
databases, GenBank and COG-UK, as extracted via our ViruSurf pipeline [17, 18]. We include
epitopes tested for SARS-CoV-2 from the Immune Epitope Database (IEDB [19]).</p>
      <p>According to our methodological framework, we selected CoV2K as the principal source
of extensional knowledge. Several key features justify this choice. First, the graph-like
organization of information into entities and relations makes the translation into predicates
straightforward. Second, the availability of already cleaned and integrated data and
knowledge-related information is a unique feature of CoV2K that greatly simplifies the process of data
ingestion. Third, CoV2K is provided online through an application programming interface (API)
at http://www.gmql.eu/cov2k/api; as APIs are meant to connect systems, a multitude of data
processing systems support them, including Vadalog. Thus, we are able to query CoV2K and
integrate the request’s output for further analysis directly inside the reasoner.</p>
      <p>In order to use CoV2K within a declarative language, we defined Vadalog rules’
head-predicates that correspond to:
<list list-type="bullet">
<list-item><p>Entities, using the notation ⟨entity⟩Properties({attributes}); for example,
effectProperties(e,type,level,method) is an atom whose variables evaluate to any instance
of the Effect entity.</p></list-item>
<list-item><p>Relationships between two entities, using the notation ⟨entity1⟩Of⟨entity2⟩(id1, id2);
for example, effectOfVariant(e,v) evaluates to all the pairs of effects e and variants v, where
e is an instance of the entity Variant effect, v is an instance of the entity Variant,
and there exists a path connecting v and e within CoV2K.</p></list-item>
</list></p>
      <p>CoV2K instances are bound to the variables inside predicates
by replacing the variables with constants. Namely, an atom such as
effectProperties(1143, “infectivity”, “higher”, “epidemiological”) is called a fact and
denotes an instance of the Effect entity having the ID 1143 and describing higher infectivity
ascertained through an epidemiological study. Rules’ body-predicates are freely named
according to the semantics of the use cases.</p>
      <p>
        Furthermore, we present a recurrent predicate form that groups several instances
of the CoV2K model into a set-like variable. To this aim, we adopt operators denoting
monotonic aggregation, such as “munion”, which is used to recursively form a set as the union of its
components. In the example rule (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ), we iterate over the variables of the atom
aaChangeOfCharacterization and, for every characterization c, we add the amino acid changes a of c to the set
A.
      </p>
      <p>aaChangeOfCharacterization(a, c), A = munion(a)
→ aaChangeSetOfCharacterization(A, c)
(1)</p>
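      <p>The effect of rule (1) can be emulated outside the reasoner with a few lines of Python. This is only a sketch of the grouping semantics, not of Vadalog’s monotonic aggregation machinery; the function name and the identifiers a1, a2, c1, c2 are placeholders chosen for this illustration.</p>
      <preformat>
```python
from collections import defaultdict


def aa_change_set_of_characterization(aa_change_of_characterization):
    """Emulate rule (1): for each characterization c, collect the set A
    of all amino acid changes a such that aaChangeOfCharacterization(a, c)."""
    sets = defaultdict(set)
    for a, c in aa_change_of_characterization:
        sets[c].add(a)  # union with the singleton {a}
    return {c: frozenset(changes) for c, changes in sets.items()}


# Toy facts (a, c); identifiers are placeholders, not CoV2K instances.
facts = [("a1", "c1"), ("a2", "c1"), ("a1", "c2"), ("a3", "c2"), ("a1", "c1")]
change_sets = aa_change_set_of_characterization(facts)
```
      </preformat>
      <p>Because a set union is idempotent, the repeated fact (a1, c1) does not change the result, which mirrors the set semantics of the aggregated variable A.</p>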
    </sec>
    <sec id="sec-4">
      <title>4. Applying Vadalog to Viral Genomics Problems</title>
      <p>In the following, we show a number of formalized examples of biological problems in the
domain of viral sequences. Following a pedagogical approach, the first two examples are meant
for readers to get acquainted with the Vadalog syntax and the predicates representing the
knowledge graph of CoV2K. The third use case presents a more advanced application that
uses several key features of Vadalog: monotonic aggregates, graph exploration and Set objects.
The example proposes possible new SARS-CoV-2 variant characterizations, suggested by the
co-occurrence of new amino acid changes with existing current variant characterizing changes.
The immunogenic consequence of each acquired change is then evaluated through CoV2K by
exploring the associated Change effect and the newly modified Epitope regions. The last
example finds the most discussed amino acid changes in the scientific literature by traversing
the graph generated by references of scientific articles. In doing so, we provide an example of a
recursive procedure and show how it is implemented in Vadalog.</p>
      <sec id="sec-4-1">
        <title>4.1. Most Represented Effects of Variants in Published Scientific Articles</title>
        <p>
          Variants differ in their genomic characteristics and hence in their effects. Here, we are interested
in finding the variant effects that are most discussed in the scientific community.
This can be formalized as in rule (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ).
        </p>
        <p>effectOfVariant(e, v),
effectProperties(e, type, level, method),
evidenceOfEffect(ev, e),
evidenceProperties(ev, cit, “published”, doi, publisher)
→ publishedVariantEffects(type, level, method, count())
(2)</p>
        <p>In the rule body, we join different predicates from the CoV2K EFFECTS area through the effect
e: effectOfVariant, effectProperties, evidenceOfEffect, and evidenceProperties. We select only the
effects from peer-reviewed scientific articles by setting the value “published” as the evidence
type. Finally, in the rule head, we count the occurrences of every combination of the effect’s type,
level and method attributes. The count for the tuple ⟨type, level, method⟩ is obtained by
counting the variants v and the pieces of evidence ev that share the effect. By sorting the values
in publishedVariantEffects by descending count, we obtain the most cited variant effects. As
shown in Table 1, the most relevant results are those related to lower effectiveness of vaccines /
sensitivity to convalescent sera and to higher risk of hospitalization / disease severity.
        </p>
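        <p>As a rough illustration of what rule (2) computes, the following Python sketch performs the same join and grouped count over in-memory tuples. It assumes that the count is the number of joined ⟨variant, evidence⟩ rows per ⟨type, level, method⟩ tuple, which is one reading of the text above; the toy facts and the evidence type “preprint” are invented for the example.</p>
        <preformat>
```python
from collections import Counter


def published_variant_effects(effect_of_variant, effect_properties,
                              evidence_of_effect, evidence_properties):
    """Emulate rule (2): join the four relations on the effect e and count
    the joined rows for each (type, level, method) tuple."""
    props = {e: (typ, level, method) for e, typ, level, method in effect_properties}
    published = {ev for ev, _cit, typ, _doi, _pub in evidence_properties
                 if typ == "published"}
    variants_by_effect = {}
    for e, v in effect_of_variant:
        variants_by_effect.setdefault(e, set()).add(v)
    counts = Counter()
    for ev, e in set(evidence_of_effect):
        if ev in published and e in props:
            # One joined row per variant sharing the effect e.
            counts[props[e]] += len(variants_by_effect.get(e, ()))
    return counts.most_common()  # sorted by descending count


# Toy facts; all identifiers and values are placeholders.
effect_of_variant = [("e1", "v1"), ("e1", "v2"), ("e2", "v1")]
effect_properties = [
    ("e1", "vaccine effectiveness", "lower", "experimental"),
    ("e2", "disease severity", "higher", "epidemiological"),
]
evidence_of_effect = [("ev1", "e1"), ("ev2", "e1"), ("ev3", "e2")]
evidence_properties = [
    ("ev1", "cit1", "published", "doi1", "pub1"),
    ("ev2", "cit2", "preprint", "doi2", "pub2"),
    ("ev3", "cit3", "published", "doi3", "pub3"),
]
ranking = published_variant_effects(effect_of_variant, effect_properties,
                                    evidence_of_effect, evidence_properties)
```
        </preformat>
        <p>The evidence ev2 is discarded because it is not of type “published”, so the first tuple in the ranking is counted twice (once per variant) and the second once.</p>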
      </sec>
      <sec id="sec-4-2">
        <title>4.2. The Delta Variant Characterizations</title>
        <p>Since the beginning of the pandemic, several organizations provided their own nomenclature
and description for variants. The CoV2K knowledge model clusters the different namings of each
variant; for example, the well-known Alpha variant (as named by the World Health Organization,
WHO [20]) was also called B.1.1.7 and VOC-20DEC-01, among others. A user may want to
request the variant’s characterization based on one of the many available names. Suppose
we are interested in the Delta variant; we traverse the edge predicate namingOfVariant, by
setting “Delta” as the naming attribute. From the variant v, we can find the characterization
c of a variant according to the organization org. We then obtain the amino acid changes that
are part of the characterization through the relation aaChangeOfCharacterization. As multiple
organizations may provide the characterization of a variant, the inclusion of org in the rule’s
head indicates the source.</p>
        <p>namingOfVariant(“Delta”, v),
characterizationOfVariant(c, v),
characterizationProperties(c, org, v_class),
aaChangeOfCharacterization(a, c)
→ deltaCharacterization(v, org, a)
(3)</p>
        <p>
          Rule (
          <xref ref-type="bibr" rid="ref3">3</xref>
          ) produces triplets ⟨v, org, a⟩, where each amino acid change a is associated with the
variant v and the organization org that provided the characterization. With an additional
rule (
          <xref ref-type="bibr" rid="ref4">4</xref>
          ) we build one set of mutations for each pair ⟨v, org⟩. The result of rule (
          <xref ref-type="bibr" rid="ref4">4</xref>
          ) clarifies
that the WHO-named Delta variant maps to two distinct but similar variants v, commonly
referred to in the scientific community as “B.1.617.1” and “B.1.617.2” using the Pango [21]
nomenclature. CoV2K provides two complete characterizations for each of them, according
to the variant descriptions published by the organizations (org) Public Health England (PHE)
and Covariants.org. As a result, according to PHE, the variants “B.1.617.1” and “B.1.617.2”
are described with 10 and 8 amino acid changes, respectively, while Covariants includes many more,
reaching 43 and 22 characteristic amino acid changes, respectively. In Table 2 we show some of the
amino acid changes corresponding to rule (
          <xref ref-type="bibr" rid="ref4">4</xref>
          ).
        </p>
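        <p>The two steps of this use case can be sketched in Python as follows. The first function mirrors the join of rule (3); the second groups the triplets by ⟨v, org⟩, which is our guess at what rule (4) does, since its text is not reproduced here. Function names, the class label “class1” and all identifiers in the toy data are hypothetical.</p>
        <preformat>
```python
def delta_characterization(naming_of_variant, characterization_of_variant,
                           characterization_properties,
                           aa_change_of_characterization, name="Delta"):
    """Emulate rule (3): triplets (v, org, a) for every variant named `name`."""
    variants = {v for n, v in naming_of_variant if n == name}
    org_of = {c: org for c, org, _v_class in characterization_properties}
    triplets = set()
    for c, v in characterization_of_variant:
        if v in variants and c in org_of:
            for a, c2 in aa_change_of_characterization:
                if c2 == c:
                    triplets.add((v, org_of[c], a))
    return triplets


def group_changes(triplets):
    """Assumed behaviour of rule (4): one set of changes per pair (v, org)."""
    grouped = {}
    for v, org, a in triplets:
        grouped.setdefault((v, org), set()).add(a)
    return grouped


# Toy facts; identifiers are placeholders, not CoV2K content.
naming = [("Delta", "v1"), ("Delta", "v2"), ("Alpha", "v3")]
char_of_variant = [("c1", "v1"), ("c2", "v1"), ("c3", "v2"), ("c4", "v3")]
char_props = [("c1", "PHE", "class1"), ("c2", "Covariants", "class1"),
              ("c3", "PHE", "class1"), ("c4", "PHE", "class1")]
aa_of_char = [("a1", "c1"), ("a2", "c1"), ("a1", "c2"), ("a2", "c2"),
              ("a3", "c2"), ("a4", "c3"), ("a5", "c4")]

grouped = group_changes(
    delta_characterization(naming, char_of_variant, char_props, aa_of_char))
```
        </preformat>
        <p>Keeping org in the grouping key is what lets the two organizations’ descriptions of the same variant remain distinct in the output.</p>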
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Severity of Putative New SARS-CoV-2 Variants</title>
        <p>Since the beginning of the pandemic, hundreds of SARS-CoV-2 variants spread across the world,
and some of them completely replaced the other circulating variants in a few weeks; this is
the case of the Delta variant, for example. Variants are essentially a collection of amino acid
changes that are found to consistently co-occur in sequences. Such amino acid changes can bring
substantial advantages or disadvantages to the virus in terms of infectivity and infection severity,
among other aspects. Since a new variant could spread quickly and bring unexpected threats to
living beings, it is important to monitor circulating viral sequences. Putting in place methods
that anticipate variants helps to develop adequate countermeasures against the evolving
pandemic. In this use case, we focus on the characterization of previously unobserved and
possibly emerging SARS-CoV-2 variants. Then, using CoV2K, we discover the effects possibly
inherited by the novel characterization as well as their consequences from the point of view of
immunogenicity.</p>
        <p>
          Using CoV2K, we can browse millions of SARS-CoV-2 sequences collected worldwide and look
at their genomic features. Changes inside a sequence are not statistically independent events (for
biological reasons pertaining to chemical stability and protein folding mechanisms); indeed,
some changes are found to co-occur in viral samples with a high frequency. The pairs of amino
acid changes that are frequently co-occurring are called coPairs in rule (
          <xref ref-type="bibr" rid="ref5">5</xref>
          ). We assign to each
such pair a weight w corresponding to the number of sequences that exhibit the co-occurrence
relation divided by the total number of sequences in the database (totalSequences).
        </p>
        <p>aaChangeOfSeq(a1, s), aaChangeOfSeq(a2, s), a1 != a2,
w = count(s) / totalSequences
→ coPairs(a1, a2, w)
(5)</p>
        <p>Considering a simplified problem setting, we assume that the frequent co-occurrence of two
amino acid changes is connected to an advantageous characteristic for the spread of the virus.
Therefore, given a frequently co-occurring pair of amino acid changes, there may be a high
chance of observing the same pair of changes also in future sequences. In fact, this assumption
is only valid in specific circumstances, which are not considered here for simplicity.
        </p>
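        <p>A minimal Python sketch of the weight computation in rule (5) is given below. It assumes that count(s) is grouped by the ordered pair ⟨a1, a2⟩ and that totalSequences is supplied by the caller; no frequency threshold is applied, since none appears in the rule as printed. The toy sequences and changes are placeholders.</p>
        <preformat>
```python
from collections import Counter
from itertools import permutations


def co_pairs(aa_change_of_seq, total_sequences):
    """Emulate rule (5): weight of each ordered pair (a1, a2), a1 != a2,
    as the fraction of sequences in which both changes occur."""
    changes_by_seq = {}
    for a, s in aa_change_of_seq:
        changes_by_seq.setdefault(s, set()).add(a)
    counts = Counter()
    for changes in changes_by_seq.values():
        # permutations never pairs an element with itself, hence a1 != a2.
        for a1, a2 in permutations(sorted(changes), 2):
            counts[(a1, a2)] += 1
    return {pair: n / total_sequences for pair, n in counts.items()}


# Toy facts (a, s); a fourth sequence without changes is assumed to exist.
aa_change_of_seq = [("a1", "s1"), ("a2", "s1"),
                    ("a1", "s2"), ("a2", "s2"), ("a3", "s2"),
                    ("a1", "s3")]
pairs = co_pairs(aa_change_of_seq, total_sequences=4)
```
        </preformat>
        <p>Both orderings of a pair are produced, as in the rule, so that later joins can bind either change to a1.</p>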
        <p>We can model a putative new SARS-CoV-2 variant (named novelVariantCharacterization in the
following) as an existing variant that acquires a change appearing in a frequently co-occurring
pair of amino acid changes. First, we consider the set A of characterizing changes of variant v,
computed as in rule (6).</p>
        <p>aaChangeSetOfCharacterization(A, c), characterizationOfVariant(c, v)
→ aaChangeSetOfVariant(A, v)
(6)</p>
        <p>
          Then, for each pair ⟨a1, a2⟩ in a coPairs relation (where a1 is part of A and a2 is not), our rule (
          <xref ref-type="bibr" rid="ref7">7</xref>
          )
builds a novelChangeForVariant(v, a2). The result is equipped with a weight w<sub>a2</sub>, representing
the maximum weight w<sub>a1,a2</sub> among the coPairs built between a2 and any change of A.
        </p>
        <p>aaChangeSetOfVariant(A, v),
a1 ∈ A, a2 ∉ A, coPairs(a1, a2, w<sub>a1,a2</sub>),
w<sub>a2</sub> = max(w<sub>a1,a2</sub>)
→ novelChangeForVariant(v, a2, w<sub>a2</sub>)
(7)</p>
        <p>
          With rule (
          <xref ref-type="bibr" rid="ref8">8</xref>
          ), we generate a novelVariantCharacterization by joining a novel change with an
existing aaChangeSetOfVariant. The third line in the rule’s body references the definition given
in rule (
          <xref ref-type="bibr" rid="ref9">9</xref>
          ). This line excludes any potential characterization coinciding
with an already existing variant characterization. To this aim, in the body of rule (
          <xref ref-type="bibr" rid="ref8">8</xref>
          ), we generate
a new variant characterization B, made of the characterization of v plus a, and check that it is
not present in anyExistingCharacterization.
        </p>
        <p>novelChangeForVariant(v, a, w),
aaChangeSetOfVariant(A, v),
¬anyExistingCharacterization(B), B = A ∪ {a}
→ novelVariantCharacterization(v, A, a, w)
(8)</p>
        <p>aaChangeSetOfVariant(B, v) → anyExistingCharacterization(B)
(9)</p>
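        <p>The interplay of rules (7), (8) and (9) can be followed on a small example. The Python sketch below is an emulation under simplifying assumptions: change sets are frozensets, coPairs is a dictionary from ordered pairs to weights, and the maximum in rule (7) is grouped by ⟨v, a2⟩. Function names and toy data are hypothetical.</p>
        <preformat>
```python
def novel_change_for_variant(change_sets, co_pairs):
    """Emulate rule (7): for each variant v and change a2 outside A, keep the
    maximum weight among coPairs(a1, a2, w) with a1 in A."""
    best = {}
    for v, A in change_sets:
        for (a1, a2), w in co_pairs.items():
            if a1 in A and a2 not in A:
                best[(v, a2)] = max(w, best.get((v, a2), 0.0))
    return best


def novel_variant_characterization(change_sets, co_pairs):
    """Emulate rules (8) and (9): extend A with a novel change a unless the
    resulting set B already is the characterization of some variant."""
    existing = {A for _v, A in change_sets}  # rule (9)
    novel = novel_change_for_variant(change_sets, co_pairs)
    result = set()
    for v, A in change_sets:
        for (v2, a), w in novel.items():
            B = A | {a}
            if v2 == v and B not in existing:
                result.add((v, A, a, w))
    return result


# Toy facts; identifiers and weights are placeholders.
change_sets = [("v1", frozenset({"a1", "a2"})),
               ("v2", frozenset({"a1", "a2", "a3"}))]
weights = {("a1", "a3"): 0.4, ("a2", "a3"): 0.6,
           ("a1", "a4"): 0.1, ("a3", "a4"): 0.3}
novel = novel_variant_characterization(change_sets, weights)
```
        </preformat>
        <p>In the toy data, v1 extended with a3 is rejected because the resulting set coincides with the characterization of v2, while the extensions with a4 are kept with their best weights.</p>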
        <p>
          We can then find the effects of a novelVariantCharacterization through rule (
          <xref ref-type="bibr" rid="ref10">10</xref>
          ), where we join
the novel amino acid change with effectOfAAChange and effectProperties; a small subset of the
extracted novelVariantEffect facts is presented in Table 3.
        </p>
        <p>novelVariantCharacterization(v, A, a, w),
effectOfAAChange(e, a), effectProperties(e, type, level, method)
→ novelVariantEffect(v, a, w, type, level, method)
(10)</p>
        <p>
          Some amino acid changes can be more relevant than others depending on their position
along the viral sequence; special regions along the viral proteins, called epitopes, are short
amino acid sequences recognised by the host organism as foreign, and are thus capable of
eliciting an immunogenic response. The host organism may fail to recognize a mutated epitope,
so these regions are typically of great interest to scientists involved in immunological studies
and vaccine development; the typical purpose of vaccines is to help the organism
recognize the virus epitopes. Amino acid changes can be linked to their enclosing epitopes
using positional information. In rule (
          <xref ref-type="bibr" rid="ref11">11</xref>
          ), we extract pairs of amino acid changes a and epitopes
e that are aaChangeModifyingEpitope.
        </p>
        <p>aaChangeProperties(a, protein, pos, ref, alt, type, length),
epitopeProperties(e, protein, “human”, start, stop, string, iedb_link),
start ≤ pos ≤ stop
→ aaChangeModifyingEpitope(a, e)
(11)</p>
        <p>
          As we are interested in the new epitopes modified by the acquisition of a novel change into
a variant characterization, we ignore the epitopes that are already modified. Rule (
          <xref ref-type="bibr" rid="ref12">12</xref>
          ) joins
aaChangeOfCharacterization with aaChangeModifyingEpitope and recursively forms the set E as
the union of all the epitopes modified by a characterization. Then, we join the characterization
to its variant.
        </p>
        <p>aaChangeModifyingEpitope(a, e), E = munion(e),
aaChangeOfCharacterization(a, c)
→ modifiedEpitopesOfCharacterization(E, c)</p>
        <p>modifiedEpitopesOfCharacterization(E, c),
characterizationOfVariant(c, v)
→ modifiedEpitopesOfVariant(E, v)
(12)</p>
        <p>
          In rule (
          <xref ref-type="bibr" rid="ref13">13</xref>
          ), by joining the amino acid change a of a novelVariantCharacterization with
aaChangeModifyingEpitope, we can find epitopes that could potentially be mutated by the new characterization
of the variant. While doing this, we exclude those epitopes that are already modified by
the same variant: the epitope modified by the novel change must not appear in the
modifiedEpitopesOfVariant of v. In addition to the variant, the change and the weight, our result
(shown, e.g., in Table 4) includes the epitope start and stop coordinates, the sequence of amino
acids (string) forming the epitope and the corresponding link (iedb_link) to the epitope’s
data source, i.e., IEDB.
        </p>
        <p>novelVariantCharacterization(v, A, a, w), aaChangeModifyingEpitope(a, e),
modifiedEpitopesOfVariant(E, v), e ∉ E,
epitopeProperties(e, protein, "human", start, stop, string, iedb_link)
→ novelModifiedEpitope(v, a, start, stop, string, w, iedb_link) (13)
        </p>
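        <p>As an illustration only, the set-based reading of rules (12) and (13) can be sketched in Python over a handful of made-up facts. The identifiers used below (a1, e1, c1, v1 and the weight 0.4) are hypothetical placeholders, not CoV2K content. The sketch assumes one characterization per variant and omits the join with epitopeProperties for brevity; it is not the Vadalog implementation.</p>
        <preformat>
```python
from collections import defaultdict

# Hypothetical toy facts; the identifiers are placeholders, not CoV2K data.
aa_change_modifying_epitope = [("a1", "e1"), ("a2", "e2"), ("a3", "e1"), ("a3", "e3")]
aa_change_of_characterization = [("a1", "c1"), ("a2", "c1")]
characterization_of_variant = [("c1", "v1")]
# Tuples of (variant, current changes A, novel change a, weight w).
novel_variant_characterization = [("v1", ("a1", "a2"), "a3", 0.4)]


def modified_epitopes_of_variant():
    """Rule (12): collect the epitopes modified by each characterization,
    then attach every set to the variant of that characterization."""
    by_characterization = defaultdict(set)
    for change, characterization in aa_change_of_characterization:
        for other_change, epitope in aa_change_modifying_epitope:
            if other_change == change:
                by_characterization[characterization].add(epitope)
    return {
        variant: by_characterization[characterization]
        for characterization, variant in characterization_of_variant
        if characterization in by_characterization
    }


def novel_modified_epitopes():
    """Rule (13): keep the epitopes hit by the novel change that the
    variant does not already modify."""
    modified = modified_epitopes_of_variant()
    results = []
    for variant, _changes, change, weight in novel_variant_characterization:
        if variant not in modified:
            continue  # no modifiedEpitopesOfVariant fact, so the rule cannot fire
        for other_change, epitope in aa_change_modifying_epitope:
            if other_change == change and epitope not in modified[variant]:
                results.append((variant, change, epitope, weight))
    return results


print(novel_modified_epitopes())
```
        </preformat>
        <p>On these toy facts, the novel change a3 touches epitopes e1 and e3, but e1 is already modified by the variant through a1, so only e3 is reported.</p>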
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Amino Acid Changes Relevant for the Scientific Community</title>
        <p>Among the methods employed by COVID-19 vaccines to defend against infection, one is to
teach the immune system to recognize and neutralise the Spike protein. This particular protein
surrounds the external surface of the viral capsid and starts the infection process by hooking
into the host cells. However, while the virus can quickly evolve and adapt to the environment,
vaccines take a long time to develop and update. Hence, a highly mutated variant of the
Spike protein could be treated as an unknown sequence by the host, evading the protection
given by the current vaccines. As such, biologists typically look at specific changes in the amino
acid sequence of the Spike protein with much interest.</p>
        <p>
          Based on this premise, the goal of this example is to find which Spike amino acid changes are
the most studied by scientists. To measure the interest of the scientific community, we resort
to the list of published articles discussing a change. In rule (14), we look for the amino acid
changes falling inside the Spike protein through the predicate aaChangeProperties, where we set
the protein value to "Spike". We join the changes with their effects and the articles discussing
them. The predicate citedBy binds any Spike amino acid change a to the Digital Object
Identifier (DOI) of the associated evidence, neglecting the information about the effect as it is
not relevant for our goal. Furthermore, we annotate citedBy with the distance (hops) between
the change and the evidence discussing it. As rule (14) explores the pieces of evidence that are
directly reported in CoV2K, we assume the distance to be the minimum value, equal to 1.
        </p>
        <p>aaChangeProperties(a, "Spike", pos, ref, alt, type, len),
effectOfAAChange(e, a), evidenceOfEffect(ev, e),
evidenceProperties(ev, cit, "published", doi, publisher), hops = 1
→ citedBy(a, doi, hops) (14)</p>
        <p>
          In addition, we can enrich the previous analysis by considering the resonance of a piece of information
in the scientific literature beyond its first appearance. To do so, we query the OpenCitations [22]
database through the function findCitations<sup>1</sup>, which returns a list of DOIs referencing the input
argument. This function is used inside rule (15) to find the scientific papers citing the articles found
in the previous rule. From an article in citedBy, we call findCitations, obtaining a list
of scientific papers citing the original one. The citationList is stored in a predicate citedByList
and connected to the referenced amino acid change a, the size of the list (lastIndex), and the
updated distance value (hops).
        </p>
        <p>citedBy(a, article, hops), citationList = findCitations(article),
lastIndex = size(citationList), hops ≤ threshold
→ citedByList(a, citationList, lastIndex, hops + 1) (15)</p>
        <p>
          The procedure is repeated for every article in citedBy, until an arbitrary threshold
value for hops is reached; further iterations are triggered whenever new atoms of citedBy are created as
an effect of rule (16).
        </p>
        <p>citedByList(a, citationList, lastIndex, hops),
article = citationList[lastIndex], lastIndex &gt; 0
→ citedByList(a, citationList, lastIndex − 1, hops), citedBy(a, article, hops) (16)</p>
        <p>
          The purpose of rule (16) is to unpack a citationList into individual atoms of citedBy. The
generated instances of citedBy connect each item of citationList to the amino acid change from
citedByList. As new citedBy atoms are derived from rule (16), rule (15) runs again in a recursive
fashion to find additional citations of a change, up to the threshold distance.
        </p>
        <p>
          Rule (16) is also recursive: it considers a citedByList atom and uses the value of lastIndex
to access the items of the list. At every iteration, the rule selects a new article from citationList
using the current value of lastIndex; the value is then decreased by one and saved into a new
citedByList atom. The procedure continues until lastIndex becomes zero, i.e., when all the list items
have been visited.
        </p>
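        <p>The interplay between rules (15) and (16) can be sketched, under stated assumptions, as a bounded expansion over a citation graph. In the Python sketch below, the dictionary CITING and the function find_citations are hypothetical stand-ins for the OpenCitations lookup, and the DOIs and the change identifier a1 are placeholders; the loop mirrors the 1-based countdown on lastIndex and the hops ≤ threshold guard, but it is not the Vadalog evaluation strategy.</p>
        <preformat>
```python
# Hypothetical citation graph standing in for OpenCitations:
# each DOI maps to the list of DOIs that cite it.
CITING = {
    "doi:A": ["doi:B", "doi:C"],
    "doi:B": ["doi:D"],
    "doi:C": [],
    "doi:D": ["doi:E"],
}


def find_citations(article):
    """Stub for findCitations; the real function queries an external service."""
    return CITING.get(article, [])


def expand_cited_by(seed_facts, threshold):
    """Apply rules (15) and (16) to citedBy facts until nothing new is derived.

    Facts are tuples (change, article, hops).
    """
    cited_by = set(seed_facts)
    frontier = list(seed_facts)
    while frontier:
        change, article, hops = frontier.pop()
        if hops > threshold:
            continue  # rule (15) only fires while hops does not exceed the threshold
        citation_list = find_citations(article)
        last_index = len(citation_list)
        # Rule (16): visit the list from its last item down to the first one,
        # using a 1-based index that is decreased at every step.
        while last_index > 0:
            fact = (change, citation_list[last_index - 1], hops + 1)
            if fact not in cited_by:
                cited_by.add(fact)
                frontier.append(fact)
            last_index -= 1
    return cited_by


result = expand_cited_by([("a1", "doi:A", 1)], threshold=2)
print(sorted(result))
```
        </preformat>
        <p>With a threshold of 2, the article at distance 2 is still expanded, so a fact at distance 3 is derived, while its own citations are not followed any further.</p>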
        <p><sup>1</sup> As Vadalog supports the invocation of external Java methods, we packed the instructions for making API
requests inside a Java archive (JAR) that exposes the method findCitations.</p>
        <p>Finally, rule (17) counts the number of articles that reference Spike amino acid changes, by
aggregating the atoms of citedBy by change and hops. According to our initial assumption, the
value of citations reflects the interest of the scientific community. In the results (e.g., see Table 5)
we just list, for each amino acid change, the counts of citations for each hop. Note that the first
and third reported changes are located in a specific area of the Spike protein – the Receptor
Binding Domain – that is of particular interest as it corresponds to the first access point of the
virus into hosts’ cells [23].</p>
        <p>citedBy(a, article, hops), citations = count(article)
→ aaChangeRelevance(a, citations, hops) (17)</p>
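        <p>The grouping performed by rule (17) amounts to counting articles per pair of change and distance. A minimal Python sketch follows; the facts are hypothetical placeholders, and counting each distinct article once per group is our assumption about how the aggregation treats duplicates.</p>
        <preformat>
```python
from collections import Counter


def aa_change_relevance(cited_by):
    """Count the distinct citing articles for every (change, hops) group."""
    distinct_facts = set(cited_by)  # drop duplicate (change, article, hops) facts
    counts = Counter((change, hops) for change, _article, hops in distinct_facts)
    return dict(counts)


# Hypothetical citedBy facts: (change, article DOI, hops).
facts = [
    ("a1", "doi:A", 1),
    ("a1", "doi:B", 2),
    ("a1", "doi:C", 2),
    ("a2", "doi:A", 1),
]
relevance = aa_change_relevance(facts)
print(relevance)
```
        </preformat>
        <p>Each key pairs a change with a distance, and the value plays the role of the citations argument of aaChangeRelevance.</p>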
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Conclusions</title>
      <p>In this paper, we used a knowledge base and a logical reasoning language to explore a number
of crucial aspects of the evolution of SARS-CoV-2 variants, which are important, in turn, for
understanding the progress of the COVID-19 pandemic. We resorted to CoV2K, which collects
information about SARS-CoV-2 in a knowledge graph, as it is the result of a significant data
abstraction and integration process that exploits large databases as well as ontological
resources. In our declarative problem descriptions, we used several key features of Vadalog,
including recursion, grouping with counting, and monotonic aggregation, whose expressive
power was sufficient for our needs.</p>
      <p>The methodological framework discussed here showcases how declarative languages
such as Vadalog can be applied to biological problems. Indeed, declarative languages allow
answering complex queries using clearly stated logical formulae, favoring the progressive
modeling and understanding of the essence of problems, both during the rule production phase and
later, when the rules are shared and maintained. In contrast, biologists typically face these problems
through low-level abstractions, often based upon custom scripts that are hard to explain and to maintain.
Thus, this paper introduces a major challenge: can a logical language effectively be used in
biology? The challenge is quite hard, as biology is intrinsically a very complex application
domain. Our experience offers some concrete use cases as a proof of concept and provides
a first step in this direction.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research is funded by the ERC Advanced Grant 693174 GeCo (Data-Driven Genomic</p>
      <p>[17] T. Alfonsi, P. Pinoli, A. Canakoglu, High performance integration pipeline for viral and epitope sequences, BioTech 11 (2022) 7.</p>
      <p>[18] A. Canakoglu, P. Pinoli, A. Bernasconi, T. Alfonsi, D. P. Melidis, S. Ceri, ViruSurf: an integrated database to investigate viral sequences, Nucleic Acids Research 49 (2021) D817–D824.</p>
      <p>[19] R. Vita, S. Mahajan, J. A. Overton, S. K. Dhanda, S. Martini, J. R. Cantrell, D. K. Wheeler, A. Sette, B. Peters, The immune epitope database (IEDB): 2018 update, Nucleic Acids Research 47 (2019) D339–D343.</p>
      <p>[20] World Health Organization, Tracking SARS-CoV-2 variants, 2022. URL: https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/, last accessed: June 30th, 2022.</p>
      <p>[21] A. Rambaut, E. C. Holmes, Á. O’Toole, V. Hill, J. T. McCrone, C. Ruis, L. du Plessis, O. G. Pybus, A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology, Nature Microbiology 5 (2020) 1403–1407.</p>
      <p>[22] S. Peroni, D. Shotton, OpenCitations, an infrastructure organization for open scholarship, Quantitative Science Studies 1 (2020) 428–444.</p>
      <p>[23] J. Lan, J. Ge, J. Yu, S. Shan, H. Zhou, S. Fan, Q. Zhang, X. Shi, Q. Wang, L. Zhang, et al., Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor, Nature 581 (2020) 215–220.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Billoud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kontic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Viari</surname>
          </string-name>
          ,
          <article-title>Palingol: a declarative programming language to describe nucleic acids' secondary structures and to scan sequence databases</article-title>
          ,
          <source>Nucleic Acids Research</source>
          <volume>24</volume>
          (
          <year>1996</year>
          )
          <fpage>1395</fpage>
          -
          <lpage>1403</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jamil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>El-Hajj-Diab</surname>
          </string-name>
          ,
          <article-title>Bioflow: A web-based declarative workflow language for life sciences</article-title>
          , in:
          <source>2008 IEEE Congress on Services-Part I</source>
          , IEEE,
          <year>2008</year>
          , pp.
          <fpage>453</fpage>
          -
          <lpage>460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Raikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>De Schutter</surname>
          </string-name>
          ,
          <article-title>The layer-oriented approach to declarative languages for biological modeling</article-title>
          ,
          <source>PLoS Computational Biology</source>
          <volume>8</volume>
          (
          <year>2012</year>
          )
          <fpage>e1002521</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Seo</surname>
          </string-name>
          ,
          <article-title>Datalog extensions for bioinformatic data analysis</article-title>
          ,
          in:
          <source>2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>1303</fpage>
          -
          <lpage>1306</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bellomarini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gottlob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pieris</surname>
          </string-name>
          , E. Sallinger,
          <article-title>Swift logic for big data and knowledge graphs</article-title>
          , in:
          <source>IJCAI</source>
          , ijcai.org,
          <year>2017</year>
          , pp.
          <fpage>2</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Alfonsi</surname>
          </string-name>
          , R. Al Khalaf,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ceri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bernasconi</surname>
          </string-name>
          ,
          <article-title>CoV2K model, a comprehensive representation of SARS-CoV-2 knowledge and data interplay</article-title>
          ,
          <source>Scientific Data</source>
          <volume>9</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bellomarini</surname>
          </string-name>
          , E. Sallinger,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gottlob</surname>
          </string-name>
          ,
          <article-title>The Vadalog system: Datalog-based reasoning for knowledge graphs</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>11</volume>
          (
          <year>2018</year>
          )
          <fpage>975</fpage>
          -
          <lpage>987</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Calì</surname>
          </string-name>
          , G. Gottlob,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lukasiewicz</surname>
          </string-name>
          ,
          <article-title>A general datalog-based framework for tractable query answering over ontologies</article-title>
          ,
          <source>J. Web Semant</source>
          .
          <volume>14</volume>
          (
          <year>2012</year>
          )
          <fpage>57</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Gottlob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pieris</surname>
          </string-name>
          ,
          <article-title>Beyond SPARQL under OWL 2 QL entailment regime: Rules to the rescue</article-title>
          , in: IJCAI, AAAI Press,
          <year>2015</year>
          , pp.
          <fpage>2999</fpage>
          -
          <lpage>3007</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Berger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gottlob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pieris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sallinger</surname>
          </string-name>
          ,
          <article-title>The space-efficient core of Vadalog</article-title>
          ,
          <source>ACM Trans. Database Syst</source>
          .
          <volume>47</volume>
          (
          <year>2022</year>
          )
          <fpage>1:1</fpage>
          -
          <lpage>1:46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zaniolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Interlandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shkapsky</surname>
          </string-name>
          , T. Condie,
          <article-title>Fixpoint semantics and optimization of recursive datalog programs with aggregates</article-title>
          ,
          <source>CoRR</source>
          abs/1707.05681 (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Al Khalaf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Alfonsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ceri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bernasconi</surname>
          </string-name>
          ,
          <article-title>CoV2K: A Knowledge Base of SARS-CoV-2 Variant Impacts</article-title>
          , in: S. Cherfi,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perini</surname>
          </string-name>
          , S. Nurcan (Eds.),
          <source>Research Challenges in Information Science</source>
          , Springer International Publishing, Cham,
          <year>2021</year>
          , pp.
          <fpage>274</fpage>
          -
          <lpage>282</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E. B.</given-names>
            <surname>Hodcroft</surname>
          </string-name>
          ,
          <source>CoVariants: SARS-CoV-2 Mutations and Variants of Interest</source>
          ,
          <year>2022</year>
          . URL: https://covariants.org/, last accessed: June 30th,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <collab>Public Health England</collab>
          ,
          <source>COVID-19 variants: genomically confirmed case numbers</source>
          ,
          <year>2022</year>
          . URL: https://www.gov.uk/government/publications/covid-19-variants-genomically-confirmed-case-numbers, last accessed: June 30th,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <collab>The COVID-19 Genomics UK (COG-UK) consortium</collab>
          ,
          <article-title>An integrated national scale SARS-CoV-2 genomic surveillance network</article-title>
          ,
          <source>The Lancet. Microbe</source>
          <volume>1</volume>
          (
          <year>2020</year>
          )
          <fpage>e99</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <collab>European Centre for Disease Prevention and Control</collab>
          ,
          <source>SARS-CoV-2 variants of concern</source>
          ,
          <year>2022</year>
          . URL: https://www.ecdc.europa.eu/en/covid-19/variants-concern, last accessed: June 30th,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>