KARaML: Integrating Knowledge-Based and Machine Learning Approaches to Solve the Winograd Schema Challenge

Suk Joon Hong 1, Brandon Bennett 2, Judith Clymo 2 and Lucía Gómez Álvarez 3
1 InfoMining Co., South Korea
2 University of Leeds, UK
3 TU Dresden, Germany

Abstract
The Winograd Schema Challenge (WSC) is a commonsense reasoning task introduced as an alternative to the Turing Test. While machine learning approaches using language models show high performance on the original WSC data set, their performance degrades when tested on larger data sets. Moreover, they do not provide an interpretable explanation for their answers. To address these limitations, we present KARaML, a novel asymmetric method for integrating knowledge-based and machine learning approaches to tackle the WSC. A central idea in our work is that semantic roles are key for the high-level commonsense reasoning involved in the WSC. We extract semantic roles using a knowledge-based reasoning system. For this, we use relational representations of natural language sentences and define high-level patterns encoded in Answer Set Programming to identify relationships between entities based on their semantic roles. We then use the BERT language model to find the semantic role that best matches the pronoun. BERT performs better at this task than on the general WSC. We apply our ensemble method to a restricted domain of the large WSC data set, WinoGrande, and demonstrate that it achieves better performance than a state-of-the-art pure machine learning approach.

Keywords
Winograd Schema Challenge, Knowledge Representation, Machine Learning, Semantic Roles, Natural Language Understanding, Answer Set Programming, BERT

1. Introduction
The Winograd Schema Challenge (WSC) is a commonsense reasoning test proposed in [1] to demonstrate whether a machine is "capable of producing behaviour that we would say required thought in people". The task of the WSC is to resolve which noun a pronoun refers to in a given sentence. Winograd schema (WS) examples are typically written in pairs (which we call Winograd schema pairs). These differ in only a few words, called the special and the alternate words. Two candidate nouns are given alongside each schema as possible referents of the target pronoun (the same candidates for each schema in the pair), and the pronoun must be resolved in opposite ways depending on which of the special or alternate words was used. The use of schema pairs is intended to ensure that syntactic clues cannot help in finding the referent of the pronoun. Instead, this must be done by using world knowledge and reasoning.
The original set of WSs, known as WSC273 [1], contains only 273 instances, but more recently a dataset of around 44,000 examples following the same style was developed through crowd-sourcing [2]. An example from WSC273 is given below; large and small are the special and the alternate words respectively:
• The trophy doesn't fit in the brown suitcase because it is too large. the trophy (answer) / the suitcase.
• The trophy doesn't fit in the brown suitcase because it is too small. the trophy / the suitcase (answer).

Although the instigators of the WSC had originally envisaged that formalised theories of commonsense knowledge would be required to address the challenge [1], it has been tackled by a wide variety of approaches, and it has highlighted some serious difficulties that arise for Knowledge Representation (KR) approaches when applied to unconstrained, general problems of natural language understanding. By contrast, language models based on Machine Learning (ML) have achieved relatively good performance on WSC test sets, although they do not employ any explicit representation of the detailed knowledge that seems to be involved in resolving WSC problems.

Despite this success, the language model approaches have some weaknesses. Current language model methods are brittle, in that results are sensitive to small changes in the way a problem is expressed that are irrelevant to its solution. Moreover, language model approaches to the WSC so far do not provide any justification for the answers they give. As the WSC is supposed to test 'understanding', this is a significant limitation.

Our current work explores a combined KR and ML approach to the WSC. We call our system KARaML, standing for Knowledge Assimilation based on Roles and Machine Learning, and use the semantic roles of the agents participating in the described situation to resolve the WSC problem. We use the semantic parser K-Parser [3] to extract a relational semantic representation of the schema, and ASP-based rules to determine the semantic roles of the candidate nouns. We then use the language model BERT [4] to match the pronoun to one of the extracted semantic roles. This allows us to leverage the implicit knowledge in the language model and so avoid manually building or attempting to explicitly learn a large knowledge base. By using the language model in a more focused way, rather than asking it to solve the whole task, our system is able to avoid some of the fragility commonly displayed by language models, and can provide an explanation alongside its decision. We have tested our approach on a subset of the large WSC data set, WinoGrande [2], and found that it performs better than pure ML methods using BERT [4, 5].

2. Related Work
Winograd schemas have been tackled by both KR and ML approaches. A typical KR approach would aim to resolve a WS by first translating the textual form of the schema into a logical representation, then combining this with additional axiomatised background knowledge and using rules of inference to deduce the reference of the pronoun. Early work on AI systems for natural language understanding by Hobbs [6] proposed formalised principles of coherence that can account for co-references in many cases. However, he noted that in some cases, establishing the reference of a pronoun also requires detailed background knowledge.
Indeed, the solution of most WS examples appears to involve knowledge concerning particular physical and/or social situations and understanding of vocabulary terms, as well as general principles of communication and inference. Sophisticated formal frameworks such as Segmented Discourse Representation Theory [7] have been developed in order to explain the logic underlying coherence and co-reference. However, the complexity of such theories has been an obstacle to their implementation in practical applications. Kehler et al. [8] and subsequently Bennett [9] gave formal analyses that account for certain WS cases. Schüller [10] presented a general method based on relevance theory and knowledge graphs. But the level of detail required to model knowledge relevant to specific cases suggests that extending these kinds of approaches to incorporate sufficiently comprehensive knowledge to give general coverage of WS problems would be an enormous task.

Bailey et al. [11] proposed a 'correlation calculus', which uses first-order logic with a novel correlation connective, to resolve WSs. This offers the prospect of a more general form of KR-based solution, in which the complex types of correlation involved in solving the WSC might be inferred from simpler assumptions; however, it would still require large numbers of basic correlations to be represented in order to cover the huge variety of possible WS problems.

A possible way to make KR approaches more effective for particular problem types may be to focus on aspects of semantics that are especially salient for those problems. We believe that the notion of 'semantic role' is such an aspect, and that it is often decisive in establishing co-reference and hence in solving WS problems. Semantic Role Labelling (SRL) is considered to be a significant computational task for natural language understanding and can be carried out with high accuracy by some existing systems (such as SENNA [12]); a method of using semantic roles for co-reference resolution is described in [13]. In NLP, semantic roles are primarily defined in terms of the linkage of noun phrases to verbs (e.g. as 'subject', 'object' etc.). However, in the current paper we advocate a more general idea of semantic role that is held in relation to an activity (e.g. helping, needing help) and is not strictly tied to particular verbs and grammar. This idea of semantic role is akin to that adopted in Frame Semantics [14].

Many systems have been developed that can translate from natural language text into some form of logical representation [15, 16]. This 'semantic parsing' task is extremely challenging and the results obtained are unreliable, especially for complex sentences such as those occurring in WSs. Nevertheless, the extracted representations do identify entities, properties, relationships and logical structures that can be processed by KR-based reasoning systems. Sharma [17] developed a semantic parser, K-Parser [3], to transform schemas into relational representations, and used these to resolve WSs. This method enhanced the extracted semantic content using rules formulated in Answer Set Programming (ASP) [18]. The initial use of this method for solving WS problems also required hand-crafted representation of relevant background knowledge. The method achieves an accuracy of around 80% on the original WSC273 set when relational representations of both schemas and background knowledge principles are manually created.
To address the problem of encoding sufficient knowledge to cover a wide class of commonsense reasoning problems, various automatic knowledge extraction techniques have been employed. Sharma was able to achieve a more automated solution by extracting background knowledge using Google search to obtain identity rules enabling pronoun resolution [17]. However, fewer than half of the required rules could be obtained by this automated method.

In our previous work [19] we built on Sharma's method [20]. We used K-Parser with additional hand-coded ASP rules to extract semantic roles of the candidate nouns, similar to the pattern-based semantic relation extraction of Al-yahya et al. [21]. Further logical rules were then used to determine the pronoun's referent based on its semantic role and those of the two candidates.

Regarding ML approaches, Rahman and Ng [22] obtained promising results using an SVM ranker based on a variety of linguistic features, both semantic and syntactic. More recently, approaches based on neural network language models have made significant progress on the WSC task. Using the BERT [4] language model, high accuracy for resolving WSs has been demonstrated [23, 5, 2], with up to 90% accuracy reported for WSC273 [2]. Using the BERT variant RoBERTa, which has been found to perform better on many tasks, similarly high accuracy has been obtained [24, 2].

However, it is too early to claim that machines have reached human-like ability to resolve Winograd schemas. WSC273 is a very small test set, and accuracy has been found to decrease by around 10% or more on larger WSC-like data sets. Consequently, some researchers have suggested that the strong performance on WSC273 may overstate the capability of neural language models to carry out commonsense reasoning tasks [2, 25]. Tests that focus on cases involving compositional logical structure indicate that BERT does not work well in relation to function words such as negation [26]. BERT also seems to lack robustness with respect to irrelevant small variations: simply changing proper names can cause it to give incorrect answers to some WSs which were previously answered correctly [23, 19]. This suggests that language models may work by recognising features that are, at least in some cases, only indirectly connected with genuine understanding of WS problems.

This also relates to issues of transparency and explainability. Humans would expect answers to be based on general principles, whereas current methods based on language models do not provide any meaningful explanation for their answers. Whereas humans appear to employ both commonsense reasoning and intuition [27], neural language models seem to work in a way that is more similar to intuition than to logical reasoning.

In this paper we attempt to develop a new way to combine KR and ML in order to address the WSC and contribute to exploration of the general problem of natural language understanding.

3. Winograd Schema Structure and Semantic Roles
In this section we examine the syntactic and semantic structure of Winograd Schema problems in order to motivate and explain our resolution method.

3.1. Schema Structure
A WS is a sequence of tokens in which three (non-overlapping) sub-sequences are indicated: two words or phrases referring to 'candidate' entities and one pronoun (normally a single word).
Thus, it is an expression of the form 𝜓(𝑎, 𝑏, 𝑝), whose meaning constrains the references of the candidate terms 𝑎 and 𝑏 and the pronoun 𝑝. For the expression to be considered satisfactory as a WS, any reasonable human being should either infer from it that 𝑝 refers to the same thing as 𝑎, or infer from it that 𝑝 refers to the same thing as 𝑏.

In nearly all WS examples, there is a clear division into two propositional components, with the first component describing a situation involving both candidates, 𝑎 and 𝑏, and the second giving information involving 𝑝. Hence, a WS normally has the structure 𝜑(𝑎, 𝑏) # 𝜋(𝑝), where '#' represents the type of connection between the two parts. In many cases the two parts are separate sentences. For these cases we can treat the connective as logical conjunction (although temporal sequence may also be implied). In other cases, the halves may be connected with words such as 'and', 'because', 'although', 'since' etc. The particular connective is relevant to pronoun resolution.¹ So the pronoun resolution problem has the following form:

    ( (𝜑(𝑎, 𝑏) # 𝜋(𝑝)) ∧ 𝑝 = (𝑎|𝑏) )  ⇝  (𝑝 = 𝜅),    (1)

where ⇝ represents some kind of rational inference relation and 𝜅 is either 𝑎 or 𝑏. The presupposition that 𝑝 must be identified with exactly one of the two candidates is represented by the notation 𝑝 = (𝑎|𝑏).

Given that we need to infer an identity between 𝑝 and either 𝑎 or 𝑏, there must be some aspect of the content of 𝜑(𝑎, 𝑏) which can be linked to the content of 𝜋(𝑝) in such a way that either 𝑎 or 𝑏 can be distinguished as the more likely co-referent of 𝑝. One way to approach this would be to tease out from 𝜑 what is said individually about each of the candidates and try to link that to 𝜋. Indeed, by means of semantic intuitions or by using an automated semantic parser, a given proposition 𝜑(𝑎, 𝑏) can typically be analysed into a combination of simpler components, 𝛼(𝑎) ∧ 𝛽(𝑏) ∧ 𝜌(𝑎, 𝑏) ∧ 𝛾, where 𝛼 and 𝛽 represent conditions that are individually ascribed to candidates 𝑎 and 𝑏 respectively, 𝜌 represents whatever information is asserted about the relationship between 𝑎 and 𝑏, and 𝛾 is any additional information that does not directly involve 𝑎 or 𝑏. More specifically, each of the components 𝛼, 𝛽, 𝜌, 𝛾 may correspond to a (possibly empty) set of predicates in the semantic analysis.

In ordinary natural language, there are many examples where the reference of the pronoun can be resolved just by considering the individual properties of potential candidates (𝛼 and 𝛽). Levesque et al. [1] consider the example 'The women stopped taking the pills because they were [pregnant/carcinogenic]'. However, although this seems to be a typical use of a pronoun, it is not considered to be a good schema. Levesque et al. explicitly say that this is a poor example, since correct resolution can be determined just by considering the types of the candidates ('women' and 'pills') and the types of entity of which the attributes 'pregnant' and 'carcinogenic' could be predicated. Such cases are considered too easy to demonstrate that intelligence is required to resolve them. They suggest that suitably difficult WS examples must require understanding of the situation. This would typically involve the relationship between the candidates or some property that is not merely a simple type attribute of one of the candidates.

¹ For instance, the WSC273 set presented by Levesque et al. [1] includes the example 'Pete envies Martin [because/although] he is successful', where swapping 'because' with 'although' changes the pronoun reference. This case was also considered by [11], which suggests that, whereas 'because' implies positive correlation, 'although' implies negative correlation.
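To make this structure concrete, the trophy/suitcase schema from Section 1 can be written in the notation introduced above. The decomposition shown is our own informal illustration (the predicate names are just labels, not the output of any parser):

    𝜑(𝑎, 𝑏) = not_fit_in(trophy, suitcase),    # = because,    𝜋(𝑝) = too_large(it)  [or too_small(it)]

Instantiating problem (1), with the special word 'large' a rational reader infers 𝑝 = trophy (i.e. 𝜅 = 𝑎), whereas with the alternate word 'small' the same structure yields 𝑝 = suitcase (𝜅 = 𝑏); the switch is driven entirely by how 𝜋(𝑝) relates to what 𝜑 says about 𝑎 and 𝑏.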
3.2. Semantic Role Extraction
In the majority of cases we have examined, inferences based on the individual properties 𝛼(𝑎) and 𝛽(𝑏) are not enough. In order to resolve the pronoun, one needs to extract further attributes of 𝑎 and 𝑏 from the roles they play in the relation 𝜌(𝑎, 𝑏). By introducing 𝑠 to stand for the situation described by the relationship 𝜌(𝑎, 𝑏) (i.e. we reify the relationship), we can conceptually unpack the relation into a conjunction 𝜌1(𝑎, 𝑠) ∧ 𝜌2(𝑏, 𝑠) ∧ 𝜔(𝑠), representing the semantic roles 𝜌1 and 𝜌2 of the participants in relation to 𝑠, together with any other information (𝜔) attributed to the situation. Furthermore, if we are concerned with distinguishing 𝑎 and 𝑏 in terms of semantic roles that occur in some particular types of situation (e.g. situations where one person helps another), then the relevant role information can be represented by unary role properties, 𝜌1(𝑎) and 𝜌2(𝑏) (e.g. 𝑎 gives help, and 𝑏 receives help).

In fact, existing semantic parsers (such as K-Parser and SENNA) already assign role attributes to referring constituents of sentences. However, these tend to be lacking in specific semantic content and are determined largely by syntactic features of their occurrence within the text. For example, a referential word or phrase might be labelled as the 'agent' or 'object' of a verb. But, like entity types, such basic role types can only be used to resolve pronouns in 'easy' cases. In more complex cases, pronoun resolution requires understanding the way in which entities participate in a situation; and this requires specific knowledge of the situation and the roles it involves. Thus, we suggest that pronoun resolution in WSC problems requires an additional role extraction mechanism (RE) going beyond an initial semantic parsing (SP) stage. Hence, the semantic role extraction process can be represented by the following pattern:

    𝜓(𝑎, 𝑏, 𝑝) ≡ 𝜑(𝑎, 𝑏) # 𝜋(𝑝)   =⇒[SP]   𝛼(𝑎) ∧ 𝛽(𝑏) ∧ 𝜌(𝑎, 𝑏) ∧ 𝛾   =⇒[RE]   𝜌1(𝑎) ∧ 𝜌2(𝑏)    (2)

To illustrate our analysis, we consider the sentence "Maria is struggling with her exams and asks for help from Rebecca, because she is already successful." Semantic parsing will produce a formal representation similar to the following:

    ( struggling_with_exams(Maria) ∧ ask_help(Maria, Rebecca) ) because successful(she),

from which we want to infer 𝑝 = Rebecca. In this example, 𝛼(𝑎) corresponds to the unary property struggling_with_exams(Maria), 𝜌(𝑎, 𝑏) is the relation ask_help(Maria, Rebecca), the connective # is because and 𝜋(𝑝) is successful(she). We have not specified any individual condition 𝛽 predicated of Rebecca, although if we identified it as a proper name of a person (e.g. by using a named-entity recognition system [28]) we could add the individual condition 'Person(Rebecca)'. Role extraction rules, as explained later in the paper, can then be employed to infer the semantic roles of the participants.

3.3. Resolving the Pronoun
The previous subsection examined the semantic structure of WSs and motivated the extraction of semantic role attributes of the candidates from the first part of the schema. We now explain how this can be used to identify the reference of the pronoun in the second part of the schema.
Our general idea is related to the approach of Bailey et al. [11], who proposed an extension of first-order logic with a novel propositional connective. The statement 𝐴 ⊕ 𝐵 means that the truth of 𝐴 is positively correlated with the truth of 𝐵, in the sense that if a rational agent becomes aware of the truth of either of the propositions they will consider the other proposition more plausible than they would have in the absence of that information. The paper presents a proof system to capture the logic of the '⊕' operator and suggests that it can be used to derive complex correlations from basic correlation assumptions and beliefs. These derived correlations can then be used for pronoun resolution. Assuming that what is said about the entity via the pronoun reference is positively correlated with what is said about it in the candidate phrase, we should be able to infer either 𝜑(𝑎, 𝑏) ⊕ 𝜋(𝑎) or 𝜑(𝑎, 𝑏) ⊕ 𝜋(𝑏) when given a schema 𝜑(𝑎, 𝑏) # 𝜋(𝑝).

The correlation calculus is proved to be sound with respect to a statistical semantics, and, although the specification of the calculus predates the successful application of language models to the WSC, it seems that it would be well suited to interfacing with a language model. Instead of requiring correlations to be determined by axioms and logical reasoning, one could potentially evaluate or compare degrees of correlation by means of language model responses.

In our setting the relevant notion of correlation is a little different. We aim to find a correlation between the role description of one of the candidates and the description involving the pronoun. Also, we look for a preferential rather than an absolute correlation. Thus, we wish to determine which of the semantic roles of the candidates is more likely to apply to the pronoun, given what is said regarding the pronoun. Hence, given the extracted roles 𝜌1(𝑎) and 𝜌2(𝑏) and the assertion 𝜋(𝑝) regarding the pronoun, then if 𝑝 denotes 𝑎 we would expect the following inequality of relative probabilities:

    𝑃( 𝜌1(𝑝) | 𝜋(𝑝) )  >  𝑃( 𝜌2(𝑝) | 𝜋(𝑝) )    (3)

Note that what is said in the proposition 𝜋(𝑝) does not need to explicitly describe 𝑝 in terms of either of the roles 𝜌1 or 𝜌2; it only needs to provide some reason to expect that one of the potential facts 𝜌1(𝑝) or 𝜌2(𝑝) is more likely than the other.

4. Our Approach: KARaML
In this section we introduce our system KARaML. We use the semantic roles of the agents to resolve the WSC, following the analysis in Section 3. Figure 1 illustrates the pipeline of our method to resolve WSs. KARaML uses a combination of KR and ML methods to derive semantic roles of the candidates and pronouns by defining domain-specific background knowledge relating to these high-level semantic roles. In the figure, the element labelled 'Semantic parsing & KR role derivation' relates to Section 3.2 above, and the element labelled 'LM semantic role matching' to Section 3.3. Finally, the 'Semantic role based reasoning' component uses the previously derived knowledge to infer the solution. If our combined system does not have suitable rules defined for resolving a schema, we simply revert to using the language model alone.

Other important features of our architecture are the asymmetric combination of KR and ML and the selection of conceptually related sentences that follow target patterns. We address these features in detail in the coming subsections.
Figure 1: KARaML System Flow. A WS first passes through the domain filter ("is the WS in a domain, e.g. 'helping', 'asking'?"); in-domain schemas are handled by our knowledge-based reasoning method (semantic parsing & KR role derivation, then the pattern filter, then LM semantic role matching and semantic role based reasoning), while schemas rejected by either filter are resolved by the LM alone.

After addressing in detail the architecture of KARaML (in Section 5), we give results (in Section 6) which show that, where our combined reasoning method is applicable, we achieve better performance than using a language model alone.

A major difference from our previous work [19] is that we no longer need detailed axiomatisations of the domain's background knowledge to infer the semantic role of the pronoun, which presented a challenge to the scalability of the method. Instead, we will show that a minimal set of high-level rules for the semantic roles, coupled with the usage of a language model, is enough to obtain significant results.

4.1. An Asymmetric Combination of KR and ML
A notable feature of our approach is that we apply KR and ML methods asymmetrically with respect to different parts of a WS. Specifically, the KR mode of interpretation is focused on the part of the WS that describes the candidates, whereas a neural language model is used to match a correlated semantic role for the pronoun.

A formal representation of a sentence has a rigid structure composed from a specific symbolic vocabulary. This means that if we have KR representations of two related pieces of information (such as two successive sentences or clauses within a sentence) we can only draw inferences from their combined content if we have some way of aligning them. This requires both combining them in terms of a formal syntax and also making explicit all significant semantic relationships between the vocabulary of the two parts. When dealing with representations extracted from natural language, this is a huge challenge. Not only are there an unbounded number of possible situations that might be described, but even one situation could be described in a wide variety of ways, using a wide variety of vocabulary terms. Hence, piecing together KR representations extracted from different parts of a natural language text is extremely difficult, even when connections are very clear to our intuitive understanding. By contrast, ML techniques are more malleable, in that they do not require exact matching in order to connect one piece to another, so they can provide a mechanism for flexibly assimilating or adjoining new information to an existing KR representation. Figure 2 illustrates the potential advantage of this type of asymmetric combination.

Figure 2: Knowledge Assimilation: KR+KR vs. KR+ML, showing the advantage of ML's flexibility.

It may still seem puzzling why we always focus the use of KR on the left side of the WS and reserve ML for interpreting the role of the right side. This is because our KR analysis is designed to extract roles of the candidates in WSs and, in the majority of examples, these are described primarily in the first part of the schema. In general, pronouns nearly always occur after the noun or noun-phrase with which they co-refer. In most WS examples the pronoun occurs in a following sentence or clause that does not usually make explicit the role of the pronoun referent in a way that can be directly linked to the roles of the candidates.
Nevertheless, one might intuitively expect that there is a statistical correlation between the roles of the candidates in the first part and what is then said using the pronoun in the second part. Indeed, our results indicate that ML techniques can model this correlation.

4.2. Identifying Conceptually Related Sentences
In general, a WS may involve any vocabulary or domain of knowledge. This is problematic for KR approaches, which require detailed logical modelling of knowledge and semantics. We use keywords to identify restricted domains that are more manageable. Our aim is to provide a simple method for selecting related schemas for which high-level background knowledge rules can be defined. A small number of logical rules should be sufficient to explain a significant proportion of schemas in a semantic domain. In particular, we present in this paper a study of schemas obtained by identifying instances containing the keyword 'help' (or 'helping', 'helped', 'helpful' etc.). We show that the same principles extend to a larger set that also includes schemas containing the keyword 'ask' (or 'asking', 'asked', etc.), for which only six additional rules needed to be established. This shows that domains defined in this way are flexible and able to encompass a variety of schemas. We have previously presented work on schemas containing the 'thanking' keyword [19]. Although our current work focuses on a few hand-selected domains, it demonstrates a general approach which could be extended to cover a larger proportion of WinoGrande schemas.

In our system, WSs are first filtered for use of keywords and then compared with high-level patterns. It should be expected that there will be some overlap between domains, where a sentence references multiple concepts. If a pattern is matched, this indicates that we have suitable rules defined to understand this sentence. In the case that a sentence matches multiple patterns, we propose using the correlation between candidate and pronoun roles which is identified as most significant by the language model (i.e. lowest loss). A sentence may use knowledge from a domain without containing the relevant keyword. Provided that the sentences containing the keywords are representative of the domain and allow us to generate appropriate rules, this is not a significant limitation. We anticipate that our methods for identifying semantic roles may be extended to sentences which do not contain the relevant keyword, allowing more sentences to be resolved using the existing rules.

5. KARaML System Architecture
We now describe how we have implemented each component of KARaML from Figure 1. We first tackle the domain filter, and subsequently we introduce an ASP pattern filter based on K-Parser output. The pattern filter selects WSs that match certain semantic roles, for which domain-specific background knowledge has been encoded in ASP. Next, BERT² is used to determine which of a pair of contrasting semantic roles the pronoun has a stronger correlation to. By using the pattern filter together with the background knowledge we can infer the high-level (and not necessarily explicit) semantic roles of 𝑎 and 𝑏 that BERT will choose from. Finally, the derived semantic roles for the candidates and the best-matching role for the pronoun are used to infer the final answer.
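Before describing the individual components, the control flow of Figure 1 can be summarised as a short Python sketch. This is purely illustrative: the component functions and the schema object are hypothetical placeholders standing for the components detailed in Sections 5.1–5.5, not our actual implementation.

    # Illustrative sketch of the KARaML control flow (Figure 1).
    # The component functions are hypothetical placeholders for Sections 5.1-5.5.

    DOMAIN_KEYWORDS = {
        "help": ("help", "helps", "helped", "helping", "helpful"),
        "ask": ("ask", "asks", "asked", "asking"),
    }

    def match_domain(ws_text):
        # Section 5.1: keyword-based domain filter
        words = [w.strip(".,!?").lower() for w in ws_text.split()]
        for domain, keywords in DOMAIN_KEYWORDS.items():
            if any(w in keywords for w in words):
                return domain
        return None  # out-of-domain

    def resolve_schema(ws):
        domain = match_domain(ws.text)
        if domain is None:
            return bert_resolve(ws)                    # out-of-domain: language model alone
        facts = kparser_parse(ws.text)                 # Section 5.2: relational representation
        facts += derive_semantic_roles(facts, domain)  # Section 5.2: domain-specific ASP rules
        pattern = match_pattern(facts, domain)         # Section 5.3: ASP pattern filter
        if pattern is None:
            return bert_resolve(ws)                    # no suitable rules: language model alone
        pronoun_role = bert_match_role(ws, pattern)    # Section 5.4: LM matches the pronoun to a role
        return reason_with_roles(facts, pronoun_role)  # Section 5.5: interpretable ASP reasoning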
In what follows we give a detailed explanation of our system architecture, using a sample schema from WinoGrande as a running example:

    Maria helped Elena cope with the newly diagnosed autism because she was inexperienced with the disorder. Maria / Elena (answer)

In this case, Maria (𝑎) performs the semantic role of helper and Elena (𝑏) performs the semantic role of being_helped. Our proposal is that the correlation between the semantic roles of the candidates and the information we have about the pronoun is a good indicator for the pronoun resolution. In the example, we note that a person being inexperienced in a situation is more likely to explain ("because") needing help than giving help. While the role Maria : helper can be derived with a relatively simple KR system, deriving that an inexperienced person is in need of help (she : needing_help) previously required further manually defined rules [19]. In our current work, this task is given to a language model, which can make use of its implicit understanding of the correlation between inexperience and need.

5.1. Domain Filter
Our pipeline begins with a domain filter that identifies the schemas that may be associated with a domain. A WS is passed into the filter, which determines whether it belongs to any of the pre-defined domains by using keywords. Our running example will be categorised using the "help" keyword. If a schema does not belong to any pre-defined domains, it is categorised as out-of-domain and will be resolved by BERT.

In our experiments, we begin by narrowing our attention to a domain centered around the keyword "help", which contains 1356 schemas. Subsequently, we target schemas containing the keyword "ask", amounting to 1753, which in fact respond to the same underlying patterns, thus giving rise to a more general domain. Indeed, these sets of schemas intersect, which gives further evidence that they share a common underlying semantic structure.

² Specifically, we use BERT_WIKI_WSCR from [5] throughout. This is an instance of BERT which has been additionally fine-tuned for the WSC.

5.2. Parsing and Deriving Semantic Roles
The schemas that have been assigned to domains are parsed by K-Parser, which produces a relational semantic representation of the input text containing qualitative information about the words in the text (e.g. their conceptual classes and the relationships between predications and their participants, among others). Then, high-level semantic roles are derived using the output of K-Parser together with our domain-specific background knowledge rules. Let us look at an excerpt from the parsed output of the sample schema:

    has_s( helped_2, agent, maria_1 ).
    has_s( helped_2, instance_of, help ).
    has_s( she_11, trait, inexperienced_13 ).
    has_s( inexperienced_13, instance_of, inexperienced ).
    has_s( cope_4, caused_by, was_12 ).

The output from K-Parser provides us with an initial representation of a given schema, which is specified by means of predicates of the form has_s( node1, relation, node2 ). Subsequently, the domain-specific rules are used to expand the output with relevant background knowledge for the domain, which is mostly focused on the derivation of high-level semantic roles. Two simple examples of such domain-specific rules are as follows:

    has_s( X, semantic_role, helper ) :-
        has_s( Action, agent, X ),
        has_s( Action, instance_of, help ).

    has_s( X, semantic_role, being_asked ) :-
        has_s( Ask, recipient, X ),
        has_s( Ask, instance_of, ask ).

Using the first rule we can straightforwardly derive has_s( maria_1, semantic_role, helper ), as desired for our running example.
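As an illustration of how these facts and rules interact, the following minimal sketch runs the K-Parser excerpt together with the first domain rule through an ASP solver. It assumes the clingo solver and its Python API; our method does not prescribe a particular solver, so this is only one possible realisation.

    # Minimal sketch: deriving Maria's 'helper' role from the K-Parser facts
    # and the first domain-specific rule above (assumes the clingo Python package).
    import clingo

    program = """
    has_s(helped_2, agent, maria_1).
    has_s(helped_2, instance_of, help).

    has_s(X, semantic_role, helper) :-
        has_s(Action, agent, X),
        has_s(Action, instance_of, help).

    #show has_s/3.
    """

    ctl = clingo.Control()
    ctl.add("base", [], program)
    ctl.ground([("base", [])])
    ctl.solve(on_model=lambda m: print(m))
    # The printed model includes the derived atom has_s(maria_1,semantic_role,helper).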
5.3. The Pattern Filter
The parsed results together with the derived semantic roles of the schema are used as inputs to the pattern filter. If a certain pattern is found by the pattern filter, that schema is to be resolved by our combined framework. If not, BERT alone is used for resolving the WS. Our pattern filter exploits the generic structure given in formula (1) to select schemas that follow recognised patterns, using the high-level semantic roles previously inferred in Section 5.2. Patterns in our system are encoded in ASP and will typically fix semantic roles for one or more of the agents 𝑎 and 𝑏, and possibly impose additional restrictions on other elements of the schema, such as forcing # to be "because" and the pronoun to be a person (rather than an inanimate object).

In this experiment we use a pattern to identify schemas where the roles of "helper" and "being helped" are likely to be relevant for pronoun resolution. The filter checks whether the semantic properties and relations extracted satisfy the following conditions: at least one of the candidate expressions has one of these roles; and the pronoun refers to a person (is "he" or "she") which is the agent of a verb in the sentence that has been identified as playing an explanatory role in the situation. The relevant pattern is defined as follows:

    help_pattern :-
        is_candidate( C ),
        1 { has_s( C, semantic_role, helper );
            has_s( C, semantic_role, being_helped ) },
        pronoun( P ),
        has_s( Verb, agent, P ),
        1 { has_s( P, instance_of, he );
            has_s( P, instance_of, she ) },
        has_s( _, caused_by, Verb ).

From the 1356 schemas containing the keyword "help", 207 satisfy this pattern, and from the 1753 schemas containing the keyword "ask", 456 schemas match a similar pattern.

5.4. Using BERT to Identify Semantic Roles
In this phase, schemas that meet the pattern are given to BERT to extract an implicit semantic role, where the possible roles are determined by the pattern that the schema matched. Previous work (e.g. [5, 2]) uses language models such as BERT to evaluate the probabilities of each of the candidates occurring as a replacement of the pronoun. So to resolve 𝑝 in 𝜓(𝑎, 𝑏, 𝑝) one would replace 𝑝 with [MASK] and compare the probabilities for [MASK] to be 𝑎 or 𝑏:

    𝑃( [MASK] = 𝑎 | 𝜓(𝑎, 𝑏, [MASK]) )  >  𝑃( [MASK] = 𝑏 | 𝜓(𝑎, 𝑏, [MASK]) )

In contrast, we focus on deriving which of the candidate semantic roles is most correlated with the information that is given about the pronoun. To derive this using BERT, we extract the textual fragment 𝜋(𝑝) and concatenate it (following a period) with a basic sentence linking the pronoun with the masked semantic role. In our experiment, we added the sentence "he/she would [MASK] help", where the [MASK] can be either give or need. Now BERT compares the probabilities for [MASK] to be give or need, given the context 𝜋(𝑝) and the additional linking sentence (𝑝 would [MASK] help):

    𝑃( [MASK] = give | 𝜋(𝑝). 𝑝 would [MASK] help )  >  𝑃( [MASK] = need | 𝜋(𝑝). 𝑝 would [MASK] help )

We interpret BERT's output as indicating the semantic roles "giving_help" and "needing_help".
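For concreteness, the masked comparison can be sketched with the Hugging Face transformers library as follows. This is only an illustrative sketch: it loads a generic pre-trained BERT checkpoint rather than the fine-tuned BERT_WIKI_WSCR model used in our experiments, and it uses the prompt from our running example.

    # Sketch of the masked-role comparison of Section 5.4 (generic pre-trained BERT,
    # not the fine-tuned BERT_WIKI_WSCR checkpoint used in our experiments).
    import torch
    from transformers import BertForMaskedLM, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
    model = BertForMaskedLM.from_pretrained("bert-large-uncased")
    model.eval()

    # pi(p) followed by the linking sentence "p would [MASK] help"
    text = "she was inexperienced with the disorder. she would [MASK] help."
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    # Probabilities of the two candidate fillers at the [MASK] position
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    probs = logits[0, mask_pos].softmax(dim=-1).squeeze(0)
    p_give = probs[tokenizer.convert_tokens_to_ids("give")]
    p_need = probs[tokenizer.convert_tokens_to_ids("need")]

    # The more probable filler indicates the implicit semantic role of the pronoun
    pronoun_role = "giving_help" if p_give > p_need else "needing_help"
    print(pronoun_role, float(p_give), float(p_need))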
Below we can see the contrast between the input to BERT as part of our knowledge-based strategy and a basic WSC resolution relying only on BERT:
• "She was inexperienced with the disorder. she would [MASK] help." (Our usage of BERT to derive a semantic role.)
• "Maria helped Elena cope with the newly diagnosed autism because [MASK] was inexperienced with the disorder." (Full WSC resolution using BERT.)

5.5. Reasoning Using Semantic Roles
This is the last phase of our system. For each schema that was matched to a pattern, the semantic roles derived in Sections 5.2 and 5.4 are used as inputs. With these inputs, some background knowledge rules are needed to derive the referent of the pronoun. The background knowledge rules we use are:
• IF a person 𝑥 {helps / is asked by} a person 𝑦 because the pronoun 𝑝 is giving help THEN 𝑝 refers to 𝑥.
• IF a person 𝑥 {is helped by / asks} a person 𝑦 because the pronoun 𝑝 is needing help THEN 𝑝 refers to 𝑥.
The encoded forms of the background knowledge rules are given below:

    answer(X) :-
        is_candidate(X),
        1 { has_s( X, helps, _ ); has_s( _, asks, X ) },
        pronoun(P),
        has_s( Verb, agent, P ),
        has_s( _, caused_by, Verb ),
        has_s( P, semantic_role, giving_help ).

    answer(X) :-
        is_candidate(X),
        1 { has_s( _, helps, X ); has_s( X, asks, _ ) },
        pronoun(P),
        has_s( Verb, agent, P ),
        has_s( _, caused_by, Verb ),
        has_s( P, semantic_role, needing_help ).

Using the domain-specific background knowledge rules and the previously derived semantic roles, we derive the answer for a WS. In our running example, statements expressing the implicit semantic role of the pronoun ("needing_help") from Section 5.4 and the semantic roles of the candidates from Section 5.2 are added to the ASP program. Then, the condition defined in the second background knowledge rule is satisfied, and thus we can derive the answer "Elena". Note that, although the implicit semantic role of the pronoun is extracted by an ML method, the reasoning used to resolve the schemas is based on interpretable rules. Hence, the rules used in resolving a schema provide an explanation of the answer.

6. Results
Table 1 shows the results of our method contrasted with two systems using BERT alone (BERT_LARGE [4] and BERT_WIKI_WSCR³ from [5]) on the 207 sentences including 'help' and the 457 sentences containing 'ask' that meet the patterns. Our method achieves accuracies of 81.64% and 75.93%, which are higher than the accuracies achieved by BERT by around 5% and 13% respectively. Moreover, for each answer from our method an explanation can be produced, in contrast to the mere quantification of the certainty of the choice given by BERT.

³ This refers to BERT which has been further fine-tuned for the WSC.

In our method, we use BERT to match a pronoun with an appropriate semantic role. We checked the accuracy of BERT on this task for the sentences including 'help'. BERT achieved 84.06%, which is higher than its accuracy in resolving Winograd schemas directly by around 8%. By integrating KR reasoning we not only increased the overall performance of the framework, but also made better use of an existing language model's ability.

Further strengthening our claim that BERT benefits from being given a small, focused task, we show that the accuracy for selecting the semantic role is affected by the exact prompt provided.
    Test                    Accuracy ('help')     Accuracy ('ask')
    T0. BERT_LARGE          57.97% (120/207)      51.86% (237/457)
    T1. BERT_WIKI_WSCR      76.33% (158/207)      63.02% (288/457)
    T2. Our system          81.64% (169/207)      75.93% (347/457)

Table 1: Results on the sentences containing 'help' and 'ask' that meet the patterns.

In our main experiments we gave BERT a short prompt based only on the pronoun part of the WS, and we later tested (for those containing 'help') the selection of semantic roles when a longer prompt containing the whole WS was given. The accuracy reduced from 84.06% to 63.29% when using the whole context rather than the pronoun part only. For example:
• "She was inexperienced with the disorder. She would [MASK] help." (Our usage of BERT to match a semantic role.)
• "Maria helped Elena cope with the newly diagnosed autism because she was inexperienced with the disorder. She would [MASK] help." (Additional context given.)
This finding also supports the view that it is often the semantic role, rather than other aspects of the semantic content, that is decisive in determining the reference of a pronoun.

Note that the accuracy of our method as a whole (81.64%) for the schemas including 'help' is only 2.42% lower than BERT's accuracy in matching semantic roles. So, provided the pronoun role is correctly identified, our KR reasoning is very accurate. The decreased accuracy of our method compared to the semantic role prediction from BERT is mainly due to the fact that some schemas were incorrectly parsed by K-Parser.

7. Conclusion
Our new method KARaML improves on the work of [19]. In our prior work it was necessary to explicitly define a large number of rules in order to match the semantic roles for the candidates and pronouns. Here we have significantly reduced the number of rules required by instead using a language model to establish the correlation between the description of the pronoun and the semantic roles of the candidates. In addition, we improve on the performance achieved by BERT alone and we are able to generate an explanation for the chosen answer.

Our current implementation can only be applied to the subset of Winograd schemas for which domain-specific rules have been defined. In future work, including more domains and patterns will increase the coverage of our system. We would also like to apply our method to other language understanding problems such as COPA [29]. As some parsing results from K-Parser were incorrect, we intend to investigate using other parsers such as SENNA. So far we have only made limited use of BERT to identify the likely semantic relationship of the pronoun to the candidate clause. However, the same method may be applied to identify other semantic relationships that could be exploited by a KR reasoner. Moreover, BERT could be replaced by other state-of-the-art language models such as GPT-3 [30].

More generally, our framework represents initial steps towards a progressive assimilation architecture for language understanding, where we use ML to successively combine new information into a KR representation that we have built up from prior information. This seems to provide a general way by which information expressed in natural language can be matched with predicates occurring in the formalised axioms of a KR system.

References
[1] H. Levesque, E. Davis, L. Morgenstern, The Winograd Schema Challenge, in: The 13th Int. Conf. on Principles of Knowledge Representation and Reasoning, Italy, 2012.
[2] K. Sakaguchi, R. L. Bras, C. Bhagavatula, Y. Choi, WinoGrande: An adversarial Winograd Schema Challenge at scale, in: AAAI-20, 2020.
[3] A. Sharma, N. H. Vo, S. Aditya, C. Baral, Identifying various kinds of event mentions in K-Parser output, in: Procs. of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation, Assoc. for Comp. Linguistics, 2015, pp. 82–88.
[4] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv:1810.04805 [cs.CL] (2018).
[5] V. Kocijan, A. M. Cretu, O. M. Camburu, Y. Yordanov, T. Lukasiewicz, A surprisingly robust trick for Winograd Schema Challenge, in: Procs. of the 57th Annual Meeting of the Assoc. for Comp. Linguistics, 2019, pp. 4837–4842.
[6] J. R. Hobbs, Coherence and coreference, Cognitive Science 3 (1979) 67–90.
[7] N. Asher, A. Lascarides, Logics of Conversation, Cambridge University Press, 2003.
[8] A. Kehler, L. Kertz, H. Rohde, J. L. Elman, Coherence and coreference revisited, Journal of Semantics 25 (2008) 1–44.
[9] B. Bennett, Semantic analysis of Winograd Schema no. 1, in: F. Neuhaus, B. Brodaric (Eds.), Procs. of the 12th Int. Conf. on Formal Ontology and Information Systems (FOIS 2021), Frontiers in Artificial Intelligence and Applications, IOS Press, 2021.
[10] P. Schüller, Tackling Winograd Schemas by formalizing relevance theory in knowledge graphs, in: Fourteenth Int. Conf. on the Principles of Knowledge Representation and Reasoning, 2014.
[11] D. Bailey, A. Harrison, Y. Lierler, V. Lifschitz, J. Michael, The Winograd Schema Challenge and reasoning about correlation, in: Logical Formalizations of Commonsense Reasoning, AAAI Spring Symposium, Stanford University, USA, 2015.
[12] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch, Journal of Machine Learning Research 12 (2011) 2493–2537.
[13] F. Kong, Y. Li, G. Zhou, Q. Zhu, P. Qian, Using semantic roles for coreference resolution, in: 2008 Int. Conf. on Advanced Language Processing and Web Information Technology, 2008, pp. 150–155.
[14] C. J. Fillmore, Frame semantics, in: Cognitive Linguistics: Basic Readings, De Gruyter Mouton, 2008, pp. 373–400.
[15] J. Bos, Wide-coverage semantic analysis with Boxer, in: Procs. of the 2008 Conf. on Semantics in Text Processing, STEP '08, Association for Computational Linguistics, USA, 2008, pp. 277–286.
[16] D. Das, D. Chen, A. F. Martins, N. Schneider, N. A. Smith, Frame-semantic parsing, Computational Linguistics 40 (2014) 9–56.
[17] A. Sharma, N. H. Vo, S. Aditya, C. Baral, Towards addressing the Winograd Schema Challenge - building and using a semantic parser and a knowledge hunting module, in: IJCAI 2015, 2015, pp. 1319–1325.
[18] M. Gelfond, V. Lifschitz, The stable model semantics for logic programming, in: Procs. of Int. Logic Programming Conf. and Symposium, 1988, pp. 1070–1080.
[19] S. J. Hong, B. Bennett, Tackling domain-specific Winograd Schemas with knowledge-based reasoning and machine learning, in: 3rd Conf. on Language, Data and Knowledge (LDK 2021), 2021.
[20] A. Sharma, Using Answer Set Programming for commonsense reasoning in the Winograd Schema Challenge, arXiv:1907.11112 [cs.AI] (2019).
[21] M. Al-yahya, L. Aldhubayi, S. Al Malak, A pattern-based approach to semantic relation extraction using a seed ontology, in: Procs. of the 2014 IEEE Int. Conf. on Semantic Computing (ICSC 2014), 2014, pp. 96–99.
[22] A. Rahman, V. Ng, Resolving complex cases of definite pronouns: The Winograd Schema Challenge, in: EMNLP-CoNLL, 2012.
[23] P. Trichelair, A. Emami, A. Trischler, K. Suleman, J. C. K. Cheung, How reasonable are common-sense reasoning tasks: A case-study on the Winograd Schema Challenge and SWAG, arXiv:1811.01778 [cs.LG] (2018).
[24] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv:1907.11692 (2019).
[25] A. Emami, K. Suleman, A. Trischler, J. C. K. Cheung, An analysis of dataset overlap on Winograd-style tasks, in: Procs. of the 28th Int. Conf. on Computational Linguistics, Int. Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 5855–5865.
[26] A. Ettinger, What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models, Transactions of the Association for Computational Linguistics 8 (2020) 34–48.
[27] D. Kahneman, Thinking, Fast and Slow, Penguin, London, 2012.
[28] D. Nadeau, S. Sekine, A survey of named entity recognition and classification, Lingvisticae Investigationes 30 (2007) 3–26.
[29] M. Roemmele, C. A. Bejan, A. S. Gordon, Choice of Plausible Alternatives: An evaluation of commonsense causal reasoning, in: AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning, Stanford University, 2011.
[30] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, arXiv:2005.14165 [cs.CL] (2020).