Logical-Ontological Approach to Coreference Resolution

         Elena Sidorova1, Natalya Garanina1, Irina Kononenko1, and Alexey Sery1
 1
     A.P. Ershov Institute of Informatics Systems SB RAS, 6, Acad. Lavrentjev pr., Novosibirsk
                                           630090, Russia
                {lsidorova,garanina,alexey.seryj}@iis.nsk.su
                                         irina_k@cn.ru


         Abstract. We suggest a logical-ontological approach to the coreference resolu-
         tion in the process of text analysis and information extraction. Our approach
         solves the problem of comparing objects found in the text – instances of ontolo-
         gy classes — using the evaluation of the similarity of attributes and relations of
         objects. In object comparison, we take into account the discourse factors associ-
         ated with the text and the extra-textual characteristics presented in the ontology
         of the subject domain. Particularly, we consider polyadic relations which may
         represent the situations found in the text (events, processes, actions). We pro-
         pose the ontological interpretation of polyadic relations as classes with single-
         valued object properties. For coreference resolution we use information about
         objects and their relations. We propose the corresponding measures for evaluat-
         ing the semantic similarity of the participant objects in the relations.

         Keywords: ontology population, text analysis, information extraction, corefer-
         ence resolution, referential factors, polyadic relations.


1        Introduction

Identification of referential relations in discourse is one of the most vital but difficult
for modeling problems of automatic text analysis. Reference is a relation between
some text unit (language expression) and non-linguistic object, which is called a ref-
erent. Correct interpretation of an utterance in the text under analysis involves identi-
fication of the object mention referent, i.e. reference resolution. There is a range of
language means to mention certain referent in the text, and a speaker (text author)
makes choice between two opposite types of language expressions: full noun phrases
(proper names and descriptions) and reduced means of reference (pronouns and ana-
phoric zeroes). Processing expressions of the first type requires direct comparison of
extracted objects. In the second case, an anaphoric relation of the reduced expression
to antecedent expression is detected with respect to a number of text-structure, syntac-
tic, semantic and pragmatic conditions.
    The anaphora and coreference resolution is an important task within the framework
of automatic discourse analysis: machine translation, text summarization and infor-
mation extraction. The latter can be performed by natural language processing in
which certain types of information must be recognized and extracted from the text
2


(named entities recognition and fact extraction tasks, in particular). We consider the
coreference resolution within the framework of information extraction for ontology
population. In this framework, an ontology is used to represent the results of infor-
mation extraction, and knowledge presented in the ontology helps to solve specific
information extraction tasks.
   Solving the task of automatic ontology population involves addition of information
to the ontology repository. In [1] we consider mentions of simple entities and propose
an approach to their coreference resolution in the process of information extraction
for ontology population. An ontology structure allows to take into account implicit
information in the input text due to detecting relations between objects. In this paper,
we suggest coreference resolution for new objects with a complex structure including
situations (events, actions, processes), which are represented by polyadic relations in
an ontology. These situations extend the domain knowledge used for solving corefer-
ence resolution problem. The new knowledge improves the quality of coreference
resolution.
   In Section 2 we give a brief review of modern trends in the coreference problem
definition and the present research. In Section 3 we describe our basic approach to
ontology-based information extraction with formal definitions of and ontology and
polyadic relations. Section 4 presents ontological factors relevant for coreference
resolution illustrated by text examples and revises the similarity measure of objects.
In Section 5 we consider features of experiments in our approach. We conclude with
the base characteristics and advantages of the proposed approach and outline the di-
rections for future research.


2      Coreference in Information Extraction Tasks

We observe several classification aspects of problems related to the reference identifi-
cation.

─ First aspect is the way of presenting references in the text: full lexical expressions
  (noun phrases – proper names, descriptions, descriptions combined with proper
  names) or reduced expressions using anaphoric means (pronouns, determiners) or
  anaphoric zero. In the first case, for noun phrases based on proper names, the prob-
  lem is detecting identical references to named entities. In the second case, the prob-
  lem is identification of the antecedent, i.e. anaphora resolution [2, 3].
─ Second aspect is the type of the referenced object: referential identity of entities or
  situations (events).
─ Third aspect is the search area and type of context: the context of a single docu-
  ment (simple and complex sentences or chains of sentences in one text) opposes to
  cross-document analysis, in which references to the same object are looked for in
  the corpus or document flow.

   The traditional problem of anaphora and coreference resolution within a coherent
text remains to be relevant. Many early and modern researches solve the problem
using linguistic methods based on rules and methods of machine learning. R. Mitkov's
                                                                                         3


reviews [4, 5] and later [6, 7] consider the basic approaches to this problem. Recently,
there has been a growing interest in solving the problem in a broader perspective: not
only entities but also events or situations have been considered [8 – 12]. A cross-
document reference analysis that is an important approach for populating knowledge
bases and ontologies is used for the problem as well [8, 13 – 15]. The complexity of
the problem of coreference resolution requires an integrated approach, involving both
knowledge about the structure of the text (the level of discourse) and knowledge
about the subject area, which are determined by the classes of entities in a specific
ontology and their ontological structure (ontological level). In [16] the authors con-
sciously abstract away from the discourse factors of coreference in order to investi-
gate the role of subject knowledge. Discourse features represent the structural and
textual properties of mentions (similarity of sub-chains, position, distance), grammat-
ical and lexical features. Obviously, new tasks require a revision of the role of dis-
course features in comparison with ontological ones. Thus, cross-document analysis
does not consider pronominal anaphora and hardly takes into account such discourse
factors as the order of appearance of mentions in the text, and the distance (linear or
rhetorical).
   Theories of discourse analysis distinguish several types of discourse connectivity:
referential (identity of participants), spatial, temporal and event-triggered ones [17].
In applied research, there are two approaches to understanding the coreference of
events. In the first approach, two mentions of an event are considered coreferent if
they are characterized by the same set of properties (such as time or place of the
event) and the same set of participants [9 – 11]. In the second approach, only the ref-
erential identity of participants is considered for referential identity of events [3]). In
[12] a broader set of referential relations between two mentions of events is consid-
ered: complete coreference, subevents for vertices of the parent and child layers,
subevents for a descendant vertex of a single layer.
   We consider the problem of information extraction as a task of detecting all refer-
ences to objects of a given domain: entities and situations (events, states, actions,
processes). In the ontology population task, the found objects should be represented
as instances of concepts and relations of the ontology. It is necessary to establish ref-
erential relations between all instances found in the process of text analysis and in-
stances of the ontology information content (which does not exclude the possibility of
adding new instances to the ontology).


3      The Model of Information Extraction

Consider the environment in which our approach to coreference resolution is being
developed. Fig. 1 shows the general scheme of the information extraction system (IE-
system) with the emphasized module of coreference resolution.
   The input of our IE-system comprises: the ontology of a subject domain, the ontol-
ogy population rules and the results of preliminary text processing including the ter-
minological, thematic, and segment coverings of an input text.
4


   A terminological covering is the result of lexical text analysis which extracts terms
of a subject domain from a text and forms lexical objects using semantic vocabularies.
A segment text covering is a division of the text into formal fragments (clauses, sen-
tences, paragraphs, headlines, etc.) and genre fragments (document title, annotation,
glossary, etc.). A thematic covering selects text fragments of a particular topic. A
construction of a thematic covering is based on the thematic classification methods.
   The module of information extraction constructs objects representing instances of
concepts and relations of the domain ontology from the lexical objects [18]. This
module uses the ontology population rules which are automatically generated from
fact schemes. The fact schemes are formulated by experts taking into account the
ontology and language of a subject domain. These fact schemes constrain morpholog-
ical, syntactic, structural, lexical, and semantic characteristics of the objects.
   The coreference resolution module [19] runs in parallel with the information ex-
traction module. This module forms hypotheses about coreference relations, and cal-
culates their weights using various factors discussed below.


     Fig. 1. The scheme of the system of information extraction and ontology population.

The ambiguity resolution module resolves all types of conflicts which are the result of
various interpretations of the input text — different object text coverings for the same
text fragment. This module chooses the most informative variant from the set of pos-
sible interpretations (the variant with the highest weight) [20].
   The result of the work of our IE-system is the population of ontology content by
instances of concepts and relations of the subject domain found in the input text.
                                                                                           5


3.1    The Ontology of a Subject Domain
An ontology O of a subject domain includes the following elements:
─ a finite nonempty set CO of classes for representing the concepts of the subject
  domain,
─ a finite set DO of data domains, and
─ a finite set of attributes with names in AtrO = DatO∪RelO, each of which has values
  in some data domain from DO (data attributes or datatype properties in DatO) or
  has values as instances of some classes (object attributes or object properties in
  RelO, which model binary relations).

Each class c ∈ CO is defined by the set of its attributes: c = (Datc, Relc), where every
data attribute α ∈ Datc ⊆ DatO has the domain dα ∈ DO with values in Vd and every
                                                                               

object attribute ρ ∈ Relc ⊆ RelO has values from the subset Cρ ⊆ CO. The set of all
class attributes is denoted by Atrc = Datc ∪ Relc. We consider an ontology without
data and class synonyms, i.e. ∀ α1, α2 ∈ DatO: dα1 ≠ dα2 and ∀ c1, c2 ∈ CO : Atrc1 ≠ Atrc2.
   We denote the class of an attribute γ by cγ and the set of its values by Dγ. A set of
attributes of every class must include the nonempty set of key attributes AtrcK . The
key attributes can either be data or object attributes. These attributes guarantee unam-
biguous definition and uniqueness of the class instances.
   A tuple a = (ca, Data, Rela) is an instance of the class ca  ( Datc , Rel c ) (a ∈ ca)
                                                                         a         a

iff every data attribute  a  Data has a name   Datc with the values V from Vd
                                                          a                        a       

and every object attribute a  Rela has a name   Rel c with the values V as
                                                                a                      a

instances of the classes from Cρ.
   We use the standard class inheritance relation: the class c2 is a subclass of the
class c1 (c1 < c2) iff ∀ a ∈ c2: a ∈ c1.
   The information content ICO of the ontology O is a set of instances of the classes
from O. The ontology population problem is to compute information content for a
given ontology from the given input data.


3.2    Polyadic Relations
The notion of polyadic relation is not considered in the classical ontology theory. For
example, the OWL – the standard ontology description language – has no language
constructions for polyadic relations, only binary relations (Object Property) are avail-
able. On the other hand, polyadic relations frequently arise in the tasks of extracting
information from texts, because they can describe the propositional content of a
statement that represents an extra-linguistic situation, or state of affairs (event, action,
process, etc.).
   To overcome these shortcomings, we model polyadic relations (or just relations)
by ontology classes with constraints on the set of attributes. First, relations classes
have to include at least two object properties. Second, every object property of a rela-
6


tion has to be a key attribute. A polyadic relation may also contain datatype properties
without special constraints.
   Due to this definition, a polyadic relation is naturally represented by the set of bi-
nary relations. And vice versa, a binary relation can be represented by the polyadic
relation with two object properties as a special case of polyadic relations.
   In text processing, we consider polyadic relations correspond to descriptions of sit-
uations (actions, processes) and other objects with complex structure. The following
Table 1 gives some examples of polyadic relations extracted from texts.
   These examples relate to the automated control systems subject domain that in-
cludes such relation classes as Action, Process, Function, Control, Movement,
Change_of_state, etc. Object properties of relation classes correspond to the hierarchy
of semantic roles. The semantic role is a generalization of the functions of a partici-
pant in a range of situations denoted by a group of predicates, and hence the types of
corresponding situations.

                           Table 1. Examples of polyadic relations.

                Type: information_transfer
                Sender: X
     S1                                                     The system (Y) receives commands
                Recipient: Y
    Action                                                  (Z) from the operator (X)
                Message: Z
                Content: null
   S2           Agent: X2                                   The command (Z) is entered by the
                Type: processing                            operator (X2) through the remote
 Process
                Message: Z                                  operator console


3.3      The Coreference Resolution Problem
The information content of a text consists of a set of instances of ontology classes and
relations found in the text, which are provided with additional information.
   We define a set A of information-text objects (i-objects) retrieved from input data
and corresponding to ontology instances. Every i-object a∈A has the form (ca, Data,
Rela, Ga, Pa), where

─ ca∈ CO is the ontology class;
─ Data is the set of data attributes  a  ( ,V ) , where
                                                    a


        Datc is the attribute name, and V is the set of values v ∈ dα;
               a                                a

─ Rela is the set of object attributes a  (  ,V ) , where
                                                        a


        Rel c a is the attribute name, and V                is the set of i-objects of a class
                                                            a


        c a  C a ;
─ Ga is the grammar information (morphological and syntactic features based on
  grammar features of lexical object);
─ Pa is the structural information (a set of positions in the input data and the formal
  segments).
                                                                                             7


The attribute γ of the i-object a is filled if V   . We denote by Atra = Data ∪ Rela
                                                     a

the set of all attributes. Each i-object corresponds to some ontology instance in a natu-
ral way as follows. Let a = (ca, Data, Rela, Ga, Pa) be an i-object, then its correspond-
ing ontology instance is a′ = (ca, Data′, Rela′), and every α ∈ Data′ has value(s) in V
                                                                                         a

and every ρ∈ Rela′ has values in V .
                                          a

   We assume that i-objects a and b are possible coreferents a ≈ b (candidates for co-
reference) iff their classes are transitively related by the class inheritance relation and
the set of values of all filled key attributes of one i-object is included in the set of
values of the corresponding key attributes of the other i-object.
   The coreference resolution problem is to detect if given candidates for coreference
correspond to the same ontology instance.


4         Referential Factors

In previous papers [19], we considered two types of factors that affect the evaluation
of the measure of the coreferential similarity of two objects. First, discourse factors
(local textual and contextual) are determined by the language means used to represent
the objects in the text and by their location in the text structure. Second, semantic
factors determine the similarities of objects with respect to their ontological structure
and relations.
   In our approach, we distinguish logical-ontological factors for considering a set of
associated relations between objects. For these factors we use the properties of rela-
tions specified in the ontology.
   All these factors are used to evaluate similarity of objects mentioned in the text.
For each factor, we define a similarity measure. This measure corresponds to the de-
gree of strength of the coreferent relation between the i-objects a and b with respect to
the factor, without taking into account other factors.


4.1       The Coreferential Conflict and the Similarity Measure
We define coreferential conflict as a case when two non-coreferent i-objects a and b
are possible coreferents of the third i-object c: a ↭c b  (a ≈ c) (b ≈ c)  (a ≈ b).
   To determine which of these i-objects are actually coreferent, we use the measure
of coreference similarity of i-objects. This measure for i-objects a and b is denoted as
cs(a,b). If the non-coreferential i-objects a and b are possible coreferents for the i-
object c, we say that the coreferential conflict is resolved to a iff cs(a,c) > cs (b,c), i.e.
the i-object a is more similar to i-object c, then i-object b.
   The integral measure of similarity cs(a,b) is calculated as an Euclidean measure of
similarity based on four measures – semantic S(a,b), context C(a,b), position P(a,b)
and grammar G(a,b).

                    1
      cs (a, b)      (1  S (a, b))2  (1  C (a, b))2  (1  P(a, b))2  (1  G(a, b))2   (1)
                    2
8


The context similarity measure C(a,b) takes into account the information connectivity
of i-objects in a given text. This measure depends on the number of i-objects which
directly or indirectly use a) attribute values from both a and b, and b) attribute values
borrowed by a from b, and by b from a, for the evaluation of their own attributes.
   The position similarity measure P(a,b) takes into account variants of location of i-
objects in an input text. This measure depends on the number of segments, number of
possible candidates in the conflict, and number of lexemes placed between the
positions of a and b.
   The grammar similarity measure G(a,b) is based on the standard linguistic features
such as gender, number, person, etc.
   The semantic similarity measure S(a,b) determines the degree of proximity of the
corresponding attribute sets Atra and Atrb. Comparing these two sets takes into ac-
count both the similarity of the values of their constituent elements and additional
characteristics based on the ontological properties of attributes, including the inher-
itance of classes and data attributes, intersection, union, composition, refinement,
inversion, inclusion, closure, transitivity and symmetry.
   In [1] we consider 11 types of similarities. Below we expand this set with
similarities using polyadic relations. Initially, S(a,b) was determined by formula (2),
where Simb  {( a ,  b ) | sim( a ,  b )  0} :
            a


                                        1
                     S ( a, b )                      sim( a ,  b )
                                    | Simba | ( a ,  b )Simba
                                                                                       (2)


Here, under the sign of the sum, all kinds of similarities of the attributes of the objects
a and b are collected. Practical considerations and experimental data revealed particu-
lar cases in which basic formula (2) is inexact and instable with respect to adding new
attribute comparison characteristics: i-objects that have a large set of comparable but
actually not similar attributes can turn out to be close with each other due to just tak-
ing into account that the similarity of attributes that is greater than zero. It is worth
noting that such cases are very rare due to the definition of coreference and the formu-
lation of the problem of extracting i-objects. The second disadvantage of formula (2)
is expressed by the fact that adding new terms to the sum can decrease the total value.
But one should expect that positive additional information about the proximity of
attributes have to always increase the similarity of the corresponding i-objects. These
additional characteristics are based on the ontological properties of attributes, includ-
ing, in particular, composition, transitivity, refinement, etc., and specials properties of
polyadic relations described below. In view of the above, it was proposed to convert
formula (2) to a formula of the following form:

                         S (a, b)  S EQ  (1  S EQ )  S                            (3)

The value SEQ[0;1] corresponds to the similarity of the values of the corresponding
attributes of the objects a and b without taking into account the additional characteris-
tics, and S[0;1) — the additional information provided by these characteristics.
                                                                                     9


  SEQ is calculated by formula (4), similar to formula (2), where the set of pairs of
                         a
similar attributes Simb is replaced by the set of pairs of comparable attributes
Compba  {( a ,  b ) |  a  Atra ,  b  Atrb ,   }.

                                  1
                     S EQ                         sim( a ,  b )
                              | Compba | ( a ,  b )Compba
                                                                                    (4)


Only measures of standard similarity of attributes by values stand under the sign of
the sum in the formula (4) [19].
   Let the total amount of additional information about the attributes of objects a and
b be

                             I            sim ( ,  )
                                                      

                                   a Attra , b Attrb
                                                          a   b                     (5)


Here the symbol  denotes additional properties of attributes, such as transitivity,
composition, etc. It is obvious that I can take any positive values. Hence, in order to
get the value of S varying from 0 to 1, we need a monotonic transformation defined
everywhere on the positive semi-axis. Using I, we evaluate the additional similarity of
the i-objects a and b. Really, we determine the value of the probability of this simi-
larity S:

                                                  I
                                       S                                          (6)
                                                 1 I
We can see from formulas (3), (5) and (6) that

─ S(a,b) = 1  SEQ = 1,
─ S  [0;1), and
─ S(a,b) > SEQ  SEQ < 1  S > 0.

In other words, when objects have incomplete similarity in the values of comparable
attributes, and the additional information is available, the degree of similarity S is
always greater than SEQ, but full match is achieved only under the condition that the
values of all comparable attributes are the same taking coreference into account.


4.2    Relations Factor
For evaluating similarity we consider polyadic relations in the following two aspects.
   First, comparing polyadic relation instances for identification coreference between
them.
   Example 1. When the bottle reaches a certain position, (the sensorX communicates
with the conveyor Y)S1 to inform it that it should stop. For this purpose (the sensorX
sends a signal StopZ to the receiving device of the conveyorY)S2
10


   In this example, we can distinguish two possible coreferent instances of polyadic
relations S1 and S2:

 S1: Contact (Originator: X, Recipient: Y)
 S2: Information_transfer (Originator: X, Recipient: Y, Content: Z)

These instances are similar because their Originator and Recipient attributes have
coreferent values.
   Second, using information about polyadic relations for identification coreference
between i-objects participating in these relations. For this purpose, pairs of relations
are considered that contain similar values (besides the objects themselves being com-
pared). Change the example from the previous version.
   Example 2. (The sensorX1 transmits a messageZ to the conveyorY)S1 to inform it that
the bottle has reached a certain position. So, (itX2 controls the operation of the con-
veyorY)S2.
   In this example polyadic relations are represented by the following instances:

 S1: Information_transfer (Originator: X1, Recipient: Y, Content: Z)
 S2: Control (Controller: X2, Patent: Y)

We consider the instances X1 and X2 are similar because S1 and S2 have a similar
value Y. Note that in the last example the relations of different classes with different
sets of object attributes are compared because we allow the comparison of arbitrary
relations.
   We define the following formal ontological properties for object attributes. They
are used for definition of object similarity measures that take into account polyadic
relations. We borrow some concepts of relational algebra. We denote the set of all
polyadic relations of the ontology O by SO.
   Definition 1. Let ρ, ρ′, ρ′′ ∈ RelO.

─ The attributes ρ, ρ′ are in the projection relation ρ= ρ′ iff Cρ, Cρ’ ⊆ SO and ∃
        γ1,…, γm , ′ γ′1,…, γ′m):∀ a∈ c∈ Cρ ∃ a′ ∈ Cρ’: π a π ′ a′ , i.e. Vγia = Vγ′ia′
  (i∈[1..m]), and vice versa, ∀ a′∈ c′∈ Cρ′ ∃ a ∈ Cρ: π ′ a′ π (a), i.e. the values of
  the attributes that are in the projection relation are instances of the polyadic rela-
  tions that contain equal values.
─ The attributes ρ, ρ′ and ρ′′ are in the natural join relation ρ=ρ′⋈ρ′′ iff Cρ, Cρ′, Cρ′′⊆
  SO and ∀ a′∈ c′∈ Cρ′, ∃ a ∈ c ∈ Cρ, A ⊆ Atra: π tra′ a′ πA(a), ∀ a′′∈ c′′∈ Cρ′′ ∃ a ∈
  c ∈ Cρ, A ⊆ Atra: π tra′′ a′′ πA(a), and ∀ a∈ c∈ Cρ,b∈ Atra : (∃ a′∈ c′∈ Cρ′, b’∈ At-
  ra′ : b= b′ ∨ (∃ a′′∈ c′′∈ Cρ′′, b′′∈ Atra′′ : b b′′ , i.e. the instances that are the values
  of the object attributes ρ′ и ρ′′ are complementary different views projections on
  the values of the attribute ρ.

Thus, the projection describes a subset of the common elements of the relation in-
stances. In Example 1, the common projection of instances of the relations S1 and S2
is {X, Y}. In Example 2, the corresponding projection is the set {Y, X1, X2}. The natu-
ral join takes into account the presence of a third relation when comparing a pair of
relation instances. This relation includes the join of the attributes of these relations.
                                                                                        11


The presence of such a third relation is an evidence of the information included in the
first two ones.
    The example of the ontological natural join relation is ontological description of
the modules of a technological complex that execute the similar tasks. Each module is
represented by a relation, including instances of the tasks: SMi (w1, …, wn). The com-
plex performs the whole set of tasks, which is the result of the natural join of the tasks
executed by the modules: ∪ wij, wij SMi.
    For those cases when properties of attributes in Definition 1 cannot be derived
from the ontology description, there is a need to check the necessary conditions of the
presence of the properties. The following proposition formulates these conditions in a
constructive way. We denote the necessary condition of a property x by 𝒩x.
    Proposition 1. Let ρ, ρ′, ρ′′ ∈ RelO.

─ ρ= ρ′ ⇒𝒩 = (Cρ∩iCρ’ ≠ ∅);
─ ρ=ρ′⋈ρ′′ ⇒𝒩⋈ = (Cρ′⋃ i Cρ′′⊆ i Cρ).

Here, the superscript i in the set operations means that we make the operation over the
elements of the sets and over their parental classes and subclasses in the class hierar-
chy. The proof follows from Definition 1.
   Taking into account Definition 1, we define the projection and natural join based
similarities of the attributes. We also define the class similarity. In the following defi-
nition, the superscript r in comparison operations and calculation of the power of sets
means that the operations consider the elements of the sets and their possible corefer-
ents.
   Definition 2. For i-objects a and b with a ≈ b and ca ≤ cb, we compute the power of
the class similarity as simc(ca, cb) = |cb|/|ca|, where |cx| is the number of subclasses of
the class x including x itself.
   Definition 3. For i-objects a and b, we consider object relation ρ∈ Rela and
ξ ∈ Relb with ρ, ξ∈ SO is

─ projectionally similar ρ∼ ξ, iff ρ= ξ∨ 𝒩 and Sπ=∪x∈Vρa {X⊆ Atrx | ∃ y∈Vξb, Y⊆
  Atry : πX(x)= r πY y } ≠ ∅. The power of the projection similarity is simπ ρ, ξ
    ½|Sπ|( c(Vρa)-1 + c(Vξb)-1), where c(Vμ ∑z∈ Vμ∑γ ∈ Atrz |Vγ |r.
─ joinly similar ρ∼⋈ ξ, iff ∃ μ: μ ρ⋈ξ ∨ 𝒩⋈ and S⋈ = {(x, y) | x∈Vρa, y∈Vξb, ∃ z∈
  Cμ, Zx⊆ Atrz, Zy⊆ Atrz: Atrz ⊆ Zx∪ Zy, πAtrx(x)= r πZx z and πAtry(y)= r πZx z } ≠ ∅.
  The power of the join similarity is sim⋈ ρ, ξ ½|S⋈|((|Vρa|r)-1+(|Vξb|r)-1).

Thus, we can take into account the power of simc, simπ and sim⋈ of the projection and
join similarity in the semantic similarity measure along with the other factors in for-
mula (5). This allows us to take the context into account more accurately, improving
the quality of information extraction.


5      Characteristic of Experimental Study

The proposed approach to resolving coreference is based on the properties of the do-
main concepts presented formally. Testing its implementation requires for a formally
12


presented ontology of a subject domain, as well as text corpus annotated in accord-
ance with the ontology. Typed coreferential relations also have to be annotated.
   There exist coreferentially annotated corpora for English (MUC) and a number of
other languages (Catalan, Dutch, English, German, Italian, Spanish, Czech, Chinese
and Arabic). The first open corpus for Russian is RuCor (available at
http://rucoref.maimbava.net/) that represents anaphorical and coreferential relations
and morphological annotation. RuCor contains about 200 texts of different genres
(primarily news, essays, and fiction) that do not correspond to any special subject
domain [21]. The lack of appropriate datasets with deep layers of annotation is the
obstacle to the study of complex cases of coreference.
   Hence, for evaluation of our approach we form a corpus of examples with a com-
plex type of coreference, which can be resolved on the basis of ontology. Several
examples are selected for each type of ontological relation. The total volume of the
corpus is about 50 text fragments taken from texts of technical documentation and
encyclopedias. These fragments represent specifications of requirements from the
subject domain of automated control systems. Each example is annotated by corefer-
ence relations with types based on ontological properties.
   We consider such annotation of coreference information necessary for further lin-
guistic research. Extending the capabilities of automatic analyzers with computation-
al similarity models based on ontological properties improves the quality of corefer-
ence resolution. Thus, for the examples found, the use of logical-ontological measures
allows to increase the measure of similarity of the “correct” variant by 0.05-0.1 (5-
10%).


6      Conclusion

In the papers on the topic of coreference resolution, we proposed a formal statement
of the problem and mathematically-strict definitions of the notions of coreference,
coreferential conflict and ontological properties used to resolve the coreference. This
is an important contribution to ensure the correct operation and improve the quality of
the coreference resolution algorithms.
   The main features of the proposed approach to coreference resolution are:

1. shift of the emphasis from discourse factors to the subject knowledge, primarily to
   the ontology of the subject domain to be populated through information extraction,
   disambiguation, and coreference resolution;
2. integration of computational and linguistic models and techniques of text analysis
   at the phase of semantic processing. Thus, weighted coreferential relations between
   objects are used for coreference resolution. In this process, the hypothetic corefer-
   ential relations are generated by the linguistic model, and the resolution (choice of
   the best hypothesis) is based on the statistical data;
3. scalability of the solution. Our approach can be enriched with new information ex-
   traction rules and referential factors.
                                                                                          13


The corpus with annotated coreference is necessary for studying different cases of
repeated mentions of events that need ontological information about polyadic rela-
tions to correctly resolve coreferences. Our future research will focus on general clas-
sification of such cases. We plan to develop special case-oriented coreference resolu-
tion techniques, particularly, by considering the relevance of ontological properties
for the evaluation of similarity of possible coreferents. Taking this into account, we
are faced with the problem of defining ontology formal properties that provide a bet-
ter solution to the tasks of extracting information from the text and, in particular, the
resolution of the coreference.

Acknowledgement. The study was supported by the Russian Foundation for Basic
Research, project 17-07-01600.


References
 1. Garanina, N., Sidorova, E., Kononenko, I., Gorlatch, S.: Using Multiple Semantic
    Measures For Coreference Resolution. Ontology Population. International Journal of
    Computing 16(3), 166–176 (2017).
 2. Dimitrov, M., Bontcheva, K., Cunningham, H., Maynard, D.: A Light-weight Approach to
    Coreference Resolution for Named Entities in Text. In: Branco, A., McEnery, T., Mitkov,
    R. (eds.) Anaphora Processing: Linguistic, Cognitive and Computational Modelling, vol.
    263, pp. 97-112. John Benjamins Publ., (2005).
 3. Sobha, L.: Anaphora Resolution Using Named Entity and Ontology. In: Johansson, C.
    (ed.) Proceedings of the Second Workshop on Anaphora Resolution (WAR II), NEALT
    Proceedings Series, vol. 2, pp.91-96 (2008).
 4. Mitkov, R.; Anaphora resolution: the state of the art. In: Working paper based on the
    COLING'98/ACL'98 tutorial on anaphora resolution. Wolverhampton (1999).
 5. Mitkov, R.: Anaphora resolution. In: Mitkov, R. (ed.) The Oxford handbook of computa-
    tional linguistics, ch.14, pp. 266-283. Oxford university press, N.Y. (2003),
    https://pdfs.semanticscholar.org/e782/00b1e3ba2a72de1ca9b9b2c5efa775151bfa.pdf, last
    accessed 2018/10/04.
 6. Elango, P.: Coreference Resolution: A Survey: Technical Report. UW-Madison (2006),
    https://ccc.inaoep.mx/~villasen/index_archivos/cursoTATII/Entidades Nombradas/Elango-
    SurveyCoreferenceResolution.pdf, last accessed 2018/04/01.
 7. Prokofyev, R., Tonon, A., Luggen, M., Vouilloz, L., Difallah, D.E., Cudr´e-Mauroux, P.:
    SANAPHOR: Ontology-Based Coreference Resolution. In: 14th International Semantic
    Web Conference, part I, LNCS, vol. 9366, pp. 458-473. Springer, Cham (2015).
 8. Lee, H., Recasens, M., Chang, A., Surdeanu, M., Jurafsky D.: Joint Entity and Event Co-
    reference Resolution across Documents. In: Proceedings of the Joint Conference on Empir-
    ical Methods in Natural Language Processing and Computational Natural Language,
    EMNLP-CoNLL 2012, pp. 489–500 (2012).
 9. Cybulska, A., Vossen, P.: “Bag of Events” pproach to Event Coreference Resolution.
    Supervised Classification of Event Templates. International Journal of Computational Lin-
    guistics and Applications 6(2), 11-27 (2015).
10. Borgo, S., Bozzato, L., Aprosio, A.P., Rospocher, M., Serafini L.: On Coreferring Text-
    extracted Event Descriptions with the aid of Ontological Reasoning. Technical Report
    (2016), https://arxiv.org/pdf/1612.00227.pdf, last accessed 2018/10/04.
14


11. Bejan, C.A., Harabagiu, S.: Unsupervised event coreference resolution with rich linguistic
    features. In: Proceedings of the 48th Annual Meeting of the Association for Computational
    Linguistics, pp.1412-1422 (2010).
12. Araki, J., Liu, Z., Hovy, E., Mitamura, T.: Detecting Subevent Structure for Event Coref-
    erence Resolution. In: Proceedings of the Ninth International Conference on Language Re-
    sources and Evaluation (LREC-2014), pp. 4553–4558 (2014).
13. Mayfield, J., Alexander, D., Dorr, B.J., Eisner, J., Elsayed, T., Finin, T., Fink, C., Freed-
    man, M., Garera, N., McNamee, P., Mohammad, S., Oard, D., Piatko, C., Sayeed, A.B.,
    Syed, Z., Weischedel, R.M., Xu, T., Yarowsky, D.: Cross-Document Coreference Resolu-
    tion: A Key Technology for Learning by Reading Association for the Advancement of Ar-
    tificial Intelligence. In: AAAI Spring Symposium: Learning by Reading and Learning to
    Read, pp.65-70 (2009).
14. Yatskevich M., Welty C., Murdock J.W. Coreference resolution on RDF Graphs generated
    from Information Extraction: first results. ISWC'06 Workshop on Web Content Mining
    with Human Language Technologies (2006).
15. Hladky, D., Ehrlich, C., Efimenko, I., Vorobyov V.: Discover Shadow Groups from the
    Dark Web. In: Web Intelligence and Security: Advances in Data and Text Mining Tech-
    niques for Detecting and Preventing Terrorist Activities on the Web, pp. 67-81 (2010).
16. Suleymanova, E., Trofimov, I.: A method for coreference resolution within information
    extraction. In: Program Systems: Theory and Applications 1(15), 15–30 (2013). (in Rus-
    sian)
17. Giv`on T.: Coherence in text, coherence in mind. Pragmatics and cognition 1(2), 171–227
    (1993).
18. Garanina, N., Sidorova, E.: Ontology Population as Algebraic Information System Pro-
    cessing Based on Multi-agent Natural Language Text Analysis Algorithms. Programming
    and Computer Software 41(3), 140–148 (2015).
19. Garanina, N, Sidorova, E., Seryi, A.: Multiagent Approach to Coreference Resolution
    Based on the Multifactor Similarity in Ontology Population. Programming and Computer
    Software 44(1), 23–34 (2018).
20. Garanina, N., Sidorova, E., Anureev, I.: Conflict resolution in multi-agent systems with
    typed relations for ontology population. Programming and Computer Software 42(4), 31–
    45 (2016).
21. Toldova, S., Roytberg, ., Nedoluzhko, А., Kurzukov, M., Ladygina, ., Vasilyeva, M.,
    Azerkovich, I., Grishina, Y., Sim, G., Ivanova, A., Gorshkov, D.: Evaluating Anaphora
    and Coreference Resolution for Russian. In: Computational Linguistics and Intellectual
    Technologies, Proceedings of the International Conference “Dialog 2013”, pp. 681–695.
    Publishing House of the RSUH, Moscow (2013).