      Automated Anaphora and Co-reference Resolution for
    Lithuanian Language Combining Results from Different
                    Text Analysis Stages

                          Voldemaras Žitkus and Lina Nemuraitė

    Kaunas University of Technology, Department of Information Systems, Kaunas, Lithuania
                {voldemaras.zitkus, lina.nemuraite}@ktu.lt


        Abstract. The goal of this research is to take the first steps towards automated
        anaphora and co-reference resolution for the Lithuanian language, given the limited
        pre-processing tools and resources available, by combining concepts and algorithms
        from different text analysis phases. Existing resolution methods were created for
        major languages, e.g., English, and are usually language-specific. Based on an
        analysis of existing methods, a taxonomy of anaphoric objects is created and initial
        algorithms are proposed for resolving anaphoras and co-references in the Lithuanian
        language.

        Keywords: anaphora resolution, co-references, natural language processing, an-
        notation, Lithuanian language.


1      Introduction

   While the amount of available information is rapidly increasing, research in the
Natural Language Processing (NLP) field is becoming more and more important.
Unfortunately, most NLP work has focused on English and other major languages,
leaving this field underdeveloped for smaller languages. As a result, the Lithuanian
language lacks mature NLP tools and resources, and some parts of the NLP process
have not been researched at all. Anaphora and co-reference resolution is one such
case for the Lithuanian language.
   In NLP, an anaphora is an expression whose interpretation depends on another
expression in context [1]. The anaphora relation between the anaphoric object and its
antecedent is an intra-linguistically determinable relation. It is neither transitive,
nor reflexive, nor symmetric [2]. The interpretation of an anaphoric object requires
another object (the antecedent) that it refers to, e.g.:
• Tom skipped school today. He was sick.
   The relationship between “He” and “Tom” is called an anaphora. In this case, “He”
is an anaphoric object that refers to its antecedent “Tom”. Without resolving
anaphoric expressions, we would know neither why Tom skipped school nor who was
sick. This information is very important when we try to extract semantic information
from various texts.

Copyright © 2015 by the authors. Copying permitted for private and academic purposes.
This volume is published and copyrighted by its editors.

   The terms “anaphora” and “co-reference” are often used together or mistaken for each
other. Anaphoric and co-referential relations can sometimes coincide, but this is not
always the case. Co-reference is the equivalence relation between two text items
(words or phrases having the same meaning) [2]. For example, to be co-referent,
“lecturer” and “Mark Smith” must represent the same person. Co-reference often
requires access to extra-linguistic information (additional knowledge about the world).
   The wider problem with anaphora and co-reference resolution is that, even for major
languages, the process remains semi-automated, which is unacceptable for the desired
analysis of existing textual information. This is caused by the imperfection of the
pre-processing methods and tools needed to prepare texts for anaphora and co-reference
resolution, and by the lack of reliable resources, e.g., annotated corpora for
resolution algorithms based on machine learning.
   Research questions. This research aims to take the first steps towards filling
the gap in anaphora and co-reference resolution for the Lithuanian language. It raises
the following research questions:
1. Can existing anaphora and co-reference resolution methods, designed for other
   languages, be adapted to the Lithuanian language? Can the quality assessments of
   these algorithms be comparable with those of algorithms created for major languages?
2. What automated methods and algorithms can be developed given the current
   availability of pre-processing tools and resources for the Lithuanian language?
3. How can semantic information be increased with additional (not limited to
   anaphora) co-reference resolution?
   The research methodology is based on Design Science Research and the Information
System Research Framework defined by Hevner et al. (2004) [3]. Relevant research on
anaphora and co-reference resolution for other languages is being analysed. Based on
this analysis, existing methods are being adapted and new ones suitable for the
Lithuanian language are being created. Experiments will be performed to evaluate and
improve the developed methods. The resulting work will supplement the existing body
of knowledge and serve as a foundation for future work on automatically resolving
anaphoric expressions and co-references in Lithuanian and, possibly, other languages.
   The rest of the paper is structured as follows. Section 2 reviews related work.
Section 3 presents the main idea of this research and some initial results that have
been achieved. Section 4 draws conclusions and outlines future work.


2    Literature review of the problem domain and related solutions

  This section reviews the anaphora resolution methods that were analysed in this
dissertation research.
  Syntax based approaches. One of the earliest anaphora resolution methods was
proposed by Hobbs in 1977 [4] (often called Hobbs’s naive algorithm). Despite its
age, it is still referenced and measured against today. The algorithm works on a
fully parsed syntactic tree: it finds a pronoun and navigates through the tree to
determine its possible antecedent (a noun). When a candidate is found, agreement in
gender, number, etc., between the pronoun and the noun is checked on the basis of
morphological and real-world knowledge. If they agree, the noun is selected as the
antecedent of the pronoun; otherwise, the algorithm looks for another candidate. This
approach encounters problems when there are several possible candidates: the algorithm
picks the first one even though another one might be correct.
    Centring theory (CT). Centres link one utterance with other utterances in the
discourse. Each utterance has one backward-looking centre and a number of possible
forward-looking centres that the utterance has evoked. Forward-looking centres are
ranked by discourse salience and grammatical rules; the highest-ranked centre is
called the preferred centre [5]. Brennan et al. presented one of the best-known
approaches utilizing CT (often called BFP) in 1987 [6]. Tetreault proposed an
alternative to this approach in 1999 (called Left-Right Centering) [7].
    Salience factors. While salience plays a role in most approaches, it is usually
not considered the main criterion for anaphora resolution. A notable exception is the
RAP (Resolution of Anaphora Procedure) algorithm introduced by Lappin and Leass
in 1994 [8]. Only the gender, number and person of possible antecedents are taken into
consideration. With each new sentence, the weights of salience factors are degraded by
a factor of 2. The precise weights were reached after experimentation and numerous
adjustments.
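    As a rough illustration of the decay step, the following sketch halves every stored
salience score when a new sentence is processed and then adds the contributions of the
mentions in that sentence. The function name update_salience and the numeric weights
are assumptions made for illustration only; they are not the weights published for RAP.

```python
from typing import Dict, List, Tuple


def update_salience(
    salience: Dict[str, float], mentions: List[Tuple[str, float]]
) -> Dict[str, float]:
    """Halve all existing scores, then add the new sentence's contributions."""
    # Degrade every previously accumulated score by a factor of 2.
    updated = {entity: score / 2 for entity, score in salience.items()}
    # Add the (illustrative) weight of each mention in the new sentence.
    for entity, weight in mentions:
        updated[entity] = updated.get(entity, 0.0) + weight
    return updated


# Three sentences; each lists (entity, weight) pairs. Weights are illustrative.
sentences = [
    [("Tom", 100.0)],
    [("school", 60.0)],
    [],
]

salience: Dict[str, float] = {}
for sentence_mentions in sentences:
    salience = update_salience(salience, sentence_mentions)

print(salience)
```

An entity that is not mentioned again loses half of its score with every sentence, so
recently mentioned entities tend to outrank older ones.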
    Semantic information of Universal Networking Language (UNL). Anaphora
resolution strategies based on UNL were proposed for the Tamil language [9]. UNL
represents the semantic information of natural language texts as hyper-graphs of
concepts and 46 types of relationships. Anaphoric expressions are resolved based on
the types of relationships between nodes, similarly to centring and activation theories.
    Semantically Enhanced Domain Specific Natural Language (SE-DSNL). This
approach is targeted at NLP purposes in general but can also be used for rather
simplistic anaphora resolution [10]. It uses only two features (distance in the
syntax tree and semantic compatibility) and focuses only on pronouns.
    Statistical methods. One of the earliest statistical approaches was proposed by Ge
et al. in 1998 [11]. The approach considers various factors for resolving anaphoric
relations and investigates the relative importance of these factors while adding them
incrementally.
    Machine learning. The first learning-based system to achieve results comparable
with other approaches was presented by Soon et al. [12]. Their system includes
tokenization and segmentation, morphological processing, part-of-speech tagging, noun
phrase identification, Named Entity Recognition (NER), nested noun phrase extraction,
and semantic class determination. To improve the learning capabilities of the engine,
the authors introduced feature vectors of 12 features. Ng and Cardie expanded this
work [13].
    Comparison of approaches. The comparison of the analysed resolution methods is
presented in Table 1. Only the precision reported in the original research is given.
Recall is not given, since some of the methods did not provide its evaluation. The
evaluations were not done against the same corpora; therefore, the results are meant
only to give a general idea of the state of anaphora resolution.

                  Table 1. Comparison of anaphora resolution approaches

Method                Foundation                     Types of anaphoric expressions resolved                   Precision
Hobbs                 Syntactic                      Main pronouns: he, she, they, it                          81.8–91.7% (depends on type of text)
BFP                   Centring Theory                Pronouns (their types are not specified)                  49–90% (depends on type of text)
Left-Right Centering  Modified Centring Theory       Pronouns (their types are not specified)                  72.1–81% (depends on type of text)
RAP                   Salience factors               Third person pronouns, reflexive and reciprocal anaphors  85–86%; reaches 89% with inclusion of statistical algorithms
Statistical approach  Probabilistic model            He, she, it and their various forms                       82.9–84.2%
Machine learning      Machine learning               Noun phrases (including pronouns)                         65.5–67.3%
UNL based approach    Universal Networking Language  Pronouns                                                  67%
SE-DSNL               Pattern based approach         Pronouns, but can be used for other anaphora types        81.3%


3      Preliminary ideas of the proposed approach and the initial
        results

3.1     Preliminary ideas and the principal schema of the approach

   The goal of the research is to take the first steps towards automated anaphora and
co-reference resolution for the Lithuanian language, given the limited pre-processing
tools and resources available, by combining concepts and algorithms from different
text analysis phases. To reach this goal, the following tasks were set:
1. Analyse current methods and resources used for anaphora resolution in English and
   other major languages;
2. Develop rules and algorithms for anaphora and co-reference resolution in the
   Lithuanian language;
3. Implement rules and algorithms for anaphora and co-reference resolution suitable
   for improving semantic analysis and search in Lithuanian text corpora;
4. Conduct an experiment to evaluate the suitability of the created rules and algorithms;
5. Evaluate the developed method with recall and precision as the main criteria.


   The principal schema of the approach is presented in Fig. 1. Currently, anaphora
resolution algorithms can be based on morphological annotations and on entities
recognized by Named Entity Recognition (NER) algorithms, whereas the existing syntactic
annotation tools for the Lithuanian language have not yet reached sufficient quality.


                       Fig. 1. The principal schema of the approach

   Co-reference resolution algorithms can be applied after semantic annotation. There
are more possibilities for discovering co-references, but they depend on the
existence of pre-processing methods such as Semantic Role Labeling, so they are
currently beyond the scope of this research.


3.2    Taxonomy of anaphoric expressions

   This research combines multiple approaches to anaphora taxonomy by extending
the main morphology-based taxonomy with additional generalization sets that provide
better coverage of the anaphora phenomenon [14]. The distinction between the
categories of lexical semantics and domain semantics allows anaphoric expressions to
be identified from multiple viewpoints.
   The created taxonomy reflects the fact that the same anaphoric object may be
classified as a pronoun (morphological type), an agent (lexical semantics type)
and a person (domain semantics type). Some anaphoric relations may be detected
using morphological annotations; additional relations can be found from the results of
lexical semantic analysis; and yet others can be discovered from the domain
semantics represented in an ontology. The generic domain semantics categories,
characteristic of various domains, are extended with state, domain role and abstract
object, which are important for anaphora resolution. The “abstract object” represents
words or phrases such as “person”, “enterprise”, “young man”, etc., that can have
anaphoric references. Similarly, domain roles such as “president”, “teacher”,
“politician”, etc., can be used for discovering anaphoric relations. Morphological
classification is language-specific, but lexical and domain semantic classification
can be used for other languages too.
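   One possible way to represent an object that is classified from several viewpoints
at once is sketched below. The class AnaphoricObject, its field names and the helper
viewpoints are hypothetical and serve only to illustrate the idea; they are not part
of the taxonomy itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AnaphoricObject:
    """A text item annotated from up to three taxonomy viewpoints."""

    text: str
    morphological_type: Optional[str] = None    # e.g. "pronoun"
    lexical_semantic_type: Optional[str] = None  # e.g. "agent"
    domain_semantic_type: Optional[str] = None   # e.g. "person", "domain role"

    def viewpoints(self) -> List[str]:
        """Return the viewpoints under which this object has a classification."""
        available = []
        if self.morphological_type is not None:
            available.append("morphological")
        if self.lexical_semantic_type is not None:
            available.append("lexical")
        if self.domain_semantic_type is not None:
            available.append("domain")
        return available


# "He" from the introductory example, classified from all three viewpoints.
he = AnaphoricObject("He", "pronoun", "agent", "person")
# A domain role that has no morphological anaphora category assigned here.
president = AnaphoricObject("president", domain_semantic_type="domain role")

print(he.viewpoints())
print(president.viewpoints())
```

Keeping the three classifications as separate optional fields lets a resolution step
use whichever annotation layers are available for a given text.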


3.3    Anaphora resolution algorithm based on morphological and NER
       annotations

  This section presents the proposed anaphora resolution algorithm (Fig. 2), which
was created for the Semantic Search Framework for Lithuanian Language. The algorithm
was investigated on a corpus of articles collected from various Lithuanian Internet
news sites focusing on political and economic matters.
  The proposed resolution method focuses on cases where the anaphoric objects are
personal pronouns (subtypes of main pronouns, which in turn are subtypes of pronouns
in the morphological categorization) used to express persons (subtypes of domain
agents in the domain semantics categorization).


      Fig. 2. Anaphora resolution method based on morphological and NER annotations

  The algorithm searches for a pronoun for which anaphora resolution has not yet been
performed and checks it against a pre-set list of invalid pronouns, which usually
are either pleonastic or tend not to refer to persons. If the pronoun is valid, we go
backwards until we find a noun that is recognized as a person by NER. If a suitable
noun is not found, we move backwards to the previous sentence and perform the same
search until we either find a suitable noun or have passed X sentences backwards
from the pronoun; then we move forward up to Y sentences from the pronoun, searching
for a suitable noun.
   If we find a suitable noun, we determine whether it agrees in number and gender with
the pronoun. If the noun and the pronoun agree, their pair is added to the anaphora
annotations and we return to the first step.
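   The search described above can be sketched roughly as follows. This is a minimal
illustration under several assumptions, not the implementation used in the research:
the Token fields, the placeholder list INVALID_PRONOUNS, and the parameters back and
forward (standing for X and Y) are hypothetical; the forward search is assumed to
start with the rest of the pronoun's own sentence; and a candidate that fails the
agreement check is assumed to be skipped in favour of the next one. The example
sentences are the English ones from the introduction.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Token:
    text: str
    pos: str                  # "pronoun", "noun" or "other"
    gender: str = ""          # e.g. "masc", "fem"
    number: str = ""          # e.g. "sg", "pl"
    is_person: bool = False   # assumed to be set by a NER step


# Placeholder for the pre-set list of pleonastic / non-personal pronouns.
INVALID_PRONOUNS = {"it"}


def candidates(sentences: List[List[Token]], s_idx: int, t_idx: int,
               back: int, forward: int) -> Iterator[Token]:
    """Yield tokens in search order: backwards first, then forwards."""
    # Backwards inside the pronoun's own sentence.
    yield from reversed(sentences[s_idx][:t_idx])
    # Backwards through at most `back` preceding sentences.
    for offset in range(1, back + 1):
        if s_idx - offset < 0:
            break
        yield from reversed(sentences[s_idx - offset])
    # Forwards: rest of the current sentence, then at most `forward` sentences.
    yield from sentences[s_idx][t_idx + 1:]
    for offset in range(1, forward + 1):
        if s_idx + offset >= len(sentences):
            break
        yield from sentences[s_idx + offset]


def resolve(sentences: List[List[Token]], back: int = 2,
            forward: int = 1) -> List[Tuple[str, str]]:
    """Pair each valid pronoun with the first agreeing person noun found."""
    pairs = []
    for s_idx, sentence in enumerate(sentences):
        for t_idx, tok in enumerate(sentence):
            if tok.pos != "pronoun" or tok.text.lower() in INVALID_PRONOUNS:
                continue
            for cand in candidates(sentences, s_idx, t_idx, back, forward):
                agrees = (cand.gender == tok.gender
                          and cand.number == tok.number)
                if cand.pos == "noun" and cand.is_person and agrees:
                    pairs.append((tok.text, cand.text))
                    break
    return pairs


text = [
    [Token("Tom", "noun", "masc", "sg", True), Token("skipped", "other"),
     Token("school", "noun", "", "sg"), Token("today", "other")],
    [Token("He", "pronoun", "masc", "sg"), Token("was", "other"),
     Token("sick", "other")],
]
print(resolve(text))
```

Here “school” is skipped because it is not marked as a person, so “He” is paired
with “Tom”.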
   The algorithm can be considered naive, since it takes the first suitable noun that
agrees in number and gender as the antecedent (or postcedent) and does not consider
alternatives. The algorithm was evaluated against a corpus of 500 Internet news portal
articles focusing on politics and economics. It achieved 61% recall and 74% precision.
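   For reference, precision and recall over anaphora annotations can be computed as in
the sketch below, where each annotation is assumed to be a (pronoun id, antecedent id)
pair. The function name and the sample pairs are illustrative only; they are not taken
from the evaluation corpus.

```python
from typing import Iterable, Tuple

Pair = Tuple[int, int]


def precision_recall(predicted: Iterable[Pair],
                     gold: Iterable[Pair]) -> Tuple[float, float]:
    """Precision = correct / predicted pairs; recall = correct / gold pairs."""
    predicted_set, gold_set = set(predicted), set(gold)
    correct = len(predicted_set & gold_set)
    precision = correct / len(predicted_set) if predicted_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall


# Illustrative annotations: (pronoun id, antecedent id).
predicted_pairs = [(1, 0), (2, 0), (3, 1)]
gold_pairs = [(1, 0), (3, 1), (4, 2), (5, 2)]

precision, recall = precision_recall(predicted_pairs, gold_pairs)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```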


3.4    Co-reference resolution algorithm

   A co-reference relation is a relation between equivalent objects. In the proposed
approach, equivalent objects are identified after semantic annotation (Fig. 2), during
which named entities that have the same meaning but possibly different representation
forms are marked as different individuals.


                      Fig. 3. Algorithm for resolving of co-references

    The currently developed intertextual co-reference resolution algorithm merges such
individuals into a single entity. Here, “Valid types of nouns” is a list of noun types
that the algorithm can resolve, e.g., persons, locations, organizations; “Resolution
rules” is a list of rules that are valid for a specific type (or several types) of
nouns. The algorithm was tested on entities with various modifications of their names,
e.g., John Smith, J. Smith, J. S., Mr. Smith, John Smith’as, John Smithas, etc.,
co-referring to the same entity. The experiment was conducted on 277784 articles
containing 3058015 individuals. After merging, the number of individuals decreased to
77532 (i.e., by a factor of about 39). Unfortunately, due to the early stage of
development, we cannot yet provide an evaluation of the precision and recall of the
proposed algorithm.
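    A deliberately naive sketch of how name variants of a person might be merged is
given below. The rules shown (dropping titles, stripping an “as” or “’as” ending,
matching initials, aligning names from the surname) are assumptions chosen to cover
the example variants above; they are not the actual “Resolution rules” of the proposed
algorithm, and the names TITLES, normalize, same_person and merge_individuals are
hypothetical.

```python
import re
from typing import List

TITLES = {"mr", "mrs", "ms", "dr"}  # illustrative list of titles to ignore


def normalize(name: str) -> List[str]:
    """Lower-case the name parts, drop titles and a trailing "as"/"'as" ending."""
    parts = [p.strip(".").lower() for p in name.replace("’", "'").split()]
    parts = [p for p in parts if p and p not in TITLES]
    # Naive rule: strip the ending only from parts longer than three characters.
    return [re.sub(r"'?as$", "", p) if len(p) > 3 else p for p in parts]


def part_match(a: str, b: str) -> bool:
    """An initial matches any part with the same first letter; else exact match."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def same_person(name_a: str, name_b: str) -> bool:
    """Compare names part by part, aligned from the last part (the surname)."""
    a, b = normalize(name_a), normalize(name_b)
    if not a or not b:
        return False
    return all(part_match(x, y) for x, y in zip(reversed(a), reversed(b)))


def merge_individuals(mentions: List[str]) -> List[List[str]]:
    """Greedily add each mention to the first cluster whose first member matches."""
    clusters: List[List[str]] = []
    for mention in mentions:
        for cluster in clusters:
            if same_person(cluster[0], mention):
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters


mentions = ["John Smith", "J. Smith", "J. S.", "Mr. Smith",
            "John Smith’as", "John Smithas", "Anna Brown"]
clusters = merge_individuals(mentions)
print(len(mentions), "mentions merged into", len(clusters), "individuals")
```

Such matching is order-dependent and over-merges ambiguous initials (e.g., “J. S.”
would also match a different person with the same initials), which is one reason why
precision and recall need to be evaluated separately.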


4     Conclusions and future works

   The paper presents the ideas and initial results after 2 years of research. The
contribution of this research is the created taxonomy of anaphoric objects and the
algorithms for automated anaphora and co-reference resolution in the Lithuanian
language. Its uniqueness lies in the fact that anaphoric relations and co-references
are identified from multiple viewpoints by analysing categories of both lexical
semantics and domain semantics. The anaphora and co-reference resolution algorithms
combine results from different stages of text pre-processing. The research is at a
very early stage of tackling the anaphora and co-reference resolution problem for the
Lithuanian language, given imperfect pre-processing algorithms and limited resources.
Therefore, the analysed methods for other languages could not be adapted. However,
the assessments of our algorithms are comparable with those of algorithms created
for major languages.
   Future work is directed towards creating more sophisticated anaphora and
co-reference resolution algorithms using emerging tools and resources for the
Lithuanian language that are being developed simultaneously.


References
 1. Mitkov, R.: Anaphora Resolution. Longman, London (2002)
 2. Van Deemter, K., Kibble, R.: On coreferring: Coreference in MUC and related annotation
    schemes. Computational Linguistics 26(4), 629–637 (2000)
 3. Hevner, A.R., March, S. T., Park, J., Ram, S.: Design Science in Information Systems Re-
    search. MIS Quarterly, 28(1), 75–105 (2004)
 4. Hobbs, J.R.: Resolving Pronoun References. In: Grosz, B., Sparck-Jones, K., Webber, B.
    (eds.) Readings in Natural Language Processing, 99, pp. 339–352, Morgan Kaufmann
    Publishers Inc. (1986)
 5. Kibble, R.: A Reformulation of Rule 2 of Centering Theory. Computational Linguistics,
    27(4), 579–587 (2001)
 6. Brennan, S.E., Friedman, M.W., Pollard, C.J.: A Centering Approach To Pronouns. In:
    Proceedings of the 25th Annual Meeting of the Association for Computational Linguistics
    (ACL), pp. 155–162, Philadelphia, USA (1987)
 7. Tetreault, J.R.: A Corpus-Based Evaluation of Centering and Pronoun Resolution. Compu-
    tational Linguistics, 27(4), 507–520 (2001)
 8. Lappin, S., Leass, H.J.: An Algorithm for Pronominal Anaphora Resolution. Computation-
    al Linguistics 20(4), 535–561 (1994)
 9. Balaji, J., Geetha, T. V., Parthasarathi, R., Karky, M.: Anaphora Resolution in Tamil Us-
    ing Universal Networking Language. In: Proceedings of the Indian International Confer-
    ence on Artificial Intelligence, IICAI-2011, Karnataka, India (2011)
10. Fischer, W.: Linguistically Motivated Ontology-Based Information Retrieval. Doctoral
    dissertation, University of Augsburg, GER (2013)
11. Ge, N., Hale, J., Charniak, E.: A Statistical Approach to Anaphora Resolution. In:
    Proceedings of the Sixth Workshop on Very Large Corpora, pp. 161–170 (1998)
12. Soon, W.M., Ng, H.T., Lim, D.C.Y.: A Machine Learning Approach to Coreference Reso-
    lution of Noun Phrases. Computational Linguistics 27(4), 521–544 (2001)
13. Ng, V., Cardie, C.: Improving Machine Learning Approaches to Coreference Resolution.
    In: Proceedings of the 40th Annual Meeting of the Association for Computational
    Linguistics (ACL), pp. 104–111, Philadelphia, USA (July 2002)
14. Zitkus, V., Nemuraite, L.: Taxonomy of anaphoric expressions as a starting point for
    anaphora resolution in Lithuanian corpus. Informacinės technologijos (IVUS 2014), Kau-
    nas, Technologija. pp. 177–182 (2014).

