<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Anaphora Resolution from Social Media Text in Indian Languages (SocAnaRes-IL): 2nd Edition - Overview</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sobha Lalitha Devi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AU-KBC Research Centre, MIT Campus of Anna University</institution>
          ,
          <addr-line>Chennai</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Anaphors and their antecedents must be identified for Natural Language Understanding (NLU) applications such as Information Extraction, Conversation Analysis, Opinion Mining and Machine Translation. There is a great need to develop applications such as anaphora resolution and coreference resolution which can be used in NLU systems. This second edition of the shared task on anaphora resolution from microblog text and conversations for the Indian languages Hindi, Tamil and Malayalam is similar to the first edition, with more microblog data and additional conversation data. The aim is to provide data annotated for anaphora and to enhance research in this area. In this edition too we provided data for the Indian languages Hindi, Tamil and Malayalam, and also for English, which can serve as a resource-rich language if one wants to treat the Indian languages as resource-poor languages. Six registered groups took the data for development and testing, but only one group submitted a run. They used the NeuralCoref network by Hugging Face.</p>
      </abstract>
      <kwd-group>
<kwd>Anaphora Resolution</kwd>
        <kwd>Social Media Text Analysis</kwd>
        <kwd>Indian Languages</kwd>
        <kwd>Hindi</kwd>
        <kwd>Malayalam</kwd>
        <kwd>Tamil</kwd>
        <kwd>English</kwd>
        <kwd>Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>The second edition of SocAnaRes-IL is similar to its first edition in all its objectives. The difference
is that more microblog data is provided, and conversation data, manually created in Malayalam and
translated to the other languages, is included.</p>
<p>Social media platforms such as Twitter have generated a large amount of microblog text, bringing
in new challenges for NLP applications and calling for a new perspective in language technology
research. These microblog texts present a discourse genre with non-standard language
characteristics such as noisy or informal language, abbreviations which do not follow regular
patterns, purposeful typos or spellings, and the use of non-alphanumeric symbols such as # and @. This
requires new methodologies and techniques for processing such texts. The challenges brought in by
these texts span all aspects of language computing, from developing the tagset and collecting
the corpora to annotating the corpus and developing the system. The task proposed here is to
develop an anaphora resolution (AR) system from Twitter data annotated for anaphors and their
antecedents. The languages considered are from the Indo-Aryan and Dravidian families, namely Hindi,
and Tamil and Malayalam respectively, together with English. The objectives of the task are:</p>
<p>Creation of benchmark data for anaphora resolution in Indian-language microblog texts (Twitter
data).</p>
      <p>Encouraging researchers to develop novel systems for anaphora resolution.</p>
      <p>Providing researchers an opportunity to compare different techniques.</p>
<p>Anaphora resolution has been a challenging research area for more than four decades, and the
challenges grew when the focus shifted from standard texts to microblog texts. Very little work has
been done for Indian languages; the most prominent Indian languages with good anaphora
resolution systems are Malayalam, Tamil, Bengali and Hindi. As in every conversation, anaphors are
used extensively in microblog texts as well, the only difference being that their usage differs from that
in standard text. For example, the antecedent falling outside the text is a very common occurrence.
Similarly, it is very common for the antecedent to be not a noun phrase but a hashtag or an earlier tweet.
An anaphor can also refer to an event that is being trolled, which need not be explicitly marked in the
current tweet. Among the types of anaphors, pronominals are the most widely used.</p>
      <p>
        Many approaches have been used for anaphora resolution, and they can be classified as rule-based
approaches, corpus-based approaches using machine learning techniques, knowledge-poor approaches and
discourse-based approaches. The knowledge-poor approach of Mitkov [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the approach without deep parsing
of Kennedy &amp; Boguraev [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and the salience-based approach of Lappin &amp; Leass [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] are the most prominent
among them. Recent works use machine learning techniques such as decision trees, CRFs [
        <xref ref-type="bibr" rid="ref1 ref2">1-2</xref>
        ] and
Tree CRFs [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. There are also works in which a resource-rich language is used for resolving
pronominals in a resource-poor language [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
<p>In this task, the data was collected from Twitter using the Twitter API and annotated for
anaphor-antecedent pairs alone. The method of collecting data is the same as in the first edition,
SocAnaRes-IL 2020. In this edition we have included conversation data, which was created manually for
Malayalam and translated to the other languages. The annotation details are given in Section 2; the corpus
does not have any other grammatical annotation such as POS tags or NP/VP chunks. The participants were
free to use any external resources and any method.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Corpus Compilation</title>
<p>There are various challenges in anaphora resolution on microblog texts from Twitter. The language
of Twitter belongs to the Computer Mediated Communication (CMC) or Technology
Mediated Communication (TMC) languages, where there are restrictions in rendering. Tweets have
a fixed character limit, and users are forced to communicate their ideas within this limited number of
characters. Hence there is language variation, and various types of word-level and syntactic variations
are brought in to fit the idea into the given character span. Thus the language we are
analysing differs in morphology and syntax, with non-standard characteristics such
as noisy or informal language, abbreviations which do not follow regular patterns, purposeful typos,
new spellings, the use of non-alphanumeric symbols such as # and @, the use of symbols such as
emoticons/emojis, and the use of meta tags and hashtags. Another important aspect of Twitter data is code
mixing, which occurs at the word level and also at the script level. Yet another characteristic is
dialectal variation, which is inherent to all languages and is also seen in Twitter data. The dialects can
be of various types, such as regional, religious and community based. Users tend to use their dialect,
and the words they use may not be in a dictionary. We need to preprocess the data to normalize
the vocabulary.</p>
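<p>The normalization step can be sketched as a small rule-based pass over the raw tweet text. The specific rules and the spelling lexicon below are illustrative assumptions for this sketch, not the actual preprocessing pipeline used in the task.</p>

```python
import re

def normalize_tweet(text, lexicon=None):
    """Light normalization for noisy microblog text (illustrative rules only)."""
    lexicon = lexicon or {}
    # Separate hashtags and handle mentions so they become standalone tokens.
    text = re.sub(r"([#@]\w+)", r" \1 ", text)
    # Collapse character elongations such as "sooooo" down to "soo".
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Map dialectal or non-standard spellings via a supplied lexicon.
    tokens = [lexicon.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

print(normalize_tweet("sooooo happy #modi@user gr8 day", {"gr8": "great"}))
# -> soo happy #modi @user great day
```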
<p>Tweets are generally very short and lack sufficient context to determine the antecedent of an anaphor.
Especially in the resolution of third person pronominals such as “he/they” (woh, ve, vo; avar), in at least 20%
of the cases the antecedent is not mentioned in the current tweet: it is either in a post made
a day earlier or is present in the troll, and is understood with world knowledge. An example tweet
is given below:</p>
      <p>HI: “@vijayrk modi sarkar ke baad garibi kam hui hai, bank wale ab usko bhi loan dena
shuru kiya hai”</p>
      <p>(“@vijayrk after Modi government poverty has reduced, now banks are giving loans to
them”)</p>
<p>In this tweet “usko” is the third person pronoun, and it refers to the poor people. The
antecedent for this pronoun can be identified only with world knowledge.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1. Corpus Collection</title>
<p>The corpus was collected using the Twitter API. Our aim was to collect conversations by recursively
retrieving the parent tweets to construct the full conversational tree structure. For each language, the tweets
for training were collected on two days in June and the tweets for testing on two days in August.
First, a set of tweets was collected using the event key phrases of that day, such as
“election campaign of US” or “Government announcements”, in the respective languages. After the first
set of tweets was collected, we identified whether these tweets were retweets or replies to other tweets,
using the “in_reply_to_status_id” field of the Tweet data structure (the Tweet object) of the Twitter API.
In this work we have not taken the retweets. For the tweets which are reply tweets (let us call these
RPT), we identify the original tweet (let us call this ORT) to which the reply was given,
using the tweet ID field. We collect the ORTs and link each ORT with the respective RPT.
Thus a chain is formed. We perform this iteratively; in this work we performed 5 iterations to
form a chain of tweets. Each tweet chain is considered one document or file.</p>
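<p>The chain construction described above can be sketched as follows. The dictionary-based tweet store and the field name <monospace>reply_to_id</monospace> are stand-ins for lookups against the real Twitter API objects; this is a sketch of the iteration, not the actual collection code.</p>

```python
# Each tweet is assumed to be a dict with "id", "text" and "reply_to_id"
# (None for standalone tweets); `store` stands in for the Twitter API.
MAX_ITERATIONS = 5

def build_chain(tweet, store, max_iter=MAX_ITERATIONS):
    """Follow reply links upward, collecting the ORT chain for a reply tweet (RPT)."""
    chain = [tweet]
    for _ in range(max_iter):
        parent_id = chain[-1].get("reply_to_id")
        if parent_id is None or parent_id not in store:
            break
        chain.append(store[parent_id])  # link the RPT with its ORT
    chain.reverse()  # oldest tweet first, forming one document
    return chain

store = {
    1: {"id": 1, "text": "original tweet", "reply_to_id": None},
    2: {"id": 2, "text": "first reply", "reply_to_id": 1},
    3: {"id": 3, "text": "second reply", "reply_to_id": 2},
}
chain = build_chain(store[3], store)
print([t["id"] for t in chain])
# -> [1, 2, 3]
```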
<p>We also use an additional method for identifying chains of tweets in the stream obtained
from the Twitter API. In this method, we check whether adjacent tweets have the same set of Twitter handle
mentions, such as @narendramodi or @BJP4UP. If the sets of handle mentions are the same,
it is a clue that the tweets may belong to the same discourse. Such tweets are analysed
manually, and if they are found to share the same discourse we make them a chain of tweets. One issue
we faced in data collection was that many tweets were standalone tweets for which no chain of tweets could
be identified. We have not considered such tweets.</p>
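<p>The handle-mention heuristic can be sketched as below. Tweets are plain strings here, and the manual-analysis step is reduced to returning candidate pairs for inspection; this is an illustration of the clue, not the full workflow.</p>

```python
import re

def handle_mentions(text):
    """Extract the set of @handle mentions from a tweet."""
    return set(re.findall(r"@\w+", text))

def same_discourse_candidates(tweets):
    """Flag adjacent tweets sharing the same handle mentions for manual checking."""
    pairs = []
    for prev, cur in zip(tweets, tweets[1:]):
        mentions = handle_mentions(prev)
        if mentions and mentions == handle_mentions(cur):
            pairs.append((prev, cur))
    return pairs

stream = [
    "@narendramodi @BJP4UP on the new scheme",
    "@BJP4UP @narendramodi agreed, good step",
    "unrelated standalone tweet",
]
print(same_discourse_candidates(stream))
```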
<p>The conversation data was manually created for Malayalam by two native speakers who are
trained in literature and linguistics. The created data was translated to Hindi, English and Tamil by
translators and then verified.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2. Corpus Annotation</title>
      <p>
        The corpus was first tokenized, as this is the initial step in all corpus creation for NLP. Here we used
a tokenizer developed in house for Indian languages. The tokenized data was given to PALinkA
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], an open-source annotation tool from the University of Wolverhampton; PALinkA stands for
Perspicuous and Adjustable Links Annotator. The corpus was annotated using the
PALinkA tool, following guidelines which treated all noun phrases (NPs) as
markables. PALinkA is a language-independent tool written in Java. It has been tested on Tamil and other Indian
languages, and it is user friendly: markables can be annotated simply by selecting them and clicking.
      </p>
<p>The input file to PALinkA has to be a well-formed XML file, and the produced output is also
well-formed XML. The pre-processed files, with all the syntactic information to be annotated, should be in XML
format. We have considered both anaphors and antecedents as markables. For annotation, the anaphor
and the antecedent are first marked as markables, and if a markable is anaphoric, a link is established between these
two markables. Finally, all the possible anaphors and antecedents are tagged with an index. After annotation,
these XML files are converted to column-format files, which are required for the machine learning
systems.</p>
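<p>As an illustration of this final conversion step, the sketch below writes one token per line with a tag marking anaphors (ANA) and antecedents (ANT) plus the link index. The exact column layout used in the task data is an assumption here.</p>

```python
def to_column_format(tokens):
    """tokens: list of (word, tag) pairs, where tag is e.g. "ANT-1", "ANA-1" or "O"."""
    return "\n".join(f"{word}\t{tag}" for word, tag in tokens)

# Markables from "John waited for Maria. They went for pizza.":
# "They" is the anaphor, "John" and "Maria" its (split) antecedent, link index 1.
tokens = [
    ("John", "ANT-1"), ("waited", "O"), ("for", "O"), ("Maria", "ANT-1"),
    (".", "O"), ("They", "ANA-1"), ("went", "O"), ("for", "O"),
    ("pizza", "O"), (".", "O"),
]
print(to_column_format(tokens))
```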
      <p>In this task, the annotators have to mark the referential links between entities in a text. Each anaphor
receives a unique ID, and a link between two entities is marked using these IDs. These IDs are
automatically managed by PALinkA.</p>
<p>The corpus was annotated by language editors who were either native speakers or had Master's-level
education in the language concerned. The corpora for Tamil and Malayalam were annotated by native speakers;
the English and Hindi corpora were annotated by language editors holding a Master's qualification in those
languages.</p>
<p>Each file was annotated by two language editors, and good agreement was observed between the
annotators: we obtained a kappa score of 0.95, showing high inter-annotator agreement.</p>
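<p>For reference, Cohen's kappa over two annotators' link decisions can be computed as in the sketch below; the example label sequences are invented for illustration.</p>

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' parallel label sequences."""
    n = len(labels_a)
    # Observed agreement: fraction of items the annotators label identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["link", "link", "no-link", "link"]
b = ["link", "no-link", "no-link", "link"]
print(round(cohen_kappa(a, b), 3))
# -> 0.5
```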
    </sec>
    <sec id="sec-5">
      <title>2.3. The Problems in Annotation</title>
<p>There were certain issues in annotation, namely split antecedents and missing antecedents. These
are important issues in anaphora resolution, and they are explained below with examples.</p>
      <p>i) Split antecedent: a split antecedent is an antecedent which consists of more than one NP.</p>
      <p>John waited for Maria. They went for pizza.</p>
      <p>In the above example, “they” refers to both John and Maria; John and Maria together are considered the split
antecedent.</p>
      <p>ii) Ellipsis: in the conversation data there were many elliptical constructions where the antecedent was
elided in the discourse. In such cases we annotated the pronoun, and the antecedent was marked
manually.</p>
    </sec>
    <sec id="sec-6">
<title>2.4. Corpus Statistics</title>
<p>The distribution of anaphors varies in each language, and it is necessary to have a minimum number
from each category in the corpus. The tables below give the various types of anaphors and their
representation in the training and testing corpora.</p>
    </sec>
    <sec id="sec-7">
      <title>3. Task Definition</title>
<p>The task proposed is to develop an anaphora resolution (AR) system from Twitter data and
conversation data annotated for anaphors and their antecedents. The languages considered are from the Indo-Aryan
and Dravidian families, namely Hindi, and Tamil and Malayalam respectively. We have also
provided an annotated corpus for the resource-rich language English, with the view that it could be used
for resolving anaphors in the less-resourced languages. The corpus was annotated for anaphors and their
antecedents; the data was tokenized and given in column format.</p>
    </sec>
    <sec id="sec-8">
      <title>3.1. Groups Registered</title>
<p>There were six groups registered who took the training and test data. Only one group submitted their runs.
The details of the groups and their affiliations are given in Table 6 below.</p>
      <table-wrap id="tab6">
        <label>Table 6</label>
        <caption>
          <p>Registered groups, their affiliations and the language data requested.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Name</th>
              <th>Team members</th>
              <th>Affiliation</th>
              <th>Language data requested</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Vijay Kumari</td>
              <td>Hriday Kedia, Vijay Kumari, Yashvardhan Sharma</td>
              <td>BITS Pilani</td>
              <td>All languages</td>
            </tr>
            <tr>
              <td>Pavan Kandru</td>
              <td>NIL</td>
              <td>iREL, IIIT Hyderabad</td>
              <td>All languages</td>
            </tr>
            <tr>
              <td>Abhinav Kumar</td>
              <td>NIL</td>
              <td>Shiksha 'O' Anusandhan, Deemed to be University, Bhubaneswar</td>
              <td>All languages</td>
            </tr>
            <tr>
              <td>Zengman Kou</td>
              <td>NIL</td>
              <td>Harbin Engineering University (HEU), Harbin, China</td>
              <td>English</td>
            </tr>
            <tr>
              <td>Yuning Zhang</td>
              <td>NIL</td>
              <td>Harbin Engineering University (HEU), Harbin, China</td>
              <td>English</td>
            </tr>
            <tr>
              <td>Bin Wang</td>
              <td>NIL</td>
              <td>Harbin Engineering University (HEU), Harbin, China</td>
              <td>Tamil</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-9">
      <title>3.2. The Group Submitted</title>
<p>The participants were asked to submit their test runs in the format given in the training data. Of
the six groups which took the data, only one group submitted a run; the others did not submit
runs.</p>
      <p>Only the group of Ms. Vijay Kumari from BITS Pilani, Pilani submitted a run, for the language
English. They used a statistical pretrained model, the NeuralCoref network by Hugging Face, for
English. They first identified the named entities, and then the model was trained on a set of
features. This training was done by taking a set of initial word embeddings and training them on the OntoNotes
corpus.</p>
    </sec>
    <sec id="sec-10">
      <title>4. Evaluation of Test Run</title>
      <p>The evaluation was done using the standard evaluation metrics: Precision, Recall and F-measure.</p>
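<p>As a minimal sketch, these metrics can be computed over anaphor-antecedent pairs with exact-match scoring; this scoring scheme is an illustrative assumption, not necessarily the task's official scorer.</p>

```python
def evaluate(gold_pairs, predicted_pairs):
    """Precision, recall and F-measure over (anaphor, antecedent) pairs."""
    gold, pred = set(gold_pairs), set(predicted_pairs)
    correct = len(gold.intersection(pred))
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f_measure

gold = [("They", "John Maria"), ("usko", "garib log")]
pred = [("They", "John Maria"), ("usko", "bank")]
print(evaluate(gold, pred))
# -> (0.5, 0.5, 0.5)
```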
      <table-wrap id="tab-eval">
        <caption>
          <p>Evaluation results of the submitted run.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Team</th>
              <th>Language</th>
              <th>Precision</th>
              <th>Recall</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Vijay Kumari, BITS Pilani</td>
              <td>English</td>
              <td>0.30</td>
              <td>0.25</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-11">
      <title>5. Conclusion</title>
<p>We have conducted the task on anaphora resolution for microblog data from Twitter and for
conversation data. There were six registrations, and the data was given to all six groups, but only one
group submitted a run; since it is a difficult task, many could not submit runs. We hope that in future
the data provided will help research in this area.</p>
    </sec>
    <sec id="sec-12">
      <title>6. Acknowledgements</title>
<p>The author would like to thank Dr. Pattabhi R K Rao, Mrs. Gracy L and Ms. Padmapriya for developing
the corpus.</p>
    </sec>
    <sec id="sec-13">
      <title>7. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Akilandeswari</surname>
            <given-names>A</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sobha Lalitha Devi</surname>
          </string-name>
          .
          <article-title>"Anaphora Resolution in Tamil Novels"</article-title>
          , In Rajendra Prasath, Philip O'Reilly, T. Kathirvalavakumar (Eds.),
          <source>Mining Intelligence and Knowledge Exploration</source>
          , Springer LNAI Vol
          <volume>8891</volume>
          pp.
          <fpage>268</fpage>
          -
          <lpage>277</lpage>
          , (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Akilandeswari</surname>
            <given-names>A.</given-names>
          </string-name>
          , and Sobha Lalitha Devi.
          <article-title>"Conditional Random Fields Based Pronominal Resolution in Tamil"</article-title>
          , In
          <source>International Journal on Computer Science and Engineering</source>
          , vol.
          <volume>5</volume>
          (
          <issue>6</issue>
          ):
          <fpage>601</fpage>
          -
          <lpage>610</lpage>
          , (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C</given-names>
            <surname>Kennedy</surname>
          </string-name>
          and
          <string-name>
            <given-names>B</given-names>
            <surname>Boguraev</surname>
          </string-name>
          .
          <article-title>“Anaphora for Everyone: Pronominal Anaphora Resolution without a Parser”</article-title>
          ,
          <source>in Proc. of the 16th International Conference on Computational Linguistics (COLING'96)</source>
          , Denmark, pp.
          <fpage>113</fpage>
          -
          <lpage>118</lpage>
          . (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mitkov</surname>
          </string-name>
          . “
          <article-title>Factors in Anaphora Resolution: They are not the only Things That Matter. A Case Study Based on Two Different Approaches”</article-title>
          ,
          <source>in Proc. of the ACL'97/EACL'97 Workshop on Operational Factors in Practical, Robust Anaphora Resolution, Spain</source>
          . pp.
          <fpage>14</fpage>
          -
          <lpage>21</lpage>
          . (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lappin</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Leass</surname>
          </string-name>
          . “
          <article-title>An Algorithm for Pronominal Anaphora Resolution”</article-title>
          ,
          <source>Computational Linguistics</source>
          , vol.
          <volume>20</volume>
          ,
          <issue>4</issue>
          . pp.
          <fpage>535</fpage>
          -
          <lpage>561</lpage>
          . (
          <year>1994</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Constantin</given-names>
            <surname>Orasan</surname>
          </string-name>
          . “
          <article-title>PALinkA: a highly customizable tool for discourse annotation”</article-title>
          .
          <source>In Proceedings of the 4th SIGdial Workshop on Discourse and Dialog</source>
          , pages
          <fpage>39</fpage>
          -
          <lpage>43</lpage>
          , Sapporo, Japan.
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Sobha Lalitha Devi</surname>
          </string-name>
          . “
          <article-title>Resolving Pronouns for a Resource-Poor Language, Malayalam using Resource-Rich Language, Tamil”</article-title>
          ,
          <source>In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP</source>
          <year>2019</year>
          ), pp.
          <fpage>611</fpage>
          -
          <lpage>618</lpage>
          , (
          <year>2019</year>
          ). https://doi.org/10.26615/978-954-452-056-4_072
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Vijay Sundar Ram</surname>
            <given-names>R.</given-names>
          </string-name>
          and Sobha Lalitha Devi. “
          <article-title>Pronominal Resolution in Tamil Using Tree CRFs”</article-title>
          ,
          <source>In Proceedings of 6th Language and Technology Conference</source>
          ,
          <article-title>Human Language Technologies as a challenge for Computer Science</article-title>
          and Linguistics - 2013, Poznan, Poland, LNAI pp.
          <fpage>333</fpage>
          -
          <lpage>337</lpage>
          . (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Sobha</given-names>
            <surname>Lalitha Devi</surname>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>'SocAnaRes-IL20: Anaphora Resolution from Social Media Text in Indian Languages @ FIRE 2020 - An Overview', In the Forum for Information Retrieval and Evaluation-2020, IDRBT</article-title>
          , Hyderabad, India.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>