<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Amrita CEN@FACT: Factuality Identification in Spanish Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Prabaharan Poornachandran</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Computational Engineering and Networking (CEN), Amrita School of Engineering</institution>
          ,
          <addr-line>Coimbatore, Amrita Vishwa Vidyapeetham</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Center for Cybersecurity Systems and Networks, Amrita School of Engineering</institution>
          ,
          <addr-line>Amritapuri, Amrita Vishwa Vidyapeetham</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>111</fpage>
      <lpage>118</lpage>
      <abstract>
        <p>This paper describes the system used by the team Amrita CEN for the shared task FACT (Factuality Analysis and Classification Task) at the IberLEF 2019 (Iberian Languages Evaluation Forum) workshop. The goal of the task was to automatically annotate an event with its factuality status, categorized into three classes: Fact, Counter Fact and Undefined. Our proposed system predicts the factuality of an event with an accuracy of 72.1%. The classification model for this task was trained using a Random Forest classifier which takes word embeddings of the events as input features. The word embedding of an event was generated using the Word2vec algorithm. Random Forest was implemented with higher weights for the minority classes and lower weights for the majority classes so that more instances of the minority class are predicted correctly.</p>
      </abstract>
      <kwd-group>
        <kwd>Factuality classification</kwd>
        <kwd>Spanish text</kwd>
        <kwd>Word2vec</kwd>
        <kwd>Weighted Random Forest</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In Natural Language Understanding (NLU), identifying the characteristics
of an event is of great significance. Factuality is one of the principal
characteristics of an event [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The factuality of an event indicates whether the event happened in the past
or is happening in the present. It also tells whether an event has not yet
happened or is merely imagined by the writer. However, in day-to-day conversation,
the factuality of an event is often expressed vaguely, leaving some
degree of ambiguity about its occurrence. This uncertainty is ubiquitous
in all sorts of situations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and hence makes automatic prediction a difficult
task. Accurate prediction of the factuality of an event is vital for deriving
knowledge related to that event. The understanding of an event
identified as a fact differs from the reasoning about an event
recognized as a counter fact or an undefined event [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Therefore, properly
categorizing events by their actual factuality is very important and is widely used
in many applications such as temporal organization of events, sentiment
analysis, opinion detection and question answering [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Despite its considerable
importance in NLU, this task is underexplored, especially in Spanish. Wonsever
et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Wonsever et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] put significant effort into developing annotated
corpora as well as automatic models for the analysis and classification of event
factuality in Spanish texts. However, this research is still in its fledgling stages.
      </p>
      <p>
        Factuality Analysis and Classification Task (FACT) is a shared task
organized as part of IberLEF 2019 for recognizing the factuality of events in
Spanish text. In this task, events are tagged with three labels: Fact, Counter
Fact and Undefined. The goal of the task was to encourage research in this
field through the development of computational models for the automatic
prediction of the factuality of an event. Our team, Amrita CEN, developed a machine
learning model which used Word2vec [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for extracting features from the
event words and the Random Forest algorithm [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for classification. We used a
weighted Random Forest algorithm [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] for classifying events because the
number of instances in the Counter Fact class was much smaller than in the other two classes
(Fact and Undefined). The performance of the model was evaluated using
macro-averaged F1-score and accuracy, and our model achieved scores
of 0.561 and 72.1% respectively.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Description of the task</title>
      <p>The objective of the shared task "FACT: Factuality Analysis and Classification
Task" was to classify the events expressed in Spanish texts as Fact, Counter
Fact or Undefined by taking their factuality status into account. Events in
the "Fact" category are those expressed as real in either past or present
circumstances. "Counter Fact" events are those which never happened, whereas
"Undefined" events are neither Fact nor Counter Fact because the author was
uncertain about the existence of such events.</p>
      <p>The training data contains 56 Spanish texts in which 4,343 events were
labelled as Fact (F), Counter Fact (CF) or Undefined (U). Among these labelled
events, the number of distinct event names was 2,053; 1,428 words in the
vocabulary occurred only once, and the most frequent word was "es" with 171
occurrences. The word "ha" also appeared more than 100 times, with 162
occurrences, as can be seen in Figure 1, which shows the 50 most frequent words
in the training data and their counts.</p>
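      <p>The frequency statistics above amount to a simple tally over the labelled event words. A minimal sketch, using a toy event list rather than the actual corpus:</p>

```python
from collections import Counter

def top_events(events, k):
    """Return the k most frequent event words with their counts."""
    return Counter(events).most_common(k)

# Toy stand-in for the 4,343 labelled event words in the training data
events = ["es"] * 4 + ["ha"] * 3 + ["dijo"] * 2 + ["fue"]
print(top_events(events, 2))  # → [('es', 4), ('ha', 3)]
```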
      <p>
        In the test data, there were 15 Spanish texts with 1,075 unannotated events.
Out of these 1,075 events, 715 were unique words, and 580 of these unique words
appeared only once in the dataset. Only 8 words occurred
more than 10 times. Another interesting trend in both datasets was
that, of the 20 most frequent events in the training set, 16 were also among
the top 20 most frequent events in the test data.
This trend can be observed in Figures 1 and 2.
The training and test datasets for the task were given as XML files.
The first task was to extract features for the events from the data and represent
them as vectors. Word embedding algorithms were used for this
representation. We tried both the Word2vec and FastText [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] algorithms with varying
embedding dimensions and observed that Word2vec performed better than
FastText in the classification. The parameters used for building the Word2vec
model are given in Table 2. We also observed that embedding
dimensions beyond 300 did not produce a significant change in the performance of the
classifiers.
The performance of the Support Vector Machine (SVM) was poor, from which we
concluded that the word vectors were not linearly separable. Among all the
classifiers, Random Forest achieved the best training accuracy. When the model
was trained with the word vectors as features, it was found that most of the data
points in class "CF" were classified as "F". The small number of CF instances
in the training data was the reason for this misclassification. The confusion
matrix obtained for this model is shown in Figure 3.
      </p>
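      <p>The feature-extraction step described above maps each event word to its embedding vector. A minimal sketch of the lookup, with a toy 3-dimensional embedding table standing in for a trained 300-dimensional Word2vec model; the zero-vector fallback for unseen event words is our assumption, not something the paper specifies:</p>

```python
def event_vector(word, embeddings, dim=300):
    """Look up the embedding of an event word; zero vector if unseen."""
    return embeddings.get(word, [0.0] * dim)

# Toy embedding table (a trained Word2vec model would supply dim=300 vectors)
toy_embeddings = {"es": [0.1, 0.2, 0.3], "ha": [0.4, 0.5, 0.6]}
print(event_vector("es", toy_embeddings, dim=3))     # → [0.1, 0.2, 0.3]
print(event_vector("nunca", toy_embeddings, dim=3))  # → [0.0, 0.0, 0.0]
```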
      <p>Even though the model gave a good training accuracy of 90.74%, we decided
to use a weighted Random Forest classifier for training, with the motivation of
increasing the classification accuracy of the minority class "CF". From the confusion
matrix in Figure 3, it is clear that only 43.92% of the "CF" class was correctly
classified as "CF". This may affect the performance of the system when tested on
unknown samples. Therefore, we applied a weighted Random Forest classifier.
It attained an overall accuracy of 88.46%, which is slightly lower than the
unweighted Random Forest accuracy. However, in the class-wise analysis,
most of the instances (71.76%) in "CF" were classified as "CF".
The confusion matrix for the weighted Random Forest is shown in Figure
4. The weights used for "CF", "F" and "U" were 5.68, 0.5 and 1.24 respectively,
computed using Equation 1.</p>
      <p>weights = number of instances / (number of classes × bincount(y)) (1)</p>
      <p>where bincount(y) is the number of instances in each class.</p>
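      <p>Equation 1 is the standard "balanced" class-weighting scheme, and can be sketched directly; the toy label counts below are illustrative, not the actual 4,343-event distribution:</p>

```python
from collections import Counter

def class_weights(y):
    """weights = n_instances / (n_classes * bincount(y)), per Equation 1."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

# Toy imbalanced labels: 2 CF, 8 F, 5 U (15 instances, 3 classes)
y = ["CF"] * 2 + ["F"] * 8 + ["U"] * 5
print(class_weights(y))  # → {'CF': 2.5, 'F': 0.625, 'U': 1.0}
```

      <p>A dictionary of this form can be passed as the class_weight parameter of scikit-learn's RandomForestClassifier (the paper cites scikit-learn [12]) to obtain the weighted Random Forest described above.</p>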
      <p>The training performance of both the unweighted and the weighted Random Forest
is given in Table 3. We used accuracy, macro-Precision, macro-Recall
and macro-F1-score for evaluating the trained models.</p>
      <p>The shared task organizers used macro-F1-score and accuracy for evaluating
the predictions of class labels on the test data. Six teams participated in the
contest, including the baseline system, of which our system scored the highest
in both macro-F1 and accuracy. The results are shown in Table 5.</p>
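      <p>The macro-averaged F1 used for evaluation is the unweighted mean of per-class F1 scores, so each class counts equally regardless of its size. A minimal pure-Python sketch, with toy predictions rather than the actual shared-task outputs:</p>

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy labels over the three task classes
y_true = ["F", "F", "CF", "CF", "U"]
y_pred = ["F", "CF", "CF", "CF", "U"]
print(round(macro_f1(y_true, y_pred), 3))  # → 0.822
```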
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>The identification of the factuality of an event is an important task in Natural
Language Understanding (NLU). The factuality of an event acts as an additional
feature for many Natural Language Processing (NLP) applications such as question
answering and opinion detection. Automatic identification of an event as Fact,
Counter Fact or Undefined is a multi-class classification problem. In this
paper, we used a weighted Random Forest classifier to learn patterns in
the data, which was represented using the Word2vec algorithm. The model obtained
an accuracy of 72.1% and a macro F1-score of 0.561 when tested on a set of
unknown events.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Rudinger</surname>, <given-names>Rachel</given-names></string-name>
          , Aaron Steven White, and Benjamin Van Durme,
          <source>Neural models of factuality</source>
          , arXiv preprint arXiv:1804.02472 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Saurí</surname>, <given-names>Roser</given-names></string-name>
          , and James Pustejovsky,
          <article-title>Are you sure that this happened? Assessing the factuality degree of events in text</article-title>
          ,
          <source>Computational Linguistics</source>
          ,
          <volume>38</volume>
          (
          <issue>2</issue>
          ),
          <fpage>261</fpage>
          -
          <lpage>299</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Saurí</surname>, <given-names>Roser</given-names></string-name>
          ,
          <article-title>A factuality profiler for eventualities in text</article-title>
          ,
          <source>Unpublished dissertation</source>
          , Brandeis University. Available at http://www.cs.brandeis.edu/ roser/pubs/sauriDiss (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Wonsever</surname>, <given-names>Dina</given-names></string-name>
          ,
          <string-name><surname>Malcuori</surname>, <given-names>Marisa</given-names></string-name>
          , and Aiala Rosá Furman,
          <article-title>Factividad de los eventos referidos en textos</article-title>
          ,
          <source>Reportes Técnicos 09-12</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Wonsever</surname>, <given-names>Dina</given-names></string-name>
          ,
          <string-name><surname>Rosá</surname>, <given-names>Aiala</given-names></string-name>
          , and
          <string-name><surname>Malcuori</surname>, <given-names>Marisa</given-names></string-name>
          ,
          <article-title>Factuality Annotation and Learning in Spanish Texts</article-title>
          , LREC (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Mikolov</surname>, <given-names>Tomas</given-names></string-name>
          , Kai Chen, Greg Corrado, and Jeffrey Dean,
          <article-title>Efficient estimation of word representations in vector space</article-title>
          ,
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><surname>Mikolov</surname>, <given-names>Tomas</given-names></string-name>
          , Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeffrey Dean,
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          ,
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Bojanowski</surname>
          </string-name>
          , Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov,
          <article-title>Enriching word vectors with subword information</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>5</volume>
          ,
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name><surname>Liaw</surname>, <given-names>Andy</given-names></string-name>
          , and Matthew Wiener,
          <article-title>Classification and regression by randomForest</article-title>
          ,
          <source>R News</source>
          ,
          <volume>2</volume>
          (
          <issue>3</issue>
          ),
          <fpage>18</fpage>
          -
          <lpage>22</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name><surname>Premjith</surname>, <given-names>B.</given-names></string-name>
          ,
          <string-name><surname>Soman</surname>, <given-names>K.P.</given-names></string-name>
          ,
          <string-name><surname>Kumar</surname>, <given-names>M.A.</given-names></string-name>
          and
          <string-name><surname>Ratnam</surname>, <given-names>D.J.</given-names></string-name>
          ,
          <article-title>Embedding Linguistic Features in Word Embedding for Preposition Sense Disambiguation in English-Malayalam Machine Translation Context</article-title>
          ,
          <source>Recent Advances in Computational Intelligence</source>
          , Springer, Cham,
          <fpage>341</fpage>
          -
          <lpage>370</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name><surname>Xie</surname>, <given-names>Yaya</given-names></string-name>
          ,
          <string-name><surname>Li</surname>, <given-names>Xiu</given-names></string-name>
          ,
          <string-name><surname>Ngai</surname>, <given-names>E. W. T.</given-names></string-name>
          , and Weiyun Ying,
          <article-title>Customer churn prediction using improved balanced random forests</article-title>
          ,
          <source>Expert Systems with Applications</source>
          , Elsevier,
          <volume>36</volume>
          (
          <issue>3</issue>
          ),
          <fpage>5445</fpage>
          -
          <lpage>5449</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name><surname>Pedregosa</surname>, <given-names>Fabian</given-names></string-name>
          , Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg and others,
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>