<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Amrita CEN@FACT: Factuality Identification in Spanish Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Prabaharan Poornachandran</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Computational Engineering and Networking (CEN), Amrita School of Engineering</institution>
          ,
          <addr-line>Coimbatore, Amrita Vishwa Vidyapeetham</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Center for Cybersecurity Systems and Networks, Amrita School of Engineering</institution>
          ,
          <addr-line>Amritapuri, Amrita Vishwa Vidyapeetham</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>111</fpage>
      <lpage>118</lpage>
      <abstract>
        <p>This paper describes the system used by the team Amrita CEN for the shared task FACT (Factuality Analysis and Classification Task) at the IberLEF 2019 (Iberian Languages Evaluation Forum) workshop. The goal of the task was to automatically annotate an event with its factuality status, categorized into three classes: Fact, Counter Fact and Undefined. Our proposed system predicts the factuality of an event with an accuracy of 72.1%. The classification model for this task was trained using a Random Forest classifier which takes word embeddings of the events as input features. The word embedding of an event was generated using the Word2vec algorithm. Random Forest was implemented with higher weights for the minority classes and lower weights for the majority classes so that more instances of the minority class are predicted correctly.</p>
      </abstract>
      <kwd-group>
        <kwd>Factuality classification</kwd>
        <kwd>Spanish text</kwd>
        <kwd>Word2vec</kwd>
        <kwd>Weighted Random Forest</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In Natural Language Understanding (NLU), identifying the characteristics
of an event is of great significance. Factuality is one of the principal
characteristics of an event [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The factuality of an event indicates whether the event happened in the past
or is happening in the present. It also tells whether an event has not yet
happened or is merely imagined by the writer. However, in day-to-day conversation,
the factuality of an event is often expressed vaguely, leaving some
degree of ambiguity about its occurrence. This uncertainty is ubiquitous
in all sorts of situations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and hence makes automatic prediction a difficult
task. Accurate prediction of the factuality of an event is vital for deriving
knowledge related to that event. The understanding of an event
identified as a fact differs from the reasoning about an event
recognized as a counter fact or an undefined event [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Therefore, properly
categorizing events by their actual factuality is very important and is widely used
in many applications such as temporal organization of events, sentiment
analysis, opinion detection and question answering [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Despite its considerable
importance in NLU, this task is underexplored, especially in Spanish. Wonsever
et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Wonsever et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] put significant effort into developing annotated
corpora as well as automatic models for the analysis and classification of event
factuality in Spanish texts. However, this research is still in its fledgling stages.
      </p>
      <p>
        Factuality Analysis and Classification Task (FACT) is a shared task
organized as part of IberLEF 2019 for recognizing the factuality of events in
Spanish text. In this task, events are tagged with three labels: Fact, Counter
Fact and Undefined. The goal of the task was to encourage research in this
field through the development of computational models for the automatic
prediction of the factuality of an event. Our team, Amrita CEN, developed a machine
learning model which used Word2vec [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for extracting features from the
event words and the Random Forest algorithm [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for classification. We used a
weighted Random Forest algorithm [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] for classifying events because the
number of instances in the Counter Fact class was much smaller than in the other two classes
(Fact and Undefined). The performance of the model was evaluated using
macro-averaged F1-score and accuracy, and our model achieved scores
of 0.561 and 72.1% respectively.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Description of the task</title>
      <p>The objective of the shared task "FACT: Factuality Analysis and Classification
Task" was to classify the events expressed in Spanish texts as Fact, Counter
Fact or Undefined by taking their factuality status into account. Events in
the "Fact" category are those expressed as real in either past or present
circumstances. "Counter Fact" events are those which never happened, whereas
"Undefined" events are neither Fact nor Counter Fact because the author was
uncertain about the existence of such events.</p>
      <p>The training data contains 56 Spanish texts in which 4,343 events were
labelled as Fact (F), Counter Fact (CF) or Undefined (U). Among these labelled
events, the number of distinct event names was 2,053; 1,428 words in the
vocabulary occurred only once, and the most frequent word was "es" with 171
occurrences. The word "ha" also appeared more than 100 times, with 162
occurrences, as can be seen in Figure 1, which shows the 50 most frequent words
in the training data and their counts.</p>
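      <p>The frequency statistics above amount to a simple tally over the labelled event words. A minimal sketch, using a toy event list rather than the actual corpus:</p>

```python
from collections import Counter

def top_events(events, k):
    """Return the k most frequent event words with their counts."""
    return Counter(events).most_common(k)

# Toy stand-in for the 4,343 labelled event words in the training data
events = ["es"] * 4 + ["ha"] * 3 + ["dijo"] * 2 + ["fue"]
print(top_events(events, 2))  # → [('es', 4), ('ha', 3)]
```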
      <p>
        In the test data, there were 15 Spanish texts with 1,075 unannotated events.
Out of these 1,075 events, 715 were unique words, and 580 of these unique words
appeared only once in the dataset. Only 8 words occurred
more than 10 times. Another interesting trend in both datasets was
that, of the 20 most frequent events in the training set, 16 were also among
the top 20 most frequent events in the test data.
This trend can be observed in Figures 1 and 2.
The training and test datasets for the task were given as XML files.
The first task was to extract features for the events from the data and represent
them as vectors. Word embedding algorithms were used for this
representation. We tried both the Word2vec and FastText [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] algorithms with varying
embedding dimensions and observed that Word2vec performed better than
FastText in the classification. The parameters used for building the Word2vec
model are given in Table 2. We also observed that embedding
dimensions beyond 300 did not produce a significant change in the performance of the
classifiers.
The performance of the Support Vector Machine (SVM) was poor, from which we
concluded that the word vectors were not linearly separable. Among all the
classifiers, Random Forest achieved the best training accuracy. When the model
was trained with the word vectors as features, it was found that most of the data
points in class "CF" were classified as "F". The small number of CF instances
in the training data was the reason for this misclassification. The confusion
matrix obtained for this model is shown in Figure 3.
      </p>
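      <p>The feature-extraction step described above maps each event word to its embedding vector. A minimal sketch of the lookup, with a toy 3-dimensional embedding table standing in for a trained 300-dimensional Word2vec model; the zero-vector fallback for unseen event words is our assumption, not something the paper specifies:</p>

```python
def event_vector(word, embeddings, dim=300):
    """Look up the embedding of an event word; zero vector if unseen."""
    return embeddings.get(word, [0.0] * dim)

# Toy embedding table (a trained Word2vec model would supply dim=300 vectors)
toy_embeddings = {"es": [0.1, 0.2, 0.3], "ha": [0.4, 0.5, 0.6]}
print(event_vector("es", toy_embeddings, dim=3))     # → [0.1, 0.2, 0.3]
print(event_vector("nunca", toy_embeddings, dim=3))  # → [0.0, 0.0, 0.0]
```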
      <p>Even though the model gave a good training accuracy of 90.74%, we decided
to use a weighted Random Forest classifier for training, with the motivation of
increasing the classification accuracy of the minority class "CF". From the confusion
matrix in Figure 3, it is clear that only 43.92% of the "CF" class was correctly
classified as "CF". This may affect the performance of the system when tested on
unknown samples. Therefore, we applied a weighted Random Forest classifier.
It attained an overall accuracy of 88.46%, which is slightly lower than the
unweighted Random Forest accuracy. However, in the class-wise analysis,
most of the instances (71.76%) in "CF" were classified as "CF".
The confusion matrix for the weighted Random Forest is shown in Figure
4. The weights used for "CF", "F" and "U" were 5.68, 0.5 and 1.24 respectively,
computed using Equation 1.</p>
      <p>weights = number of instances / (number of classes × bincount(y)) (1)</p>
      <p>where bincount(y) is the number of instances in each class.</p>
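      <p>Equation 1 is the standard "balanced" class-weighting scheme, and can be sketched directly; the toy label counts below are illustrative, not the actual 4,343-event distribution:</p>

```python
from collections import Counter

def class_weights(y):
    """weights = n_instances / (n_classes * bincount(y)), per Equation 1."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

# Toy imbalanced labels: 2 CF, 8 F, 5 U (15 instances, 3 classes)
y = ["CF"] * 2 + ["F"] * 8 + ["U"] * 5
print(class_weights(y))  # → {'CF': 2.5, 'F': 0.625, 'U': 1.0}
```

      <p>A dictionary of this form can be passed as the class_weight parameter of scikit-learn's RandomForestClassifier (the paper cites scikit-learn [12]) to obtain the weighted Random Forest described above.</p>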
      <p>The training performance of both the unweighted and the weighted Random Forest
is given in Table 3. We used accuracy, macro-Precision, macro-Recall
and macro-F1-score for evaluating the trained models.</p>
      <p>The shared task organizers used macro-F1-score and accuracy for evaluating
the predictions of class labels on the test data. Six teams participated in the
contest, including the baseline system, of which our system scored the highest
in both macro-F1 and accuracy. The results are shown in Table 5.</p>
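      <p>The macro-averaged F1 used for evaluation is the unweighted mean of per-class F1 scores, so each class counts equally regardless of its size. A minimal pure-Python sketch, with toy predictions rather than the actual shared-task outputs:</p>

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy labels over the three task classes
y_true = ["F", "F", "CF", "CF", "U"]
y_pred = ["F", "CF", "CF", "CF", "U"]
print(round(macro_f1(y_true, y_pred), 3))  # → 0.822
```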
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>The identification of the factuality of an event is an important task in Natural
Language Understanding (NLU). The factuality of an event acts as an additional
feature for many Natural Language Processing (NLP) applications such as question
answering and opinion detection. Automatic identification of an event as Fact,
Counter Fact or Undefined is a multi-class classification problem. In this
paper, we used a weighted Random Forest classifier to learn patterns in
the data, which was represented using the Word2vec algorithm. The model obtained
an accuracy of 72.1% and a macro F1-score of 0.561 when tested on a set of
unknown events.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Rudinger</surname>, <given-names>Rachel</given-names></string-name>
          , Aaron Steven White, and Benjamin Van Durme,
          <source>Neural models of factuality</source>
          , arXiv preprint arXiv:1804.02472 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Saurí</surname>, <given-names>Roser</given-names></string-name>
          , and James Pustejovsky,
          <article-title>Are you sure that this happened? Assessing the factuality degree of events in text</article-title>
          ,
          <source>Computational Linguistics</source>
          ,
          <volume>38</volume>
          (
          <issue>2</issue>
          ),
          <fpage>261</fpage>
          -
          <lpage>299</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Saurí</surname>, <given-names>Roser</given-names></string-name>
          ,
          <article-title>A factuality profiler for eventualities in text</article-title>
          ,
          <source>Unpublished dissertation</source>
          , Brandeis University. Available at http://www.cs.brandeis.edu/ roser/pubs/sauriDiss (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Wonsever</surname>, <given-names>Dina</given-names></string-name>
          ,
          <string-name><surname>Malcuori</surname>, <given-names>Marisa</given-names></string-name>
          , and Aiala Rosá Furman,
          <article-title>Factividad de los eventos referidos en textos</article-title>
          ,
          <source>Reportes Técnicos 09-12</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Wonsever</surname>, <given-names>Dina</given-names></string-name>
          ,
          <string-name><surname>Rosá</surname>, <given-names>Aiala</given-names></string-name>
          , and
          <string-name><surname>Malcuori</surname>, <given-names>Marisa</given-names></string-name>
          ,
          <article-title>Factuality Annotation and Learning in Spanish Texts</article-title>
          , LREC (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Mikolov</surname>, <given-names>Tomas</given-names></string-name>
          , Kai Chen, Greg Corrado, and Jeffrey Dean,
          <article-title>Efficient estimation of word representations in vector space</article-title>
          ,
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><surname>Mikolov</surname>, <given-names>Tomas</given-names></string-name>
          , Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeffrey Dean,
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          ,
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Bojanowski</surname>
          </string-name>
          , Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov,
          <article-title>Enriching word vectors with subword information</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>5</volume>
          ,
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name><surname>Liaw</surname>, <given-names>Andy</given-names></string-name>
          , and Matthew Wiener,
          <article-title>Classification and regression by randomForest</article-title>
          ,
          <source>R News</source>
          ,
          <volume>2</volume>
          (
          <issue>3</issue>
          ),
          <fpage>18</fpage>
          -
          <lpage>22</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name><surname>Premjith</surname>, <given-names>B.</given-names></string-name>
          ,
          <string-name><surname>Soman</surname>, <given-names>K.P.</given-names></string-name>
          ,
          <string-name><surname>Kumar</surname>, <given-names>M.A.</given-names></string-name>
          and
          <string-name><surname>Ratnam</surname>, <given-names>D.J.</given-names></string-name>
          ,
          <article-title>Embedding Linguistic Features in Word Embedding for Preposition Sense Disambiguation in English-Malayalam Machine Translation Context</article-title>
          ,
          <source>Recent Advances in Computational Intelligence</source>
          , Springer, Cham,
          <fpage>341</fpage>
          -
          <lpage>370</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name><surname>Xie</surname>, <given-names>Yaya</given-names></string-name>
          ,
          <string-name><surname>Li</surname>, <given-names>Xiu</given-names></string-name>
          ,
          <string-name><surname>Ngai</surname>, <given-names>E. W. T.</given-names></string-name>
          , and Weiyun Ying,
          <article-title>Customer churn prediction using improved balanced random forests</article-title>
          ,
          <source>Expert Systems with Applications</source>
          , Elsevier,
          <volume>36</volume>
          (
          <issue>3</issue>
          ),
          <fpage>5445</fpage>
          -
          <lpage>5449</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name><surname>Pedregosa</surname>, <given-names>Fabian</given-names></string-name>
          , Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg and others,
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>