A Hybrid Agent for Automatically Determining and Extracting the 5Ws of Filipino News Articles

Evan Dennison S. Livelo, Andrea Nicole O. Ver, Jedrick L. Chua, John Paul S. Yao, Charibeth K. Cheng
evan dennison livelo@dlsu.edu.ph, andrea nicole ver@dlsu.edu.ph, jedrick chua@dlsu.edu.ph, john paul yao@dlsu.edu.ph, charibeth.cheng@dlsu.edu.ph
De La Salle University - Manila

Abstract

As the number of sources of unstructured data continues to grow exponentially, manually reading through all this data becomes notoriously time consuming. Thus, there is a need for faster understanding and processing of this data. This can be achieved by automating the task through the use of information extraction. In this paper, we present an agent that automatically detects and extracts the 5Ws, namely the who, when, where, what, and why, from Filipino news articles using a hybrid of machine learning and linguistic rules. The agent caters specifically to the Filipino language by working with its unique features such as ambiguous prepositions and markers, focus instead of subject and predicate, dialect influences, and others. In order to be able to maximize machine learning algorithms, techniques such as linguistic tagging and weighted decision trees are used to pre-process and filter the data as well as refine the final results. The results show that the agent achieved an accuracy of 63.33% for who, 71.38% for when, 58.25% for where, 89.20% for what, and 50.00% for why.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: Proceedings of the 4th International Workshop on Semantic Machine Learning (SML 2017), co-located with IJCAI 2017, 20th August 2017, Melbourne, Australia.

1 Introduction

Information can be found in various types of media and documents such as news [Cheng et al., 2016] and legal documents [De Araujo et al., 2013]. These documents provide different types of data beneficial to people ranging from field-specific professionals to the everyday newspaper readers. Thus, from the seemingly endless sea of unstructured data, it is important to be able to determine the appropriate information needed quickly and efficiently.

The process of automatically identifying and retrieving information from unstructured sources and structuring the information in a usable format is called Information Extraction. This task involves the use of natural language processing in the analysis of unstructured sources to identify relevant data such as named entities and word phrases through operations including tokenization, sentence segmentation, named-entity recognition (NER), part-of-speech (POS) tagging, and word scoring. This is applied to various fields such as legal documents [De Araujo et al., 2013], work e-mails [Xubu and Guo, 2014], and news articles [Cheng et al., 2016].

Our information extraction agent automatically extracts the who, when, where, what, and why of Filipino news articles. Who pertains to people, groups, or organizations involved in the main event of the news article. When refers to the date and time that the main event of the news article occurred. Where refers to the location where the main event took place. There can be one or more who, when, and where features in an article. On the other hand, what is the main event that took place, while why is the reason the main event happened. There can only be one what and why for each article. Moreover, it is possible that there are no who, when, where, what, or why features in an article if one does not exist. Figure 1 shows a sample article translated in English with the corresponding 5Ws.
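To make the cardinality of the 5Ws concrete, the sketch below shows one possible way to represent a 5W record for an article. It is an illustrative data structure only, not the authors' implementation; the class and field names are our own.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FiveWRecord:
    """5W annotations for one article: who/when/where may hold several
    values (or none), while what and why hold at most one value each."""
    who: List[str] = field(default_factory=list)    # people, groups, or organizations
    when: List[str] = field(default_factory=list)   # dates and times of the main event
    where: List[str] = field(default_factory=list)  # locations of the main event
    what: Optional[str] = None                       # the main event itself
    why: Optional[str] = None                        # the reason the main event happened

# Illustrative example: several who values, one what, no why detected.
sample = FiveWRecord(who=["police", "mayor"], when=["Lunes"], what="road closure announced")
```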
Figure 1: Sample Article Translated to English

However, the grammar of English and Filipino are not the same. Some of the nuances encountered in the latter are the differences in focus-subject order (i.e. verb first before performer) as well as the presence of ambiguous prepositions (i.e. "sa" can be applied to either a location or a date). Moreover, due to this, automatic translation of large data from Filipino to English is not feasible. Thus, our agent was designed to recognize and handle these linguistic features through a combination of machine-learned models and rule-based algorithms.

The results of this research can greatly benefit individuals and organizations reliant on Filipino newspapers such that they will be able to determine and aggregate essential information based on main events (as compared to mere presence) quickly and efficiently. Moreover, the research contributes an advancement in the field of natural language processing and semantic machine learning for the Filipino language.

2 Related Works

Information extraction has been performed in several previous studies dealing with a variety of languages and retrieving different kinds of information.

In a study by [De Araujo et al., 2013], 200 legal documents written in Portuguese concerning cases that transpired in the RS State Superior Court were analyzed in order to determine the events that occurred. The events examined in these documents included formal charges, acquittal, conviction, and questioning. In addition, the study discussed how they put the legal documents through a deep linguistic parser and then represented the tokens in a web ontology language or OWL using a linguistic data model. Moreover, they described how, after running documents through the deep linguistic parser and converting to OWL format, they formulated linguistic rules using morphological, syntactical, and part-of-speech (POS) information and integrated these with domain knowledge in order to produce a generally accurate information extraction system.

Likewise, the study of [Xubu and Guo, 2014] described how they extracted information from descriptive text involving enterprise strategies, such as e-mail, personal communication, and management documents, through manual information extraction rule definitions in order to determine the efficiency of strategic execution. Our agent also utilizes various rules and grammatical information such as POS and text markers for linguistic tagging.

Similarly, [Das et al., 2010] also adopted rule-based information extraction in order to improve the overall accuracy of their information extraction system. However, unlike [De Araujo et al., 2013] and [Xubu and Guo, 2014], they also used machine learning. They applied machine learning to their information extraction system through the use of a gold standard created from the matching answers of two annotators.

In 2012, [Dieb et al., 2012] discussed how they used part-of-speech (POS) tagging as well as regular expressions to parse texts and determine orthogonal features within the considered nanodevice research documents. In addition, they discussed how, after tokenizing and parsing the research papers, they made use of YamCha, a text chunk annotator, for machine learning in order to automatically determine the category or tag of each piece of parsed data (e.g. Source Material, Experiment Parameter) within an annotation. Our agent also learns by example through several machine-learned classification algorithms derived from annotated Filipino news articles.

Furthermore, in the field of Filipino news, the research of [Cheng et al., 2016] extracted the 5Ws from Filipino editorials through a rule-based system that determines the possible candidates for each W and uses weights to choose among the list of candidates. They reported a performance of 6.06% accuracy for who, 84.39% for when, 19.51% for where, 0.00% for what, and 50.00% for why. However, their test corpus is composed of mostly true negatives and thus, there are only a few examples as basis for implementation. Moreover, the candidates were subjected to minimal processing and filtering. Therefore, problems such as difficulty in identifying correct candidates and low precision are present.
3 Information Extraction Agent Implementation

Figure 2 shows the architecture of the hybrid information extraction agent. A hybrid approach was implemented by means of utilizing a combination of machine learning techniques and rule-based algorithms.

Figure 2: Hybrid Information Extraction Agent Architecture

A file containing a corpus of Filipino news articles acts as the agent's environment. The agent scans through the environment and gets all the Filipino news articles. Each article is then parsed and stored internally as a word table, which contains tokens with the corresponding position, POS and NER tags, and word score. The word table is passed to the candidate selection and feature extraction module to get the final who, when, where, what, and why for each article. The results are passed to the actuator that writes the corresponding annotations to the environment, which in turn saves the file and generates an inverted index file (Figure 3).

Figure 3: Sample Inverted Index File

3.1 Linguistic Tagging

Linguistic tagging is first applied to each news article and the parsed data is stored in a word table. The body of the article is initially segmented into its composite sentences and then individually tokenized. Each token is processed in order to determine the following information:

1. Part-of-speech tag, e.g. proper noun (NNP), preposition (IN), determiner (DT)
2. Named-entity tag, which includes person (PER), location (LOC), date (DATE), and organization (ORG)
3. Word score or frequency count

In order to assign each token its corresponding part-of-speech tag, a tagger was implemented using a model trained on news-relevant datasets from TPOST, a Tagalog Part-of-Speech Tagger [Rabo, 2004].

For named-entity recognition, each token is evaluated and assigned (if applicable) as a PER, LOC, DATE, or ORG. This process utilizes a Stanford NER model trained on 200 Filipino news articles.

Lastly, under linguistic tagging, word scoring is performed. Word scoring utilizes term frequency and counts how many times a token or word is encountered in an article.
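The following is a minimal sketch of how the word table described above could be assembled. The `pos_tag` and `ner_tag` arguments are placeholders standing in for the TPOST-based tagger and the Stanford NER model; the entry fields mirror the ones listed in this subsection, but the interfaces are assumptions rather than the authors' code.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class WordEntry:
    token: str
    position: int            # running token position within the article
    pos_tag: str              # e.g. NNP, IN, DT
    ner_tag: Optional[str]    # PER, LOC, DATE, ORG, or None
    word_score: int           # term frequency within the article

def build_word_table(sentences: List[List[str]], pos_tag, ner_tag) -> List[WordEntry]:
    """Builds one WordEntry per token from already-segmented, tokenized sentences.
    pos_tag(sent) and ner_tag(sent) are assumed to return one tag per token."""
    all_tokens = [tok for sent in sentences for tok in sent]
    frequencies = Counter(tok.lower() for tok in all_tokens)   # word scoring via term frequency
    table, position = [], 0
    for sent in sentences:
        pos_tags = pos_tag(sent)
        ner_tags = ner_tag(sent)
        for tok, pos, ner in zip(sent, pos_tags, ner_tags):
            table.append(WordEntry(tok, position, pos, ner, frequencies[tok.lower()]))
            position += 1
    return table
```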
3.2 Candidate Selection

Even though the articles have named-entity tags assigned to particular words, these are not enough indicators of candidates. This is because named-entity tags do not consider grammatical information and, consequently, common nouns. Moreover, what and why candidates are sentence fragments that are composed of a variety of word tokens with different part-of-speech and named-entity tags, further indicating the need for the agent to perform candidate selection.

We therefore use a rule-based approach to select possible candidates for the final who, when, where, what, and why of each article. A word or phrase is a who, when, and where candidate when:

1. It is a noun or noun phrase
2. The word or phrase acts as a subject within the article
3. For proper nouns, it has a PER or ORG named-entity tag for who, a DATE or TIME named-entity tag for when, and a LOC named-entity tag for where
4. For common nouns, it is encapsulated by neighbouring markers including Filipino determiners, conjunctions, adverbs, and punctuation

On the other hand, for the what, the agent simply chooses the first two sentences of the article's body as candidates. Lastly, for the why, the agent runs through the first six sentences of the article's body. Sentences where why feature markers are found are considered as the why candidates of the article.
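A simplified sketch of these selection rules is given below. The subject test and the marker-boundedness test are passed in as pre-computed flags, and the reason markers listed are illustrative Filipino cause markers we supply for the example; the agent's actual marker lists and phrase-chunking rules are not published in this paper.

```python
from typing import List

REASON_MARKERS = {"dahil", "sapagkat", "upang"}   # illustrative why markers only

def is_w_candidate(phrase_pos: List[str], phrase_ner: List[str], target: str,
                   acts_as_subject: bool, marker_bounded: bool) -> bool:
    """Simplified who/when/where candidate test; target is 'who', 'when', or 'where'."""
    wanted = {"who": {"PER", "ORG"}, "when": {"DATE", "TIME"}, "where": {"LOC"}}[target]
    is_noun_phrase = any(tag.startswith("NN") for tag in phrase_pos)
    if not (is_noun_phrase and acts_as_subject):
        return False
    if any(tag in wanted for tag in phrase_ner):   # proper-noun route via NER tags
        return True
    return marker_bounded                           # common-noun route: bounded by markers

def what_candidates(sentences: List[str]) -> List[str]:
    return sentences[:2]                            # first two sentences of the body

def why_candidates(sentences: List[str]) -> List[str]:
    return [s for s in sentences[:6]                # first six sentences of the body
            if any(m in s.lower().split() for m in REASON_MARKERS)]
```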
3.3 Feature Extraction

Feature extraction is then performed to narrow down the candidate pool of the who, when, where, what, and why in order to get the final results. A machine-learned model was trained and used for the who, when, where, and why, while a rule-based algorithm was developed for the what. The machine-learning algorithms tested include J48, Naive Bayes, and Support Vector Machines. Variations were also tested using boosting, bagging, and stacking. Moreover, several iterations involving feature engineering and parameter fine tuning were done to get the optimal results for each algorithm based on true positive and accuracy rates, among others.

Each of the who, when, where, and why candidates passes through a machine-learned model which determines whether or not it is a final result. The models were generated using a gold standard composed of annotated Filipino news articles. Before being fed to the machine learning algorithm, however, the gold standard articles are pre-processed and filtered into candidates as discussed previously in order to represent the data in a way such that the model can establish patterns better.

In order to do this, the gold standard articles were put through the same candidate selection module discussed previously and corresponding linguistic features were assigned to each candidate. The list of features that were tested includes the following:

1. The candidate string
2. The number of words in the candidate
3. The sentence which the candidate belongs to
4. The numeric position of the candidate in the article
5. The distance of the candidate from the beginning of the sentence it belongs to
6. The frequency count of the candidate
7. 10 neighbouring word strings before and after the candidate
8. The part-of-speech tags of the aforementioned neighbouring words

In order to determine the class attribute (whether or not it is a final W), the candidate was matched against the annotations found in the gold standard. If it matches, the class attribute is set to yes. Otherwise, it is set to no. These candidates, and their corresponding features, were used to train several models using different algorithms for testing. The features to be considered varied among the Ws, since not all of the listed features were proven useful in choosing the who, when, and where results. Table 1 shows the final feature sets.

Table 1: Final feature sets for who, when and where

                                   Who   When   Where
Candidate String                    X     X      X
No. of Words                        X     X      X
Sentence No.                        X     X      X
Position                            X
Proximity                           X
Word Score                          X     X      X
No. of Neighbouring Words           10    3      10
and their POS Tags

Furthermore, the algorithm that showed the best true positive and accuracy rates is J48 with boosting for who and J48 with bagging for when and where. The model evaluates each candidate by assigning it an acceptance probability as well as a rejection probability. If a candidate's acceptance probability is higher than its rejection probability, it is added to the final who, when, and where results. The when and where are extracted in a similar way to the who except for a few differences in parameters, values, and implementation.
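As a rough illustration of how one labeled training instance could be assembled for these classifiers, the sketch below pairs the listed features with a yes/no class attribute. The exact matching criterion against the gold standard and the feature names are our assumptions, not the authors' specification.

```python
from typing import Dict, List

def make_training_instance(candidate: str, sentence_no: int, word_score: int,
                           neighbours: List[str], neighbour_pos: List[str],
                           gold_annotations: List[str]) -> Dict:
    """One labeled instance for the who/when/where/why classifiers.
    The class attribute is 'yes' when the candidate matches a gold annotation."""
    return {
        "candidate_string": candidate,
        "no_of_words": len(candidate.split()),
        "sentence_no": sentence_no,
        "word_score": word_score,
        "neighbouring_words": neighbours,        # 10 tokens before and after the candidate
        "neighbouring_pos_tags": neighbour_pos,
        "class": "yes" if candidate in gold_annotations else "no",   # assumed exact match
    }
```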
On the other hand, for what, a weighting scheme was implemented in order to determine the best candidate. This was done since we found that determining the what is more straightforward than the other Ws. Thus, feature engineering and fine tuning a machine-learned model for this W is unnecessary and may even cause unnecessary complexities.

The implementation firstly determines the presence of the extracted who, when, and where, adding 0.3, 0.3, and 0.2 respectively to a candidate's score. The weights were chosen after several experimental iterations starting with neutral arbitrary weights of 0.5 each. Secondly, the sentence number is considered. The formula for computing the additional weight based on the sentence number is given below.

weight = 1 − (0.2 × sentenceNumber)    (1)

If the extracted who, when, and where found in the candidate are present in the title, an additional 0.1 is added to the candidate score.

The candidates are then trimmed based on the presence of a list of markers composed of Filipino adverbs and conjunctions that denote cause and effect. If one of the markers is found within the candidate, the candidate is trimmed. If the marker found is a beginning marker, all words before the marker, including the marker itself, are removed. On the other hand, if the marker is an ending marker, all words after the marker, including the marker itself, are removed.

The candidate with the highest weight is then chosen as the final what result for that article.

Lastly, for why, the candidates first undergo trimming and weighting. This is done since the machine-learned models are limited to the data that is fed to them. Thus, they require an associated rule-based algorithm to pre-process the data before it is used for training or classification.

Words that come before starting markers and after ending markers are removed from the candidate. The presence of the extracted what and of the markers was also given additional weight. The final feature set used in feature extraction of the why included the following:

1. The candidate string
2. The number of words in the candidate
3. The sentence which the candidate belongs to
4. The candidate's weighted score
5. The number of who features in the candidate
6. The number of when features in the candidate
7. The number of where features in the candidate
8. 10 neighbouring word strings before and after the candidate
9. The part-of-speech tags of the aforementioned neighbouring words

Furthermore, the algorithm that showed the best true positive and accuracy rates for the why is J48.
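The weighting and trimming scheme for the what described earlier in this subsection can be sketched as follows. The weights (0.3, 0.3, 0.2, the sentence-number formula of Equation 1, and the 0.1 title bonus) come from the text above; the marker lists, the substring matching, and the 1-based sentence numbering are our assumptions.

```python
from typing import List, Optional

# Illustrative cause/effect markers; the agent's actual marker list is larger.
BEGIN_MARKERS = {"dahil", "sapagkat"}
END_MARKERS = {"kaya"}

def what_score(candidate: str, sentence_number: int, who: List[str],
               when: List[str], where: List[str], title: str) -> float:
    """Weighting scheme for what candidates; sentence_number is assumed 1-based."""
    score = 0.0
    score += 0.3 if any(w in candidate for w in who) else 0.0     # extracted who present
    score += 0.3 if any(w in candidate for w in when) else 0.0    # extracted when present
    score += 0.2 if any(w in candidate for w in where) else 0.0   # extracted where present
    score += 1 - (0.2 * sentence_number)                          # Equation (1)
    if any(w in title for w in who + when + where if w in candidate):
        score += 0.1                                              # extracted W also in the title
    return score

def trim(candidate: str) -> str:
    """Drops everything up to and including a beginning marker, and
    everything from an ending marker onwards."""
    tokens = candidate.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in BEGIN_MARKERS:
            tokens = tokens[i + 1:]
            break
    for i, tok in enumerate(tokens):
        if tok.lower() in END_MARKERS:
            tokens = tokens[:i]
            break
    return " ".join(tokens)

def choose_what(candidates: List[str], who, when, where, title) -> Optional[str]:
    """Scores the first-two-sentence candidates, trims them, and keeps the best."""
    best, best_score = None, float("-inf")
    for number, candidate in enumerate(candidates, start=1):
        score = what_score(candidate, number, who, when, where, title)
        if score > best_score:
            best, best_score = trim(candidate), score
    return best
```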
4 Results and Observations

4.1 Gold Standard

In order to train and evaluate the agent, a gold standard was created. This gold standard is composed of 250 Filipino news articles retrieved from the study of [Regalado et al., 2013]. Each article was manually annotated with its 5Ws by four annotators. For each disagreement where only two or fewer annotators agreed, the four annotators deliberated the best annotation. In the case that the decision was split, the annotation was discarded and left blank, denoting ambiguity. The resulting annotated corpus was then qualitatively evaluated by a literary expert.

Table 2: Inter-annotator agreement for the who, when, where, what and why

Feature   Value
Who       59.35%
When      61.25%
Where     71.00%
What      74.40%
Why       70.40%

Based on the inter-annotator agreement, the who and when proved to be more ambiguous than the rest. Since, based on the observations of the annotations, the what can be found in the first two sentences, the annotators found it easier to choose the annotation for this and thus there was more agreement. On the other hand, because there are many possible who and when in an article, the annotators may have had a harder time choosing all the relevant who and when in an article, thus leading to more disagreement. There is also a possibility of finding more than one possible where in an article, but based on the results it was easier for the annotators to identify the where in a given article.

4.2 Evaluation

After implementing the agent, the agent's results were compared against the gold standard comprising 250 articles. For the true positive value, complete matches, under-extracted, and over-extracted annotations were included. The results can be seen in Table 3.¹

Table 3: Statistics for the who, when, where, what and why

        Who      When     Where    What     Why
CM      63.46%   67.53%   53.82%   40.4%    39.2%
UE      2.41%    4.43%    4.86%    12%      9.6%
OE      0.92%    0.74%    1.39%    36.8%    1.2%
CMM     33.17%   27.31%   39.93%   10.8%    50%
TPCM    59.23%   35.51%   11.11%   40.4%    10.8%
TPPM    3.19%    5.07%    6.06%    48.8%    10.8%
FP      4.78%    5.80%    21.89%   10.8%    10%
TN      0.91%    30.80%   41.08%   0%       28.4%
FN      31.89%   22.83%   19.87%   0%       40%
P       92.88%   87.50%   43.97%   89.2%    68.35%
R       66.18%   64.00%   46.36%   100%     35.06%
A       63.33%   71.38%   58.25%   89.2%    50%
F       77.29%   73.93%   45.13%   94.29%   46.35%

¹ CM - Complete Match Rate; UE - Under-extracted Rate; OE - Over-extracted Rate; CMM - Complete Mismatch Rate; TPCM - True Positive for Complete Match; TPPM - True Positive for Partial Match; FP - False Positive; TN - True Negative; FN - False Negative; P - Precision; R - Recall; A - Accuracy; F - F-Score

Based on the statistics shown, the when was able to obtain the highest complete match rate, while the why has the lowest. This was possibly because the when had only a limited number of frequent candidates that could be seen across the news articles (i.e. seven days in a week, twelve months, holidays, relative days), making it easier to identify the candidates.

For the who and where, both had slightly lower complete match rates compared to that of the when. The candidates produced seemed to be greater in number because of the many different possible who and where across articles. The reason is that people and places of significance can change over time, unlike the more constant when candidates. Thus, the candidate selection and feature extraction had a more difficult time identifying the correct who and where candidates for the article.

On the other hand, the what has less than half complete matches. However, the combined number of complete matches and partial matches still greatly outnumbers the number of complete mismatches. This is because, during the implementation of the agent, it was observed that most of the what can be found in the first two sentences of the article, with 94.00% of the instances in the first sentence and 4.40% in the second. Thus, the primary problem for the what is the trimming of candidates in order to completely match what is needed (and annotated) based on the gold standard. In part, this is because the linguistic structure of Filipino makes it so that, sometimes, adjectives and other descriptors become so lengthy that some important details may be considered insignificant by the agent and are thus trimmed off. On the other hand, some phrases are not trimmed because of the presence of details that may be unnecessary but are considered linguistically significant by the agent, possibly because of misleading markers.

Moreover, the reason the recall of the what is 100% is that the agent always extracts a what feature for each article. Since partial matches are also considered as true positives, all the gold standard annotations for what were considered extracted.

Lastly, for the why, it could be observed that it obtained a high number of false negatives, which shows that the agent fails to detect the why in the article even if one is present. The agent also has difficulty in identifying the correct why from the candidates. This could probably be caused by the lack of relations between the why and what candidates. The linguistic structure of some articles proves to be difficult because of the interchangeability of the potential what and why. Thus, the agent could get confused when a supposed what is actually a why which came ahead of a what candidate. Moreover, text markers denoting reason could be misleading the agent into deciding that the phrase that follows the aforementioned text markers is the why, which matches the extracted what when, in reality, they are only related by proximity.
Furthermore, the who performed well using a machine learning approach for its feature extraction. An experiment supporting this was performed. The experiment involved comparing the final who results of two different evaluation runs, wherein the first run utilized the machine-learned model while the second only relied on the candidate selection module. The results of the experiment show that the accuracy was 63.33% for the first run while it was 38.27% for the second run.

We did the same experiment for the when and where. For the when, the agent was able to achieve an accuracy of 63.35% on the first run compared to the 16.17% it got from the second run. For the where, the first run with machine learning achieved an accuracy rate of 58.25% in comparison to the second run with an accuracy of 13.33%.

For the why, experiment results show that the accuracy of the why feature when run with machine learning algorithms went up to 50%, compared to the 47.60% accuracy it got with a rule-based feature extraction.

Table 4: Comparison between our hybrid approach and a rule-based approach using the data of the latter

Evaluation Metric   Complete Match   Under-Extracted
Hybrid Who          43.84%           2.46%
RB Who              6.06%            8.08%
Hybrid When         59.1743%         7.7981%
RB When             84.39%           0%
Hybrid Where        56.4593%         1.4354%
RB Where            19.51%           1.22%
Hybrid What         28.0%            31.5%
RB What             0.00%            5.88%
Hybrid Why          11%              7.5%
RB Why              50%              3.13%

Table 4 shows a comparison between the performance of our hybrid extraction agent and an existing rule-based extraction system [Cheng et al., 2016], using the same test data. Based on the results above, our agent proved to be better than the previous system for the who, where, and what.

For the who and where, in terms of candidate selection, the rule-based system only uses markers. On the other hand, our agent uses NER and POS tagging in addition to markers. Furthermore, for feature extraction, our agent uses a machine-learned model as compared to a weighting system to better filter out candidates.

For the what, instead of immediately constricting candidates in the candidate selection stage using markers (as done in the rule-based system), our agent retrieves entire sentences and trims the markers out during the feature extraction stage. Moreover, our agent utilizes other extracted features, including the who, when, where, and title presence, as additional weights to better determine the final what.

For the when and the why, the results show that the existing rule-based feature extraction performed better than the machine learning. However, if the data used to train the when were increased, it is possible to improve the results of the machine learning feature extraction.

5 Conclusion and Future Work

This paper presents a hybrid information extraction agent for automatically determining the 5Ws of Filipino news articles.

In conclusion, performing machine learning on who, when, where, and why was beneficial since the agent allows the models to choose which candidates are correct. The performance is also further supported by the associated pre-processing, filtering, and refining rule-based algorithms. Thus, if the model is iterated upon, the results may improve. On the other hand, using purely rule-based selection on what is beneficial since, based on the structure of most Filipino news articles, the what can be found in the first two sentences and there are common markers that can easily denote the feature.

The framework used in this study can be applied in extracting other information and features, such as perpetrator-victim, crime-ridden areas, and businesses or companies involved in a main event, among others, from news articles. However, the agent's models and algorithms would need to be modified for the new information. Specifically, the rule-based algorithms may have a different set of parameters and values, while the machine-learned models would have to be re-trained on the domain corpus of the new data. Thus, the linguistic tagging, candidate selection, and feature extraction would need to be tested and modified based on the aforementioned corpus.

Future work for the study includes integrating anaphora resolution in order to maximize the power of pronouns and other referential linguistic information. Moreover, an ontology consisting of known figures, locations, positions, and organizations in the Philippines can be incorporated to possibly improve the extracted information. Lastly, a larger and more diverse corpus of news articles can serve as examples and aid in training better models and for more exhaustive evaluation.
Acknowledgements

The authors gratefully acknowledge the Department of Science and Technology for the support under the Engineering Research and Development for Technology scholarship.

References

[Cheng et al., 2016] Charibeth Cheng, Bernadyn Cagampan, and Christine Diane Lim. Organizing news articles and editorials through information extraction and sentiment analysis. In 20th Pacific Asia Conference on Information Systems, PACIS 2016, Chiayi, Taiwan, June 27 - July 1, 2016, page 258, 2016.

[Das et al., 2010] A. Das, A. Ghosh, and S. Bandyopadhyay. Semantic role labeling for Bengali using 5Ws. In Natural Language Processing and Knowledge Engineering (NLP-KE), 2010 International Conference on, pages 1-8, Aug 2010.

[De Araujo et al., 2013] D.A. De Araujo, S.J. Rigo, C. Muller, and R. Chishman. Automatic information extraction from texts with inference and linguistic knowledge acquisition rules. In Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2013 IEEE/WIC/ACM International Joint Conferences on, volume 3, pages 151-154, Nov 2013.

[Dieb et al., 2012] T.M. Dieb, M. Yoshioka, and S. Hara. Automatic information extraction of experiments from nanodevices development papers. In Advanced Applied Informatics (IIAI-AAI), 2012 IIAI International Conference on, pages 42-47, Sept 2012.

[Rabo, 2004] V. Rabo. TPOST: A template-based, n-gram part-of-speech tagger for Tagalog. Master's thesis, De La Salle University, 2004.

[Regalado et al., 2013] R.V.J. Regalado, J.L. Chua, J.L. Co, and T.J.Z. Tiam-Lee. Subjectivity classification of Filipino text with features based on term frequency – inverse document frequency. In Asian Language Processing (IALP), 2013 International Conference on, pages 113-116, Aug 2013.

[Xubu and Guo, 2014] M. Xubu and J.E. Guo. Information extraction of strategic activities based on semi-structured text. In Computational Sciences and Optimization (CSO), 2014 Seventh International Joint Conference on, pages 579-583, July 2014.