A Hybrid Agent for Automatically Determining and Extracting the 5Ws of Filipino News Articles

Evan Dennison S. Livelo, Andrea Nicole O. Ver, Jedrick L. Chua, John Paul S. Yao, Charibeth K. Cheng
evan dennison livelo@dlsu.edu.ph, andrea nicole ver@dlsu.edu.ph, jedrick chua@dlsu.edu.ph, john paul yao@dlsu.edu.ph, charibeth.cheng@dlsu.edu.ph
De La Salle University - Manila

Abstract

As the number of sources of unstructured data continues to grow exponentially, manually reading through all this data becomes notoriously time consuming. Thus, there is a need for faster understanding and processing of this data. This can be achieved by automating the task through the use of information extraction. In this paper, we present an agent that automatically detects and extracts the 5Ws, namely the who, when, where, what, and why, from Filipino news articles using a hybrid of machine learning and linguistic rules. The agent caters specifically to the Filipino language by working with its unique features such as ambiguous prepositions and markers, focus instead of subject and predicate, dialect influences, and others. In order to be able to maximize machine learning algorithms, techniques such as linguistic tagging and weighted decision trees are used to pre-process and filter the data as well as refine the final results. The results show that the agent achieved an accuracy of 63.33% for who, 71.38% for when, 58.25% for where, 89.20% for what, and 50.00% for why.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: Proceedings of the 4th International Workshop on Semantic Machine Learning (SML 2017), co-located with IJCAI 2017, 20th August 2017, Melbourne, Australia.

1 Introduction

Information can be found in various types of media and documents such as news [Cheng et al., 2016] and legal documents [De Araujo et al., 2013]. These documents provide different types of data beneficial to people ranging from field-specific professionals to the everyday newspaper readers. Thus, from the seemingly endless sea of unstructured data, it is important to be able to determine the appropriate information needed quickly and efficiently.

The process of automatically identifying and retrieving information from unstructured sources and structuring the information in a usable format is called Information Extraction. This task involves the use of natural language processing in the analysis of unstructured sources to identify relevant data such as named entities and word phrases through operations including tokenization, sentence segmentation, named-entity recognition (NER), part-of-speech (POS) tagging, and word scoring. This is applied to various fields such as legal documents [De Araujo et al., 2013], work e-mails [Xubu and Guo, 2014], and news articles [Cheng et al., 2016].

Our information extraction agent automatically extracts the who, when, where, what, and why of Filipino news articles. Who pertains to people, groups, or organizations involved in the main event of the news article. When refers to the date and time that the main event of the news article occurred. Where refers to the location where the main event took place. There can be one or more who, when, and where features in an article. On the other hand, what is the main event that took place, while why is the reason the main event happened. There can only be one what and why for each article. Moreover, it is possible that there are no who, when, where, what, or why features in an article if one does not exist. Figure 1 shows a sample article translated in English with the corresponding 5Ws.
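To make the cardinality of the 5Ws concrete, the sketch below shows one possible way to represent a 5W record for an article. It is an illustrative data structure only, not the authors' implementation; the class and field names are our own.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FiveWRecord:
    """5W annotations for one article: who/when/where may hold several
    values (or none), while what and why hold at most one value each."""
    who: List[str] = field(default_factory=list)    # people, groups, or organizations
    when: List[str] = field(default_factory=list)   # dates and times of the main event
    where: List[str] = field(default_factory=list)  # locations of the main event
    what: Optional[str] = None                       # the main event itself
    why: Optional[str] = None                        # the reason the main event happened

# Illustrative example: several who values, one what, no why detected.
sample = FiveWRecord(who=["police", "mayor"], when=["Lunes"], what="road closure announced")
```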
Figure 1: Sample Article Translated to English

However, the grammar of English and Filipino are not the same. Some of the nuances encountered in the latter are the differences in focus-subject order (i.e. verb first before performer) as well as the presence of ambiguous prepositions (i.e. "sa" can be applied to either a location or a date). Moreover, due to this, automatic translation of large data from Filipino to English is not feasible. Thus, our agent was designed to recognize and handle these linguistic features through a combination of machine-learned models and rule-based algorithms.

The results of this research can greatly benefit individuals and organizations reliant on Filipino newspapers such that they will be able to determine and aggregate essential information based on main events (as compared to mere presence) quickly and efficiently. Moreover, the research contributes an advancement in the field of natural language processing and semantic machine learning for the Filipino language.

2 Related Works

Information extraction has been performed in several previous studies dealing with a variety of languages and retrieving different kinds of information.

In a study by [De Araujo et al., 2013], 200 legal documents written in Portuguese concerning cases that transpired in the RS State Superior Court were analyzed in order to determine the events that occurred. The events examined in these documents included formal charges, acquittal, conviction, and questioning. In addition, the study discussed how they put the legal documents through a deep linguistic parser and then represented the tokens in a web ontology language or OWL using a linguistic data model. Moreover, they described how, after running documents through the deep linguistic parser and converting to OWL format, they formulated linguistic rules using morphological, syntactical, and part-of-speech (POS) information and integrated these with domain knowledge in order to produce a generally accurate information extraction system.

Likewise, the study of [Xubu and Guo, 2014] described how they extracted information from descriptive text involving enterprise strategies, such as e-mail, personal communication, and management documents, through manual information extraction rule definitions in order to determine the efficiency of strategic execution. Our agent also utilizes various rules and grammatical information such as POS and text markers for linguistic tagging.

Similarly, [Das et al., 2010] also adopted rule-based information extraction in order to improve the overall accuracy of their information extraction system. However, unlike [De Araujo et al., 2013] and [Xubu and Guo, 2014], they also used machine learning. They applied machine learning to their information extraction system through the use of a gold standard created from the matching answers of two annotators.

In 2012, [Dieb et al., 2012] discussed how they used part-of-speech (POS) tagging as well as regular expressions to parse texts and determine orthogonal features within the considered nanodevice research documents. In addition, they discussed how, after tokenizing and parsing the research papers, they made use of YamCha, a text chunk annotator, for machine learning in order to automatically determine the category or tag of each piece of parsed data (e.g. Source Material, Experiment Parameter) within an annotation. Our agent also learns by example through several machine-learned classification algorithms derived from annotated Filipino news articles.

Furthermore, in the field of Filipino news, the research of [Cheng et al., 2016] extracted the 5Ws from Filipino editorials through a rule-based system that determines the possible candidates for each W and uses weights to choose among the list of candidates. They reported a performance of 6.06% accuracy for who, 84.39% for when, 19.51% for where, 0.00% for what, and 50.00% for why. However, their test corpus is composed of mostly true negatives and thus, there are only a few examples as basis for implementation. Moreover, the candidates were subjected to minimal processing and filtering. Therefore, problems such as difficulty in identifying correct candidates and low precision are present.
3 Information Extraction Agent Implementation

Figure 2 shows the architecture of the hybrid information extraction agent. A hybrid approach was implemented by means of utilizing a combination of machine learning techniques and rule-based algorithms.

Figure 2: Hybrid Information Extraction Agent Architecture

A file containing a corpus of Filipino news articles acts as the agent's environment. The agent scans through the environment and gets all the Filipino news articles. Each article is then parsed and stored internally as a word table, which contains tokens with the corresponding position, POS and NER tags, and word score. The word table is passed to the candidate selection and feature extraction module to get the final who, when, where, what, and why for each article. The results are passed to the actuator that writes the corresponding annotations to the environment, which in turn saves the file and generates an inverted index file (Figure 3).

Figure 3: Sample Inverted Index File

3.1 Linguistic Tagging

Linguistic tagging is first applied to each news article and the parsed data is stored in a word table. The body of the article is initially segmented into its composite sentences and then individually tokenized. Each token is processed in order to determine the following information:

1. Part-of-speech tag, e.g. proper noun (NNP), preposition (IN), determiner (DT)
2. Named-entity tag, which includes person (PER), location (LOC), date (DATE), and organization (ORG)
3. Word score or frequency count

In order to assign each token its corresponding part-of-speech tag, a tagger was implemented using a model trained on news-relevant datasets from TPOST, a Tagalog Part-of-Speech Tagger [Rabo, 2004].

For named-entity recognition, each token is evaluated and assigned (if applicable) as a PER, LOC, DATE, or ORG. This process utilizes a Stanford NER model trained on 200 Filipino news articles.

Lastly, under linguistic tagging, word scoring is performed. Word scoring utilizes term frequency and counts how many times a token or word is encountered in an article.
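The following is a minimal sketch of how the word table described above could be assembled. The `pos_tag` and `ner_tag` arguments are placeholders standing in for the TPOST-based tagger and the Stanford NER model; the entry fields mirror the ones listed in this subsection, but the interfaces are assumptions rather than the authors' code.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class WordEntry:
    token: str
    position: int            # running token position within the article
    pos_tag: str              # e.g. NNP, IN, DT
    ner_tag: Optional[str]    # PER, LOC, DATE, ORG, or None
    word_score: int           # term frequency within the article

def build_word_table(sentences: List[List[str]], pos_tag, ner_tag) -> List[WordEntry]:
    """Builds one WordEntry per token from already-segmented, tokenized sentences.
    pos_tag(sent) and ner_tag(sent) are assumed to return one tag per token."""
    all_tokens = [tok for sent in sentences for tok in sent]
    frequencies = Counter(tok.lower() for tok in all_tokens)   # word scoring via term frequency
    table, position = [], 0
    for sent in sentences:
        pos_tags = pos_tag(sent)
        ner_tags = ner_tag(sent)
        for tok, pos, ner in zip(sent, pos_tags, ner_tags):
            table.append(WordEntry(tok, position, pos, ner, frequencies[tok.lower()]))
            position += 1
    return table
```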
3.2 Candidate Selection

Even though the articles have named-entity tags assigned to particular words, these are not enough indicators of candidates. This is because named-entity tags do not consider grammatical information and, consequently, common nouns. Moreover, what and why candidates are sentence fragments that are composed of a variety of word tokens with different part-of-speech and named-entity tags, further indicating the need for the agent to perform candidate selection.

We therefore use a rule-based approach to select possible candidates for the final who, when, where, what, and why of each article. A word or phrase is a who, when, and where candidate when:

1. It is a noun or noun phrase
2. The word or phrase acts as a subject within the article
3. For proper nouns, it has a PER or ORG named-entity tag for who, a DATE or TIME named-entity tag for when, and a LOC named-entity tag for where
4. For common nouns, it is encapsulated by neighbouring markers including Filipino determiners, conjunctions, adverbs, and punctuation

On the other hand, for the what, the agent simply chooses the first two sentences of the article's body as candidates. Lastly, for the why, the agent runs through the first six sentences of the article's body. Sentences where why feature markers are found are considered as the why candidates of the article.
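A simplified sketch of these selection rules is given below. The subject test and the marker-boundedness test are passed in as pre-computed flags, and the reason markers listed are illustrative Filipino cause markers we supply for the example; the agent's actual marker lists and phrase-chunking rules are not published in this paper.

```python
from typing import List

REASON_MARKERS = {"dahil", "sapagkat", "upang"}   # illustrative why markers only

def is_w_candidate(phrase_pos: List[str], phrase_ner: List[str], target: str,
                   acts_as_subject: bool, marker_bounded: bool) -> bool:
    """Simplified who/when/where candidate test; target is 'who', 'when', or 'where'."""
    wanted = {"who": {"PER", "ORG"}, "when": {"DATE", "TIME"}, "where": {"LOC"}}[target]
    is_noun_phrase = any(tag.startswith("NN") for tag in phrase_pos)
    if not (is_noun_phrase and acts_as_subject):
        return False
    if any(tag in wanted for tag in phrase_ner):   # proper-noun route via NER tags
        return True
    return marker_bounded                           # common-noun route: bounded by markers

def what_candidates(sentences: List[str]) -> List[str]:
    return sentences[:2]                            # first two sentences of the body

def why_candidates(sentences: List[str]) -> List[str]:
    return [s for s in sentences[:6]                # first six sentences of the body
            if any(m in s.lower().split() for m in REASON_MARKERS)]
```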
3.3 Feature Extraction

Feature extraction is then performed to narrow down the candidate pool of the who, when, where, what, and why in order to get the final results. A machine-learned model was trained and used for the who, when, where, and why, while a rule-based algorithm was developed for the what. The machine-learning algorithms tested include J48, Naive Bayes, and Support Vector Machines. Variations were also tested using boosting, bagging, and stacking. Moreover, several iterations involving feature engineering and parameter fine tuning were done to get the optimal results for each algorithm based on true positive and accuracy rates, among others.

Each of the who, when, where, and why candidates passes through a machine-learned model which determines whether or not it is a final result. The models were generated using a gold standard composed of annotated Filipino news articles. Before being fed to the machine learning algorithm, however, the gold standard articles are pre-processed and filtered into candidates as discussed previously in order to represent the data in a way such that the model can establish patterns better.

In order to do this, the gold standard articles were put through the same candidate selection module discussed previously and corresponding linguistic features were assigned to each candidate. The list of features that were tested includes the following:

1. The candidate string
2. The number of words in the candidate
3. The sentence which the candidate belongs to
4. The numeric position of the candidate in the article
5. The distance of the candidate from the beginning of the sentence it belongs to
6. The frequency count of the candidate
7. 10 neighbouring word strings before and after the candidate
8. The part-of-speech tags of the aforementioned neighbouring words

In order to determine the class attribute (whether or not it is a final W), the candidate was matched against the annotations found in the gold standard. If it matches, the class attribute is set to yes. Otherwise, it is set to no. These candidates, and their corresponding features, were used to train several models using different algorithms for testing. The features to be considered varied among the Ws, since not all of the listed features were proven useful in choosing the who, when, and where results. Table 1 shows the final feature sets.

Table 1: Final feature sets for who, when and where

                                   Who   When   Where
Candidate String                    X     X      X
No. of Words                        X     X      X
Sentence No.                        X     X      X
Position                            X
Proximity                           X
Word Score                          X     X      X
No. of Neighbouring Words           10    3      10
and their POS Tags

Furthermore, the algorithm that showed the best true positive and accuracy rates is J48 with boosting for who and J48 with bagging for when and where. The model evaluates each candidate by assigning it an acceptance probability as well as a rejection probability. If a candidate's acceptance probability is higher than its rejection probability, it is added to the final who, when, and where results. The when and where are extracted in a similar way to the who except for a few differences in parameters, values, and implementation.
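As a rough illustration of how one labeled training instance could be assembled for these classifiers, the sketch below pairs the listed features with a yes/no class attribute. The exact matching criterion against the gold standard and the feature names are our assumptions, not the authors' specification.

```python
from typing import Dict, List

def make_training_instance(candidate: str, sentence_no: int, word_score: int,
                           neighbours: List[str], neighbour_pos: List[str],
                           gold_annotations: List[str]) -> Dict:
    """One labeled instance for the who/when/where/why classifiers.
    The class attribute is 'yes' when the candidate matches a gold annotation."""
    return {
        "candidate_string": candidate,
        "no_of_words": len(candidate.split()),
        "sentence_no": sentence_no,
        "word_score": word_score,
        "neighbouring_words": neighbours,        # 10 tokens before and after the candidate
        "neighbouring_pos_tags": neighbour_pos,
        "class": "yes" if candidate in gold_annotations else "no",   # assumed exact match
    }
```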
On the other hand, for what, a weighting scheme was implemented in order to determine the best candidate. This was done since we found that determining the what is more straightforward than the other Ws. Thus, feature engineering and fine tuning a machine-learned model for this W is unnecessary and may even cause unnecessary complexities.

The implementation firstly determines the presence of the extracted who, when, and where, adding 0.3, 0.3, and 0.2 respectively to a candidate's score. The weights were chosen after several experimental iterations starting with neutral arbitrary weights of 0.5 each. Secondly, the sentence number is considered. The formula for computing the additional weight based on the sentence number is given below.

weight = 1 − (0.2 × sentenceNumber)    (1)

If the extracted who, when, and where found in the candidate are present in the title, an additional 0.1 is added to the candidate score.

The candidates are then trimmed based on the presence of a list of markers composed of Filipino adverbs and conjunctions that denote cause and effect. If one of the markers is found within the candidate, the candidate is trimmed. If the marker found is a beginning marker, all words before the marker, including the marker itself, are removed. On the other hand, if the marker is an ending marker, all words after the marker, including the marker itself, are removed.

The candidate with the highest weight is then chosen as the final what result for that article.

Lastly, for why, the candidates first undergo trimming and weighting. This is done since the machine-learned models are limited to the data that is fed to them. Thus, they require an associated rule-based algorithm to pre-process the data before it is used for training or classification.

Words that come before starting markers and after ending markers are removed from the candidate. The presence of the extracted what and of the markers was also given additional weight. The final feature set used in feature extraction of the why included the following:

1. The candidate string
2. The number of words in the candidate
3. The sentence which the candidate belongs to
4. The candidate's weighted score
5. The number of who features in the candidate
6. The number of when features in the candidate
7. The number of where features in the candidate
8. 10 neighbouring word strings before and after the candidate
9. The part-of-speech tags of the aforementioned neighbouring words

Furthermore, the algorithm that showed the best true positive and accuracy rates for the why is J48.
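The weighting and trimming scheme for the what described earlier in this subsection can be sketched as follows. The weights (0.3, 0.3, 0.2, the sentence-number formula of Equation 1, and the 0.1 title bonus) come from the text above; the marker lists, the substring matching, and the 1-based sentence numbering are our assumptions.

```python
from typing import List, Optional

# Illustrative cause/effect markers; the agent's actual marker list is larger.
BEGIN_MARKERS = {"dahil", "sapagkat"}
END_MARKERS = {"kaya"}

def what_score(candidate: str, sentence_number: int, who: List[str],
               when: List[str], where: List[str], title: str) -> float:
    """Weighting scheme for what candidates; sentence_number is assumed 1-based."""
    score = 0.0
    score += 0.3 if any(w in candidate for w in who) else 0.0     # extracted who present
    score += 0.3 if any(w in candidate for w in when) else 0.0    # extracted when present
    score += 0.2 if any(w in candidate for w in where) else 0.0   # extracted where present
    score += 1 - (0.2 * sentence_number)                          # Equation (1)
    if any(w in title for w in who + when + where if w in candidate):
        score += 0.1                                              # extracted W also in the title
    return score

def trim(candidate: str) -> str:
    """Drops everything up to and including a beginning marker, and
    everything from an ending marker onwards."""
    tokens = candidate.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in BEGIN_MARKERS:
            tokens = tokens[i + 1:]
            break
    for i, tok in enumerate(tokens):
        if tok.lower() in END_MARKERS:
            tokens = tokens[:i]
            break
    return " ".join(tokens)

def choose_what(candidates: List[str], who, when, where, title) -> Optional[str]:
    """Scores the first-two-sentence candidates, trims them, and keeps the best."""
    best, best_score = None, float("-inf")
    for number, candidate in enumerate(candidates, start=1):
        score = what_score(candidate, number, who, when, where, title)
        if score > best_score:
            best, best_score = trim(candidate), score
    return best
```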
4 Results and Observations

4.1 Gold Standard

In order to train and evaluate the agent, a gold standard was created. This gold standard is composed of 250 Filipino news articles retrieved from the study of [Regalado et al., 2013]. Each article was manually annotated with its 5Ws by four annotators. For each disagreement where only two or fewer annotators agreed, the four annotators deliberated the best annotation. In the case that the decision was split, the annotation was discarded and left blank, denoting ambiguity. The resulting annotated corpus was then qualitatively evaluated by a literary expert.

Table 2: Inter-annotator agreement for the who, when, where, what and why

Feature   Value
Who       59.35%
When      61.25%
Where     71.00%
What      74.40%
Why       70.40%

Based on the inter-annotator agreement, the who and when proved to be more ambiguous than the rest. Since, based on the observations of the annotations, the what can be found in the first two sentences, the annotators found it easier to choose the annotation for this and thus there was more agreement. On the other hand, because there are many possible who and when in an article, the annotators may have had a harder time choosing all the relevant who and when in an article, thus leading to more disagreement. There is also a possibility of finding more than one possible where in an article, but based on the results it was easier for the annotators to identify the where in a given article.

4.2 Evaluation

After implementing the agent, the agent's results were compared against the gold standard comprising 250 articles. For the true positive value, complete matches, under-extracted, and over-extracted annotations were included. The results can be seen in Table 3.¹

Table 3: Statistics for the who, when, where, what and why

        Who      When     Where    What     Why
CM      63.46%   67.53%   53.82%   40.4%    39.2%
UE      2.41%    4.43%    4.86%    12%      9.6%
OE      0.92%    0.74%    1.39%    36.8%    1.2%
CMM     33.17%   27.31%   39.93%   10.8%    50%
TPCM    59.23%   35.51%   11.11%   40.4%    10.8%
TPPM    3.19%    5.07%    6.06%    48.8%    10.8%
FP      4.78%    5.80%    21.89%   10.8%    10%
TN      0.91%    30.80%   41.08%   0%       28.4%
FN      31.89%   22.83%   19.87%   0%       40%
P       92.88%   87.50%   43.97%   89.2%    68.35%
R       66.18%   64.00%   46.36%   100%     35.06%
A       63.33%   71.38%   58.25%   89.2%    50%
F       77.29%   73.93%   45.13%   94.29%   46.35%

¹ CM - Complete Match Rate; UE - Under-extracted Rate; OE - Over-extracted Rate; CMM - Complete Mismatch Rate; TPCM - True Positive for Complete Match; TPPM - True Positive for Partial Match; FP - False Positive; TN - True Negative; FN - False Negative; P - Precision; R - Recall; A - Accuracy; F - F-Score

Based on the statistics shown, the when was able to obtain the highest complete match rate, while the why has the lowest. This was possibly because the when had only a limited number of frequent candidates that could be seen across the news articles (i.e. seven days in a week, twelve months, holidays, relative days), making it easier to identify the candidates.

For the who and where, both had slightly lower complete match rates compared to that of the when. The candidates produced seemed to be greater in number because of the many different possible who and where across articles. The reason is that people and places of significance can change over time, unlike the more constant when candidates. Thus, the candidate selection and feature extraction had a more difficult time identifying the correct who and where candidates for the article.

On the other hand, the what has less than half complete matches. However, the combined number of complete matches and partial matches still greatly outnumbers the number of complete mismatches. This is because, during the implementation of the agent, it was observed that most of the what can be found in the first two sentences of the article, with 94.00% of the instances in the first sentence and 4.40% in the second. Thus, the primary problem for the what is the trimming of candidates in order to completely match what is needed (and annotated) based on the gold standard. In part, this is because the linguistic structure of Filipino makes it so that, sometimes, adjectives and other descriptors become so lengthy that some important details may be considered insignificant by the agent and are thus trimmed off. On the other hand, some phrases are not trimmed because of the presence of details that may be unnecessary but are considered linguistically significant by the agent, possibly because of misleading markers.

Moreover, the reason the recall of the what is 100% is that the agent always extracts a what feature for each article. Since partial matches are also considered as true positives, all the gold standard annotations for what were considered extracted.

Lastly, for the why, it could be observed that it obtained a high number of false negatives, which shows that the agent fails to detect the why in the article even if one is present. The agent also has difficulty in identifying the correct why from the candidates. This could probably be caused by the lack of relations between the why and what candidates. The linguistic structure of some articles proves to be difficult because of the interchangeability of the potential what and why. Thus, the agent could get confused when a supposed what is actually a why which came ahead of a what candidate. Moreover, text markers denoting reason could be misleading the agent into deciding that the phrase that follows the aforementioned text markers is the why, which matches the extracted what when, in reality, they are only related by proximity.
Furthermore, the who performed well using a machine learning approach for its feature extraction. An experiment supporting this was performed. The experiment involved comparing the final who results of two different evaluation runs, wherein the first run utilized the machine-learned model while the second only relied on the candidate selection module. The results of the experiment show that the accuracy was 63.33% for the first run while it was 38.27% for the second run.

We did the same experiment for the when and where. For the when, the agent was able to achieve an accuracy of 63.35% on the first run compared to the 16.17% it got from the second run. For the where, the first run with machine learning achieved an accuracy rate of 58.25% in comparison to the second run with an accuracy of 13.33%.

For the why, experiment results show that the accuracy of the why feature when run with machine learning algorithms went up to 50%, compared to the 47.60% accuracy it got with a rule-based feature extraction.

Table 4: Comparison between our hybrid approach and a rule-based approach using the data of the latter

Evaluation Metric   Complete Match   Under-Extracted
Hybrid Who          43.84%           2.46%
RB Who              6.06%            8.08%
Hybrid When         59.1743%         7.7981%
RB When             84.39%           0%
Hybrid Where        56.4593%         1.4354%
RB Where            19.51%           1.22%
Hybrid What         28.0%            31.5%
RB What             0.00%            5.88%
Hybrid Why          11%              7.5%
RB Why              50%              3.13%

Table 4 shows a comparison between the performance of our hybrid extraction agent and an existing rule-based extraction system [Cheng et al., 2016], using the same test data. Based on the results above, our agent proved to be better than the previous system for the who, where, and what.

For the who and where, in terms of candidate selection, the rule-based system only uses markers. On the other hand, our agent uses NER and POS tagging in addition to markers. Furthermore, for feature extraction, our agent uses a machine-learned model as compared to a weighting system to better filter out candidates.

For the what, instead of immediately constricting candidates in the candidate selection stage using markers (as done in the rule-based system), our agent retrieves entire sentences and trims the markers out during the feature extraction stage. Moreover, our agent utilizes other extracted features, including the who, when, where, and title presence, as additional weights to better determine the final what.

For the when and the why, the results show that the existing rule-based feature extraction performed better than the machine learning. However, if the data used to train the when were increased, it is possible to improve the results of the machine learning feature extraction.

5 Conclusion and Future Work

This paper presents a hybrid information extraction agent for automatically determining the 5Ws of Filipino news articles.

In conclusion, performing machine learning on who, when, where, and why was beneficial since the agent allows the models to choose which candidates are correct. The performance is also further supported by the associated pre-processing, filtering, and refining rule-based algorithms. Thus, if the model is iterated upon, the results may improve. On the other hand, using purely rule-based selection on what is beneficial since, based on the structure of most Filipino news articles, the what can be found in the first two sentences and there are common markers that can easily denote the feature.

The framework used in this study can be applied in extracting other information and features, such as perpetrator-victim, crime-ridden areas, and businesses or companies involved in a main event, among others, from news articles. However, the agent's models and algorithms would need to be modified for the new information. Specifically, the rule-based algorithms may have a different set of parameters and values, while the machine-learned models would have to be re-trained on the domain corpus of the new data. Thus, the linguistic tagging, candidate selection, and feature extraction would need to be tested and modified based on the aforementioned corpus.

Future work for the study includes integrating anaphora resolution in order to maximize the power of pronouns and other referential linguistic information. Moreover, an ontology consisting of known figures, locations, positions, and organizations in the Philippines can be incorporated to possibly improve the extracted information. Lastly, a larger and more diverse corpus of news articles can serve as examples and aid in training better models and for more exhaustive evaluation.
Acknowledgements

The authors gratefully acknowledge the Department of Science and Technology for the support under the Engineering Research and Development for Technology scholarship.

References

[Cheng et al., 2016] Charibeth Cheng, Bernadyn Cagampan, and Christine Diane Lim. Organizing news articles and editorials through information extraction and sentiment analysis. In 20th Pacific Asia Conference on Information Systems, PACIS 2016, Chiayi, Taiwan, June 27 - July 1, 2016, page 258, 2016.

[Das et al., 2010] A. Das, A. Ghosh, and S. Bandyopadhyay. Semantic role labeling for Bengali using 5Ws. In Natural Language Processing and Knowledge Engineering (NLP-KE), 2010 International Conference on, pages 1-8, Aug 2010.

[De Araujo et al., 2013] D.A. De Araujo, S.J. Rigo, C. Muller, and R. Chishman. Automatic information extraction from texts with inference and linguistic knowledge acquisition rules. In Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2013 IEEE/WIC/ACM International Joint Conferences on, volume 3, pages 151-154, Nov 2013.

[Dieb et al., 2012] T.M. Dieb, M. Yoshioka, and S. Hara. Automatic information extraction of experiments from nanodevices development papers. In Advanced Applied Informatics (IIAI-AAI), 2012 IIAI International Conference on, pages 42-47, Sept 2012.

[Rabo, 2004] V. Rabo. TPOST: A template-based, n-gram part-of-speech tagger for Tagalog. Master's thesis, De La Salle University, 2004.

[Regalado et al., 2013] R.V.J. Regalado, J.L. Chua, J.L. Co, and T.J.Z. Tiam-Lee. Subjectivity classification of Filipino text with features based on term frequency – inverse document frequency. In Asian Language Processing (IALP), 2013 International Conference on, pages 113-116, Aug 2013.

[Xubu and Guo, 2014] M. Xubu and J.E. Guo. Information extraction of strategic activities based on semi-structured text. In Computational Sciences and Optimization (CSO), 2014 Seventh International Joint Conference on, pages 579-583, July 2014.