A Prototypical Semantic Annotator for A Tribuna Newspaper Elias de Oliveira1 , Wesley Silva1 Juliana P. C. Pirovani2 , Jean-Rémi Bourguet3 1 Programa de Pós-Graduação em Informática Universidade Federal do Espı́rito Santo Av Fernando Ferrari, 514, Goiabeiras – Vitória, ES 29075-910 2 Departamento de Computação Universidade Federal do Espı́rito Santo Alto Universitário, s/n – Guararema – Alegre, ES 29500-000 3 Departmento de Ciência da Compuação, Universidade Vila Velha, Vila Velha, Brasil jean-remi.bourguet@uvv.br, {elias,juliana}@lcad.inf.ufes.br Abstract. The issue of recommending an appropriate piece of information has become essential for the news portals. In this context, a well founded ontologi- cal layer represents actually an indispensable artifact to suggest relevant news for the readers. However, news agencies still need to mine their data in order to discover valuable knowledge. In this paper, we present a prototypal auto- matic semantic annotator for the regional Brazilian newspaper called A Tri- buna. Founded on a set of inductive algorithms allowing to classify newspapers in Portuguese and extract named entities from them, our approach describes the standardized categorization and the semantic matching with DBpedia, the so- called nucleus of the Linked Open Data Cloud. We discuss the limitation of our prototype and draw some challenging perspectives to face them. Finally, our proposal paves the way to a new kind of recommendation-based systems. 1. Introduction Nowadays, we surely can encounter a large amount of information from various news portals around the world. News agencies have become modern mining platforms by con- tinuously gathering new facts and producing new narratives into their printed or digital materials. One can use this font to look into what is happening in a city, state, or country. The traditional way of reading newspapers is by browsing their pages in order to discover interesting items. Nevertheless, with the recent development of advanced methodolog- ical and computational artifacts, the newspaper portals grow into producers of sugges- tions for their readers. As mentioned, this huge bunch of data has to be well managed to perform accurate recommendations by selecting particular items and discarding oth- ers. For example, if a given user explicitly indicates a preference for a kind of news concerning innovations in Information Technology, the recommendation system may rec- ommend some new Artificial Intelligence-based tools. Such an approach is founded on content-based filtering [Pazzani and Billsus 2007]. On the other hand, it exists a collab- orative filtering approach [Herlocker et al. 2004] in which the system generates a group of similar users in terms of interests producing a recommendation based on the analysis Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). of their characteristics. In order to be able to release a competitive recommendation- based system, a well founded ontological layer represents an indispensable contemporary artifact [Cantador et al. 2008]. Indeed, by taking advantage of accurate semantic annota- tions [Wetzker et al. 2009] and domain-specific semantic networks [Nguyen et al. 2014], such systems are able to recommend items which a priori would not be revealed through classical Vector Space Model techniques [Baeza-Yates and Ribeiro-Neto 2011]. In [Branquinho-Filho and Oliveira 2017], an experimental approach is released to classify journalistic documents published on a newspaper A Tribuna. Founded in 1938, A Tribuna is currently the main regional newspaper of Espı́rito Santo - Brazil. Later, another work extended this approach by providing a framework extracting named enti- ties [Pirovani and Oliveira 2017]. In order to enrich semantically the news metadata, a Named Entity Recognizers (NER) is classically used to map elements of a document with some well known instances on the Semantic Web (see [Troncy 2008] for example). Al- though complex, NER is an important task to the Natural Language Processing (NLP). It can be understood as the task of identification and annotation of the Named Entities (NEs) in free-written text corpora. In turn, a NE can be defined as a sequence of words that is capable of representing a real-world entity [Zhang et al. 2019]. In Portuguese, few works have tackled NER-based tasks [Pirovani et al. 2019]. Among these proposals, the hybrid approach CRF+LG proposed in [Pirovani and Oliveira 2017] outclassed the results obtained by other systems performing under equivalent conditions. Conditional Random Fields (CRF) is a supervised statistical algorithm to predict an output vector y = {y0 , y1 , · · · , yT } based on the random variables given an observed features’ vec- tor x, an input vector of features {x0 , x1 , x2 , · · · , xT } [Sutton and McCallum 2011]. One of the features, ys , can conceptualize the NE class. Then, the goal is to maximize the number of labels ys ∈ y that are correctly classified, mapping x 7→ ys , for each s. The CRF+LG algorithm works first classifying the entities using Local Grammars (LG), us- ing the Unitex tool1 . In addition to post-tagging structures, the LG result is used as an additional feature for training the CRF model. The key strength of this approach is the combination of the probabilistic and linguistic models. Even if the newspaper texts have a generic structure, they gather heterogeneous data with content related to economics, pol- itics, sports, entertainment, among others. The literature already pointed out that the NEs present in these texts can represent efficient supports for information retrieval, whether used as indexing items [Pirovani et al. 2018], for clustering tasks [Spalenza et al. 2019], for classification tasks [Nadeau and Sekine 2007], or for automatic question generations [Pirovani et al. 2017]. In this work, we present a prototypal automatic semantic annotator for A Tribuna. Supported by the aforementioned seminal proposals, our approach describes a standard- ized categorization of the news articles and a semantic matching of their contents with DBpedia. The rest of this paper is structured as follows: Section 2 presents some related works. Section 3 describes the upstream inductive approach of our automatic annotator while Section 4 presents the downstream semantic categorization and matching. Finally, Section 5 presents the main conclusions so far achieved at this phase of our prototype release. 1 https://unitexgramlab.org/ 2. Related Works Semantic annotation is not a novelty for the journalists. For a long a time, they have been using their proprietary semantic tools to manually annotate contents of news items by filling in some forms. For example, BBC created its own ontology2 for modeling its own news articles. Early in 2000, the project PlanetOnto [Domingue and Motta 2000] extended a news server by providing support for ontology-driven document formalization integrat- ing browsing and deductive knowledge retrieval, personalized news feeds and alerts, and proactive identification of potentially interesting news items. Soon after, NAMIC (News Agencies Multilingual Information Categorisation) [Basili et al. 2001] was released as an architecture to extract relevant facts from the news streams of large European news agencies and to support semantic inferences by aligning the extracted concepts with Eu- roWordNet objects [Vossen 1998]. Neptuno [Castells et al. 2004] presents an emergent semantic-based technologies to improve the processes of creation, maintenance, and ex- ploitation of the digital archive of a newspapers based on a knowledge base supported by an ontology for the description of journalistic information, a semantic search module and a module for content browsing and visualization. PENG (Personalised nEws coNtent programminG) [Pasi et al. 2006] provides a set of functionalities for gathering, classify- ing and filtering heterogeneous news materials (television, radio, magazine) considering a number of individual interests. Indeed, news items can be represented in a lot of formats like XML-based ones for instance. Nevertheless, News Industry Text Format (NITF) or NewsML remains two of the most widespread formats standardized by the International Press Telecommunications Council (IPTC). Thus, a relevant initiative related in [Troncy 2008] tries to bring the IPTC news architecture into the Semantic Web by designing a workflow to populate computa- tional ontologies and enriching semantically the news metadata by processing text and performing visual analysis of photo and video of news items. Named Entity Recognizers such as GATE3 , SPROUT4 or OpenCalais5 have been used. Once the named entities are extracted, they are mapped to well known instances on the web (using Geonames for the locations and DBPedia for the persons and organizations). As it is mentionned, the field of newspapers has already been tackled by approaches dealing with artificial intelligence and semantic web techniques in reason of potential social fallouts. For example, some works relate how the journalists can use the Semantic Web standards to support the news angles creation and news pro- duction [Heravi et al. 2012, Opdahl and Tessem 2020, Panagiotidis and Veglis 2020]. But, in [Moreno et al. 2015], the authors point out a certain distance between metadata standards identified in the literature review and those in the HTML tags of the newspaper industry emphasizing the importance and needs of semantic alignments. Different initiatives have proposed ontological-based infrastructures in the journalism domain focused on linked open data-based strategies. 2 http://www.bbc.co.uk/ontologies/storyline 3 http://gate.ac.uk/ 4 http://sprout.dfki.de/ 5 http://www.opencalais.com/ In [Hopfgartner and Jose 2010], the authors extract named entities from news videos teletext using OpenCalais. Moreover, OpenCalais WebService is used to compare the actual entity string with an up-to-date database of entities and their spelling variations. This disambiguation maps these entities with a Uniform Resource Identifier (URI) and their representation in DBpedia. Then, the authors exploit the Linked Open Data Cloud (i.e. by using the SKOS vocabulary in DBpedia) to identify similar news stories that match the users interest. In [Aksaç et al. 2012], the authors release a semantic web browser that allows users to browse news web pages and also access related data resources via annota- tion and a side-bar listing all found linked data resources. In [Papadokostaki et al. 2017], the authors present an integrated platform dedicated to news articles, providing storage, indexing and searching functionalities by using semantic web technologies and services. Finally, one of the last and most complete proposal was realized through the Eu- ropean project NEWS (News Engine Web Services) [Garcı́a et al. 2006] consisting of a set of facilities like automatic extraction of metadata from news items’ contents, named entity disambiguation [Garcı́a et al. 2012], storage, retrieval of news items. Their NEWS ontology [Garcı́a et al. 2010] covers the different types of metadata that can be attached to a news item: management, categorization and content metadata. The project also pro- duced components and algorithms [Garcı́a et al. 2007] that automatically detect entities and events mentioned in a newspaper text and link them to instances in their NEWS on- tology. Notice that the content annotation module of the NEWS ontology is partially inspired by SUMO [Niles and Pease 2001] and MILO [Niles and Terry 2004]. 3. Upstream Inductive Approach Our goal has two folds. Firstly, we want to assign one of the twenty-one possible subject topics to the news article: 1) Atualidades (Current Affairs), 2) Qual a Bronca? (What’s up?), 3) Cidades (Cities), 4) Ciência e Tecnologia (Science and Technology), 5) Concur- sos (Public-Exam Competitions), 6) Economia (Economy), 7) Esporte (Sports), 8) Espe- cial (Special), 9) Famı́lia (Family), 10) Imóveis (Real State), 11) Informática (Computers & Electronics), 12) Internacional (International), 13) Minha Casa (My Home), 14) Mul- her (Woman) 15) Opinião (Opinion), 16) Polı́cia (Police), 17) Polı́tica (Politics), 18) Re- gional (Regional County), 19) Sobre Rodas (On Wheels), 20) Tudo a Ver (Everything to do with), 21) TV Tudo (All TV). To accomplish this goal, we will apply a similar but im- proved approach used in [Branquinho-Filho and Oliveira 2017], where each news article document was turned into a vector of weighted word-frequency, known as the bag-of- words approach. Secondly, we use some NLP tools and methodologies to perform the meta- information extraction from the news free-text format. The meta-information we are interested in are any of the five possible named entities: 1) Organization (ORG), 2) Per- son (PER), 3) Local (PLC), 4) Time (TME), and 5) Value (VAL), mentioned in the news- article texts. The news-article objects are all in PDF format at the site https://tribunaonline.com.br/. So, we downloaded them and extracted 45,908 arti- cles to perform the undermentioned algorithms. 3.1. The Classification of Topics Problem The classification of documents is a hard task these days of information overload [Bawden and Robinson 2009]. The problem we have at hand is to deal with 45,908 news articles, a tiny portion of the total newspaper archive of only one information source. Hence, arguments in favor of automation of this activity are unnecessary. Table 1 shows the number of documents in each class of this data set used for our experiments. Class #Docs Class #Docs Class #Docs Atualidades 5617 Especial 1470 Opinião 1634 Qual a Bronca? 346 Famı́lia 442 Polı́cia 4671 Cidades 5234 Imóveis 124 Polı́tica 5918 Ciência e Tecnologia 470 Informática 1506 Regional 1802 Concursos 309 Internacional 2187 Sobre Rodas 352 Economia 6558 Minha Casa 37 Tudo a Ver 30 Esporte 6657 Mulher 103 TV Tudo 440 Table 1. The number of documents within each class. While in the first, third, and fifth columns, we show the name of those classes also presented in the introductory part of Section 3, in the second, fourth, and sixth columns the quantities of documents for their respective class. Note that the majority of news article documents are about Esporte (Sports) class, with 6,657 files – shown in the last line of the first column. Whereas, the least populated is Tudo a ver, with only thirty document files, in the line before the last, in the fifth column. After transforming each news article in a weighted word-frequency vector, we se- lected a sample of these documents to serve as training, and the remaining used to test the classification algorithm. Within the training sample, we also subdivide this sample into two sets: one for the actual training and another for validation, plying as testing to maxi- mize the classifier performance. We used the Gradient Boosting algorithm implemented into the scikit-learn6 for the Python programming language. The goal is thus to mimic humans assigning each news articles to the already known subject class to which they belong. The process took all together less than fifty minutes, and the quality measured – more than 98% of accuracy – is much encouraging, as to the best of our knowledge we do not know a similar performance figure to do a pair comparison when a similar task is carried out totally by humans. 3.2. The Named Entity Recognition Problem Mining pieces of meta-information from free-texts is still a challenging task in the academia and industry [Augenstein et al. 2017, Nadeau and Sekine 2007]. Researchers are trying to catch up with the performance level reached by NERs systems in English also for the Portugues language [Collovini et al. 2019]. A hybrid approach combining CRF+LG proposed in [Pirovani et al. 2019] is the same strategy adopted in our experiments. This approach, besides the advantage already 6 https://scikit-learn.org/stable/ mentioned in Section 1, requires lesser training data to achieve competitive results. For the sake of illustration, in our experiments, we manually annotated 100 newspaper documents of the A Tribuna7 dataset. In total, we found 1,354 NEs of the class Person, for instance. In Table 2, we depicted the results when using 80% of documents for training and the remaining for test. Method Metrics P R F1 LG 60.72 46.41 52.61 CRF+LG 55.00 58.97 56.92 ORG 18.17 47.35 26.26 PER 58.64 70.45 64.01 PLC 43.24 41.27 42.23 TME 78.76 79.92 79.34 VAL 53.80 33.21 41.07 Table 2. Results of automatic extracting NEs The first column, in Table 2, shows the strategy used to yield the results in the following columns. The LG and the CRF+LG approaches are respectively in the third and fourth lines. From the fifth line onward, are the best results by the CRF+LG approach. The Precision, Recall, and F1 metrics are respectively in the third, fourth, and fifth columns. The worst results for F1, an average of 26.26%, is that carried out by the CR+LG for the ORG entity class, in the fifth line. The best figure for recall, 79.34%, in the fifth column, eighth line, is obtained for the TME class. The LG was superior to the CRF+ÇG in precision, but the later was superior in all the remaining metrics. Once the newspaper item output the upstream inductive approach, our prototype will perform standardized categorizations and semantic matchings with DBpedia. 4. Downstream Semantic Alignments The initial approach presented in the Section 3 was thought to support an internal catego- rization of the news article for archival data storage. As presented in the subsection 3.1, there were 21 selected categories. Our approach plugs a semantic annotator downstream of the inductive algorithms. The IPTC NewsCodes defining 36 thesauri, we will focus on items subjects consisting of about 1400 terms organized into a taxonomy of three levels. Each Subject Reference is identified by an eight-digit decimal string. The terms are orga- nized in a taxonomy as described in the Figure 1. Originally, the subsumption relationship is not explicit but instead encoded into the coding scheme identifying the terms. For ex- ample, ”survey” (subj:13006001) is narrower than ”research” (subj:13006000) which is narrower than ”science and technology” (subj:13000000) because they share the two and the four first digits. 7 http://www.inf.ufes.br/∼elias/dataSets/aTribuna-21dir.tar.gz Figure 1. IPTC NewsCodes Subjects Code Taxonomy IPTC shares its Controlled Vocabularies (CV) by a server at http: //dev.iptc.org/NewsCodes-CV-Server. The users can use this server for the retrieval of full CVs or only single concepts. The datasets of the CVs and concepts are delivered in five different formats: HTML as human readable variant, and NewsML-G2 Knowledge Items (XML), RDF/XML or RDF/Turtle and JSON/JSON- LD as primarily machine readable variants. Thus, the thesauri are extractable into SKOS, an application of RDF, making the subsumption relationships explicit (i.e. skos:narrower, skos:broader). Each term is thus identified by a dereferencable URI. Moreover, SKOS allows to encode other semantic relations with other controlled vocabularies (from IPTC or from other vocabularies in the LOD). For example, the two triples presented below and extracted from the IPTC News- Codes Subjects vocabulary called cptall-en-GB.rdf described the fact that the topic scientific research (scientific and methodical investigation of events, procedures and in- teractions to explain why they occur, or to find solutions for problems) with the URI medtop:20000735 is exactly similar to the subject research (a methodical investiga- tion of events or procedures to explain why they occur, or to find solutions for problems) with the URI subj:13006000; and presents a semantic similarity with the subject sur- vey (examination of public attitudes on various subjects or issues, such as the quality of goods, the value of services) with the URI subj:13006001. subj:13006000 skos:exactMatch medtop:20000735 subj:13006001 skos:closeMatch medtop:20000735 One of the objectives of our framework is to provide an automatic semantic news categorization. Our semantic categorization module is producing assertions by us- ing a own made controlled vocabulary called cjat.rdf. In this vocabulary, we en- coded some alignments between the subjects from a Tribuna (those presented in Ta- ble 1) with the IPTC NewsCodes Subjects. In Figure 2, the vocabularies cjat.rdf, cptall-en-GB.rdf and their alignments are visualized by the sparna-labs skos- play 8 . Figure 2. Alignments in cjat.rdf All the outputs obtained by the NER algorithm presented in previous section will be used as potential matchers with resources on DBpedia. We use the DBpedia Lookup Service9 to look up DBpedia URIs by related keywords (label or anchor text). Note that, in this prototype version, we retain the first ranked resource as matcher. Once the type of the article is encoded and all the matchers are gathered, our writer module will build the RDF representation of the news article. The URI of the article is formed by the concatenation of the namespace da Tribuna with the date of the edition, the beginning 8 http://labs.sparna.fr/skos-play/ 9 https://wiki.dbpedia.org/lookup/ page and an integer representing the id of the article in this page. In the Figure 3, we present an article published on the 1st of March 2019 in the fourth page and identified by the URI https://tribunaonline.com.br/010319p04a2. After inputting the automatic semantic annotator and then the writer module, the RDF excerpt is produced. We chose to use the BBC Creative Work Ontology 10 to support this phase. The reasons are multiple: this is a core ontology well established and recognized, the object property category (resp. tag) allows to encode properly the categorization of the article (resp. the relation with the extracted NEs) and alignments are already encoded towards other vocabularies. At the end of our current workflow, the RDF Graph of the article is obtained by using the ontology-visualization API11 . Figure 3. Prototype Even if our approach is promising, some important issues have to be solved. A disambiguation can be required to select the right matcher. For exam- ple, http://dbpedia.org/page/Vale maps actually to more than 30 possible resources. Note that, such a disambiguation process could also be performed among the ranked output resources of the DBpedia Lookup output. We intend to browse a small part of the DBPedia semantic network in order to find some evidence to select (or not) the best matcher. Another recurrent issue is the absence of the proper resources in DBpedia but not in Wikipedia. In the news presented above, it was the case for https://pt.wikipedia.org/wiki/EF-118, https://pt. wikipedia.org/wiki/Estrada_de_Ferro_Vit%C3%B3ria_a_Minas and 10 https://www.bbc.co.uk/ontologies/creativework 11 https://github.com/fatestigma/ontology-visualization https://pt.wikipedia.org/wiki/Ferrovia_Litor%C3%A2nea_Sul. As wikipedia webpages represent semi-structured semantic datasets, it could be interest- ing to process them also. Finally, we plan to perform matchings with other dataset among the LOD Cloud. An intuitive perspective would be to perform the disambiguations of the locations by consulting Geonames or Geoplanet in addition. Implementing a system of vote among the matchers could support such a task. 5. Conclusions In this paper, we presented a prototypal semantic annotator from a free-text archive of the regional Brazilian newspaper called A Tribuna. Founded on a set of inductive algorithms allowing to classify newspapers in Portuguese and extract NEs from them, our approach describes both an upstream inductive approach and a downstream semantic categorization alignments with IPTC NewsCodes Subjects taxonomy and NEs matching with DBpedia. We also discussed the current limitation of our prototype and draw some challenging perspectives to face them. Nevertheless, up to this stage, we are ready to automatically produce recommendations supported by an ontological layer browsing. In our example, we could easily imagine to recommend news concerning past governors by using the richness of the semantic network in DBpedia. For future work, we intend to improve the non-symbolic part of our approach. This is a promising line of research especially with respect to the minimization of the manpower efforts [Oliveira et al. 2014, Spalenza et al. 2019]. We also intend to deal with a larger and more precise NER to improve the quality of the performance of our algo- rithms [Collovini et al. 2019, Pirovani et al. 2019]. We also project to deal with other do- mains like institutional repositories, educational materials and jobs offerings. Mentioning this, one of the possible horizons in terms of potentially suitable products would be to release a Graphical User Interface supporting the Semantic Web ideals [Bourguet 2017]. References Aksaç, A., Ozturk, O., and Dogdu, E. (2012). A Novel Semantic Web Browser for User Centric Information Retrieval: PERSON. Expert Syst. Appl., 39(15):12001–12013. Augenstein, I., Derczynski, L., and Bontcheva, K. (2017). Generalisation in Named Entity Recognition: A Quantitative Analysis. Computer Speech & Language, 44:61 – 83. Baeza-Yates, R. and Ribeiro-Neto, B. (2011). Modern Information Retrieval. Addison- Wesley, New York, 2 edition. Basili, R., Catizone, R., Padró, L., Pazienza, M. T., Rigau, G., Setzer, A., Webb, N., Wilks, Y., and Zanzotto, F. M. (2001). Multilingual Authoring: the NAMIC Approach. In Proceedings of the Workshop on Human Language Technology and Knowledge Man- agement@ACL 2001, Toulouse, France, July 9-11, 2001. Bawden, D. and Robinson, L. (2009). The Dark Side of Information: Overload, Anxiety and other Paradoxes and Pathologies. J. Inf. Sci., 35(2):180–191. Bourguet, J. (2017). Worldwide scholarships spreading. In Rus, V. and Markov, Z., editors, Proceedings of the Thirtieth International Florida Artificial Intelligence Re- search Society Conference, FLAIRS 2017, Marco Island, Florida, USA, May 22-24, 2017, pages 670–675. AAAI Press. Branquinho-Filho, D. and Oliveira, E. (2017). Automatic Classification of Journalistic Documents on the Internet. TransInformação, 29(3):245–255. Cantador, I., Bellogı́n, A., and Castells, P. (2008). A Multilayer Ontology-Based Hybrid Recommendation Model. AI Commun., 21(2-3):203–210. Castells, P., Perdrix, F., Pulido, E., Rico, M., Benjamins, V. R., Contreras, J., and Lorés, J. (2004). Neptuno: Semantic Web Technologies for a Digital Newspaper Archive. In Bussler, C., Davies, J., Fensel, D., and Studer, R., editors, The Semantic Web: Research and Applications, First European Semantic Web Symposium, ESWS 2004, Heraklion, Crete, Greece, May 10-12, 2004, Proceedings, volume 3053 of Lecture Notes in Com- puter Science, pages 445–458. Springer. Collovini, S., Santos, J., Consoli, B., Terra, J., Vieira, R., Quaresma, P., Souza, M., Claro, D. B., Glauber, R., and a Xavier, C. C., editors (2019). Portuguese Named Entity Recognition and Relation Extraction Tasks at IberLEF 2019, CEUR Workshop Pro- ceedings. CEUR-WS.org. Domingue, J. and Motta, E. (2000). PlanetOnto: from News Publishing to Integrated Knowledge Management Support. IEEE Intelligent Systems and their Applications, 15(3):26–32. Garcı́a, N. F., Arias-Fisteus, J., Sánchez, L., and López, G. (2012). IdentityRank: Named Entity Disambiguation in the News Domain. Expert Syst. Appl., 39(10):9207–9221. Garcı́a, N. F., del Toro, J. M. B., Arias-Fisteus, J., Sánchez, L., Sintek, M., Bernardi, A., Fuentes, M., Marrara, A., and Ben-Asher, Z. (2006). NEWS: Bringing Semantic Web Technologies into News Agencies. In Cruz, I. F., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., and Aroyo, L., editors, The Semantic Web - ISWC 2006, 5th International Semantic Web Conference, ISWC 2006, Athens, GA, USA, November 5-9, 2006, Proceedings, volume 4273 of Lecture Notes in Computer Science, pages 778–791. Springer. Garcı́a, N. F., del Toro, J. M. B., Sánchez, L., and Bernardi, A. (2007). IdentityRank: Named Entity Disambiguation in the Context of the NEWS Project. In Franconi, E., Kifer, M., and May, W., editors, The Semantic Web: Research and Applications, 4th European Semantic Web Conference, ESWC 2007, Innsbruck, Austria, June 3-7, 2007, Proceedings, volume 4519 of Lecture Notes in Computer Science, pages 640–654. Springer. Garcı́a, N. F., Fuentes, D., Sánchez, L., and Arias-Fisteus, J. (2010). The NEWS Ontol- ogy: Design and Applications. Expert Syst. Appl., 37(12):8694–8704. Heravi, B. R., Boran, M., and Breslin, J. (2012). Towards Social Semantic Journalism. In Sixth International AAAI Conference on Weblogs and Social Media. Herlocker, J. L., Konstan, J. A., Terveen, L. G., and Riedl, J. T. (2004). Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems (TOIS), 22(1):5–53. Hopfgartner, F. and Jose, J. M. (2010). Semantic User Profiling Techniques for Person- alised Multimedia Recommendation. Multimedia Syst., 16(4-5):255–274. Moreno, M. J. B., Felipe, E. R., Sánchez, J. A. P., Béjar, R. M., and Lima, G. (2015). Metadatos en noticias: un análisis internacional para la representación de contenidos en periódicos. In II Congreso ISKO España-Portugal. Organización del conocimiento: sistemas de información abiertos, pages 290–303. Universidad de Murcia. Nadeau, D. and Sekine, S. (2007). A Survey of Named Entity Recognition and Classifi- cation. Lingvisticæ Investigationes, 30(1):3–26. Nguyen, T. T. S., Lu, H., and Lu, J. (2014). Web-Page Recommendation Based on Web Usage and Domain Knowledge. IEEE Trans. Knowl. Data Eng., 26(10):2574–2587. Niles, I. and Pease, A. (2001). Towards a Standard Upper Ontology. In 2nd International Conference on Formal Ontology in Information Systems, FOIS 2001, Ogunquit, Maine, USA, October 17-19, 2001, Proceedings, pages 2–9. ACM. Niles, I. and Terry, A. (2004). The MILO: A General-purpose, Mid-level Ontology. In Arabnia, H. R., editor, Proceedings of the International Conference on Information and Knowledge Engineering. IKE’04, June 21-24, 2004, Las Vegas, Nevada, USA, pages 15–19. CSREA Press. Oliveira, E., Basoni, H. G., Saúde, M. R., and Ciarelli, P. M. (2014). Combining Cluster- ing and Classification Approaches for Reducing the Effort of Automatic Tweets Clas- sification. In 6th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, Rome, Italy. IC3K. Opdahl, A. L. and Tessem, B. (2020). Ontologies for Finding Journalistic Angles. Soft- ware and Systems Modeling, pages 1–17. Panagiotidis, K. and Veglis, A. (2020). Transitions in Journalism–Toward a Semantic- Oriented Technological Framework. Journal. Media, 1:1. Papadokostaki, K., Charitakis, S., Vavoulas, G., Panou, S., Piperaki, P., Papakonstantinou, A., Lemonakis, S., Maridaki, A., Iatrou, K., Arent, P., et al. (2017). News Articles Platform: Semantic Tools and Services for Aggregating and Exploring News Articles. In Strategic Innovative Marketing, pages 511–519. Springer. Pasi, G., Bordogna, G., and Villa, R. (2006). The PENG System: Practice and Experience. In 17th International Workshop on Database and Expert Systems Applications (DEXA 2006), 4-8 September 2006, Krakow, Poland, pages 445–449. IEEE Computer Society. Pazzani, M. J. and Billsus, D. (2007). Content-Based Recommendation Systems. In Brusilovsky, P., Kobsa, A., and Nejdl, W., editors, The Adaptive Web, Methods and Strategies of Web Personalization, volume 4321 of Lecture Notes in Computer Science, pages 325–341. Springer. Pirovani, J., Alves, J., Spalenza, M., Silva, W., Silveira Colombo, C., and Oliveira, E. (2019). Adapting NER (CRF+LG) for Many Textual Genres. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), volume 2421 of CEUR Work- shop Proceedings, pages 421–433, Bilbao, Spain. CEUR-WS.org. Pirovani, J., Nogueira, M., and Oliveira, E. (2018). Indexing Names of Persons in a Newspaper Large Dataset. In 13th International Conference on the Computational Processing of Portuguese (PROPOR), volume 11122, Canela, RS. Springer. Pirovani, J. and Oliveira, E. (2017). CRF+LG: A Hybrid Approach for the Portuguese Named Entity Recognition. In 17th International Conference on Intelligent Systems Design and Applications: Intelligent Systems Design and Applications, pages 102– 113, Delhi, India. Springer, Springer International Publishing. Pirovani, J., Spalenza, M., and Oliveira, E. (2017). Geração Automática de Questões a Partir do Reconhecimento de Entidades Nomeadas em Textos Didáticos. In XXVIII Simpósio Brasileiro de Informática na Educação (SBIE), pages 1147–1156, Ceará, CE. SBC. Spalenza, M., Pirovani, J., and Oliveira, E. (2019). Structures Discovering for Optimizing External Clustering Validation Metrics. In 19th International Conference on Intelligent Systems Design and Applications: Intelligent Systems Design and Applications, pages 102–113, Delhi, India. Springer, Springer International Publishing. Sutton, C. and McCallum, A. (2011). Conditional Random Fields: An Introduction. Foundations and Trends® in Machine Learning, 4:267–373. Troncy, R. (2008). Bringing the IPTC News Architecture into the Semantic Web. In International Semantic Web Conference, pages 483–498. Springer. Vossen, P. (1998). A Multilingual Database with Lexical Semantic Networks. Dordrecht: Kluwer Academic Publishers. doi, 10:978–94. Wetzker, R., Umbrath, W., and Said, A. (2009). A Hybrid Approach to Item Recom- mendation in Folksonomies. In Proceedings of the WSDM’09 Workshop on Exploiting Semantic Annotations in Information Retrieval, pages 25–29. Zhang, H., Boons, F., and Batista-Navarro, R. (2019). Whose Story is It Anyway? Au- tomatic Extraction of Accounts from News Articles. Information Processing & Man- agement, 56(5):1837 – 1848.