=Paper=
{{Paper
|id=Vol-3181/paper66
|storemode=property
|title=Mediaeval 2021 Emerging News: Detection of Emerging News from Live News
Stream Based on Categorization of News Annotations
|pdfUrl=https://ceur-ws.org/Vol-3181/paper66.pdf
|volume=Vol-3181
|authors=Omar Meriwani
|dblpUrl=https://dblp.org/rec/conf/mediaeval/Meriwani21
}}
==Mediaeval 2021 Emerging News: Detection of Emerging News from Live News Stream Based on Categorization of News Annotations==
Omar Meriwani, Scientific Editor, Real Sciences website and magazine

omar.meriwani@gmail.com, omar@real-sciences.com

===ABSTRACT===
This paper describes the contribution of RS_OMERIWANI in the MediaEval 2021 Emerging News task. Among the various definitions of emerging news, this work defines emerging news as the type of news that gains more attention from news sources, i.e. the same news is published with higher frequency. Relying on the categorization of the news annotations, the classification process uses unsupervised clustering to generate training data for a supervised neural network model that classifies the news based on the categories mentioned in it. The accuracy score for the final model was 74%, with a 65% F-score for detecting emerging news. The final model fulfilled the requirements of newsworthiness and completeness of reported events, as well as the relevance criteria, in the task evaluation.

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval’21, December 13-15 2021, Online

===1 INTRODUCTION===
Journalism faces an ongoing challenge of detecting news angles [1] that both interest readers and satisfy the objectivity requirements of news. Handing this task to a computer requires specifying the exact meaning of emerging news, such as the different concepts discussed in [2]. This paper describes the contribution of RS_OMERIWANI in the MediaEval 2021 Emerging News task [3].

Given sufficient news samples, it is possible to have the computer perform the first part of this task, allowing the selection process that determines emerging news to be automated. News usually receives unbalanced interest from news sources: some news stories are published in more than ten main news sources within a specific country, region, or language, while other news never gets the same level of attention and is published in only one or two sources.

This work uses supervised and unsupervised machine learning models and relies on categories to find the news that has a higher chance of being published more frequently in the media.

===2 APPROACH AND METHODS===
Our approach focuses on predicting which news is likely to be published more frequently, and on the news annotations provided by the News Hunter platform [4], which supplies both the live news stream and a fine-grained categorization of the named entities mentioned in the news.

Table 1 shows some news samples with the number of times they were published during the same two-hour time window. Some gain a lot of attention, while others are never published by more than one or two news sources.

{| class="wikitable"
|+ Table 1: News pieces with the number of times they were published on different websites on 22nd October 2021
! News !! Frequency
|-
| Haitian gang leader threatens to kill kidnapped missionaries || 26
|-
| Last Known Photos Of Brian Laundrie & Gabby Petito Together || 11
|-
| UK palace says queen, 95, spent night in hospital for checks || 14
|-
| Braun Strowman Says WWE Turned Him Into A Corporate Monster || 1
|-
| Record number of daily vaccinations in Dominican Republic || 1
|-
| EXCLUSIVE: Vicki Gunvalson Addresses Breakup With Steve Lodget || 2
|}

This approach is based on the preference of media channels, regardless of any deep analysis of the news content. We assume that the attention a news article receives depends on the nature of the named entities it deals with. For example, the first sample in Table 1 contains the following categories: a gang leader, an island, and members of a religious group (missionaries). It attracted more interest than the fourth sample, whose combination of categories includes: American, wrestler, World Wrestling Entertainment.

To implement this approach, the work is divided into two parts:

# Finding similar news: using a technique to cluster similar news together, we created new labels based on the clustering results and labeled the final training dataset with 1 or 0 based on the threshold of appearing three times or more. For example, the first three rows in Table 1 would be labeled as (1) and the latter three rows would be labeled as (0). The label (1) indicates emerging news.
# Supervised classification: using the resulting dataset, we vectorized the categories of the news annotations and used the vector representation as training data for an artificial neural network that predicts the labels from the previous step, either (1) for emerging news or (0) for other news.

====2.1 Unsupervised news clustering====
News titles were transformed using a term frequency–inverse document frequency (TFIDF) vectorizer [5] in order to create vectors that could be used in the clustering algorithm.

The K-Means algorithm was used to make N/2 clusters of the original news dataset with N samples in total. In this way, the number of clusters is assumed to be no less than half the total number of samples, enabling us to detect the similarity of the news more accurately.

Due to the computational complexity of clustering with a high number of clusters, the original dataset of ~50K news samples was divided into 7 batches.

The final dataset included 12,667 news samples. 4,879 of them had been published three times or more by different news sources, and 7,788 news samples had only been published once or twice. The cluster assignments were used as intermediate labels, and the news items within the same cluster were labeled as (1) or (0) according to the cluster size. The data was also balanced for labels (0) and (1).

====2.2 Supervised classification====
Once the news dataset had been labeled, mainly through the indicators provided by K-Means clustering, it was ready for a supervised learning model that classifies the data using a different set of features, namely the categories of news annotations. The News Hunter platform already provides annotations for the main named entities and a set of classes for these annotations. We applied a count vectorizer to the category set of each news item.

We used a multilayer perceptron with hidden layer sizes of (20,20,20). The data was divided 2:8 for testing and training.

The output was then structured by returning the titles of the emerging news as well as the keywords extracted using the Single Rank algorithm [6], which is implemented in the Kex Python package.

===3 RESULTS===
{| class="wikitable"
|+ Table 2: Precision, recall, and F-score per label
! Label !! Precision !! Recall !! F-Score
|-
| 0 || 0.76 || 0.86 || 0.8
|-
| 1 || 0.72 || 0.58 || 0.65
|}

The accuracy achieved by the model was 74%, while the F-score for emerging news detection was 65%. The precision for emerging news, label (1), is shown in Table 2 and is close to the precision of label (0). Based on the recall score, it could be said that the model may flag many varieties of annotation categories as false negatives.

The independent human evaluation stated the following about whether the newsworthiness and completeness requirements were satisfied: “Yes, the information provided brings insights that can conform an event and provides extra information with the keywords that can help to get a fast overview of the reported story. The keywords add some extra information which is not present in the title, helping the journalist to better understand the event.” The evaluation also described the relevance aspect: “provides potentially relevant events for journalist or not widely covered events that may have not been yet seen by journalists”.

===4 DISCUSSION===
Categories of named entities or annotations in the news could be used as an auxiliary feature to support more comprehensive models. However, categories alone could fail to detect some emerging news by flagging them as false negatives. Some aspects would still not be covered, such as the sentiment of verbs that indicate violence, which may usually attract more attention.

The final output that can be extracted using this method also lacks some features that would make it rich enough to be realistic for journalists.

However, the model would still be efficient for working with news that has few details; it can work regardless of text length, link availability, or title format, because it is based only on the categories of the main entities. Logically, it can work in many cases, since it deals with the abstract attributes of the named entities mentioned in the news.

===REFERENCES===
[1] Marc Gallofré Ocaña and Andreas Lothe Opdahl. 2020. Challenges and opportunities for journalistic knowledge platforms. Proceedings of the CIKM 2020 Workshops. Galway, Ireland.

[2] Marc Gallofré Ocaña, Lars Nyre, Andreas Lothe Opdahl, Bjørnar Tessem, Christoph Trattner, Csaba Veres. 2018. Towards a big data platform for news angles. The 4th Norwegian Big Data Symposium (NOBIDS).

[3] Marc Gallofré Ocaña, Andreas L. Opdahl and Duc-Tien Dang-Nguyen. 2021. Emerging News task: Detecting emerging events from social media and news feeds. MediaEval’21: Multimedia Evaluation Workshop.

[4] Arne Berven, Ole A. Christensen, Sindre Moldeklev, Andreas L. Opdahl, and Kjetil J. Villanger. 2018. News Hunter: building and mining knowledge graphs for newsroom systems. NOKOBIT.

[5] T. Joachims. 1996. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Carnegie Mellon University, Department of Computer Science, Pittsburgh, PA.

[6] C. G. Broyden. 1970. The convergence of single-rank quasi-Newton methods. Mathematics of Computation 24, 110, 365-382.
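
The labeling rule from Section 2 (a story counts as emerging when its cluster holds three or more items) and the count vectorization of annotation categories from Section 2.2 can be illustrated with a minimal sketch. It assumes cluster assignments are already available as one integer per news item. The function names, the toy cluster assignments, and the lower-cased category strings are hypothetical and chosen for illustration only. The paper does not state which library was used, so this sketch relies only on the Python standard library and is not the author's implementation.

```python
from collections import Counter

# Threshold taken from the paper: published three times or more.
EMERGING_THRESHOLD = 3


def label_by_cluster_size(cluster_ids, threshold=EMERGING_THRESHOLD):
    """Return one 0/1 label per item: 1 if its cluster has >= threshold items."""
    sizes = Counter(cluster_ids)
    return [1 if sizes[c] >= threshold else 0 for c in cluster_ids]


def count_vectorize(category_sets):
    """Map each item's list of annotation categories to a count vector.

    Returns the sorted vocabulary and one vector per item, where position i
    holds how often vocabulary[i] occurs in that item's categories.
    """
    vocabulary = sorted({cat for cats in category_sets for cat in cats})
    index = {cat: i for i, cat in enumerate(vocabulary)}
    vectors = []
    for cats in category_sets:
        row = [0] * len(vocabulary)
        for cat in cats:
            row[index[cat]] += 1
        vectors.append(row)
    return vocabulary, vectors


# Hypothetical cluster assignments for six news items (assumption):
# cluster 0 has three items, cluster 1 has one, cluster 2 has two.
cluster_ids = [0, 0, 0, 1, 2, 2]
labels = label_by_cluster_size(cluster_ids)

# Category sets modeled on the two examples discussed in Section 2.
categories = [
    ["gang leader", "island", "religious group"],
    ["american", "wrestler", "world wrestling entertainment"],
]
vocabulary, vectors = count_vectorize(categories)
```

In this sketch, `labels` plays the role of the training target and `vectors` the role of the feature matrix that a classifier such as the multilayer perceptron described in Section 2.2 would consume.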