=Paper=
{{Paper
|id=Vol-3181/paper66
|storemode=property
|title=Mediaeval 2021 Emerging News: Detection of Emerging News from Live News Stream Based on Categorization of News Annotations
|pdfUrl=https://ceur-ws.org/Vol-3181/paper66.pdf
|volume=Vol-3181
|authors=Omar Meriwani
|dblpUrl=https://dblp.org/rec/conf/mediaeval/Meriwani21
}}
==Mediaeval 2021 Emerging News: Detection of Emerging News from Live News Stream Based on Categorization of News Annotations==
<pdf width="1500px">https://ceur-ws.org/Vol-3181/paper66.pdf</pdf>
<pre>
           Mediaeval 2021 Emerging News: Detection of Emerging
          News from Live News Stream Based on Categorization of
                            News Annotations
                              Omar Meriwani
           Scientific Editor, Real Sciences website and magazine

            omar.meriwani@gmail.com, omar@real-sciences.com

Copyright 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, December 13-15 2021, Online


ABSTRACT

    This paper describes the contribution of RS_OMERIWANI to the Mediaeval
2021 Emerging News task. Among the various definitions of emerging news, this
work adopts the one under which emerging news is news that gains more
attention from news sources, i.e. the same story is published more
frequently. Relying on the categorization of the news annotations,
classification proceeds in two stages: unsupervised clustering generates
training data for a supervised neural network model, which then classifies
each news item based on the categories mentioned in it. The final model
reached an accuracy of 74%, with an F-score of 65% for detecting emerging
news. It fulfilled the requirements of newsworthiness and completeness of
reported events as well as the relevance criteria in the task evaluation.


1 INTRODUCTION

    Journalism faces an ongoing challenge of detecting news angles [1] that
both interest readers and satisfy the objectivity requirements of news.
Handing this mission to a computer requires specifying the exact meaning of
emerging news, such as the different concepts discussed in [2]. This paper
describes the contribution of RS_OMERIWANI to the Mediaeval 2021 Emerging
News task [3].
    Given sufficient news samples, the computer can perform the first part of
this task, automating the selection of emerging news. News usually receives
unbalanced interest from news sources: some stories are published by more
than ten main news sources within a specific country/region/language, while
others never get the same level of attention and are published by only one or
two sources.
    This work uses supervised and unsupervised machine learning models and
relies on categories to find the news that has a higher chance of being
published more frequently in the media.


2 APPROACH AND METHODS

    Our approach focuses on which news is likely to be published more
frequently, and on the news annotations provided by the News Hunter platform
[4], which supplies both the live news stream and the fine-grained
categorization of the named entities mentioned in the news.
    Table 1 shows some news samples with the number of times each was
published during the same two-hour time window. Some gain a lot of attention,
while others are never published by more than one or two news sources.

  Table 1: News pieces with the number of times they got published on
           different websites on 22nd October 2021

News                                                             Frequency

Haitian gang leader threatens to kill kidnapped missionaries     26
Last Known Photos Of Brian Laundrie & Gabby Petito Together      11
UK palace says queen, 95, spent night in hospital for checks     14
Braun Strowman Says WWE Turned Him Into A Corporate Monster      1
Record number of daily vaccinations in Dominican Republic        1
EXCLUSIVE: Vicki Gunvalson Addresses Breakup With Steve Lodget   2

    This approach is based on the preferences of media channels, without any
deep analysis of the news content. We assume that the attention a news
article gets depends on the nature of the named entities it deals with. For
example, the first sample in Table 1 contains the following categories:
a gang leader, an island, and members of a religious group (missionaries).
It seemed more interesting to news sources than the fourth sample, whose
combination of categories includes:
American, wrestler, World Wrestling Entertainment.
    To implement this approach, the work is divided into two parts:
     1-   Finding similar news: Using a technique to cluster
          similar news together, we created new labels based on
          the clustering results, and labeled the final training
          dataset with 1 or 0 based on a threshold of appearing
          three times or more. For example, the first three rows in
          Table 1 would be labeled as (1) and the latter three rows
          would be labeled as (0). The label (1) indicates
          emerging news.
     2-   Supervised classification: Using the resulting dataset,
          we vectorized the categories of the news annotations
          and used the vector representation as training data
          for an artificial neural network that predicts the
          labels from the previous step, either (1) for
          emerging news or (0) for other news.


2.1 Unsupervised news clustering

    News titles were transformed using a term frequency–inverse document
frequency (TFIDF) vectorizer [5] to create vectors that could be used by the
clustering algorithm.
    The K-Means algorithm was used to partition the original news dataset of
N samples into N/2 clusters. With this many clusters (two samples per cluster
on average), only closely similar titles end up together, which lets us
detect news similarity more accurately.
    Because clustering with such a high number of clusters is computationally
expensive, the original dataset of ~50K news samples was divided into 7
batches.
    The final dataset included 12,667 news samples. 4,879 of them had been
published three times or more by different news sources, and 7,788 had been
published only once or twice. Cluster membership was used to create the
labels: news items within the same cluster were labeled (1) or (0) according
to the cluster size. The data was also balanced between labels (0) and (1).


2.2 Supervised classification

    Once the news dataset had been labeled using the K-Means clustering
results, it was ready for a supervised learning model that classifies the
data using a different set of features, namely the categories of the news
annotations. The News Hunter platform already provides annotations for the
main named entities and a set of classes for these annotations. We applied a
count vectorizer to the category set of each news item.
    We used a multilayer perceptron with three hidden layers of 20 units each
(20, 20, 20). The data was split into 20% for testing and 80% for training.
    The output was then structured by returning the titles of the emerging
news together with keywords extracted using the SingleRank algorithm [6], as
implemented in the Kex Python package.


3 RESULTS

  Table 2: Precision, recall and F-score of the final model for each label

Label        Precision        Recall            F-Score

0            0.76             0.86              0.80
1            0.72             0.58              0.65

    The accuracy achieved by the model was 74%, while the F-score for
emerging news detection was 65%. The precision for emerging news, label (1),
is 0.72 (Table 2), close to the 0.76 precision of label (0). The lower recall
for label (1), 0.58, suggests that the model misses many emerging news items,
flagging a wide variety of category combinations as false negatives.
    On whether the newsworthiness and completeness requirements were
satisfied, the independent human evaluation stated: “Yes, the information
provided brings insights that can conform an event and provides extra
information with the keywords that can help to get a fast overview of the
reported story. The keywords add some extra information which is not present
in the title, helping the journalist to better understand the event.” The
evaluation also described the relevance aspect: “provides potentially
relevant events for journalist or not widely covered events that may have not
been yet seen by journalists”.


4 DISCUSSION

    Categories of named entities or annotations in the news could be used as
an auxiliary feature to support more comprehensive models. However, relying
on categories alone could fail to detect some emerging news by flagging it as
false negatives. Some aspects would still not be covered, such as the
sentiment of verbs that indicate violence, which usually attracts more
attention.
    The final output that can be extracted using this method also lacks some
features that would make it rich enough to be realistic for journalists.
    However, the model would still be efficient when working with poor news
details: it works regardless of text length, link availability, or title
format, because it is based only on the categories of the main entities.
Logically, it can work in many cases, since it deals with the abstract
attributes of the named entities mentioned in the news.


REFERENCES

[1] Marc Gallofré Ocaña and Andreas Lothe Opdahl. 2020. Challenges and
    opportunities for journalistic knowledge platforms. Proceedings of the
    CIKM 2020 Workshops. Galway, Ireland.
[2] Marc Gallofré Ocaña, Lars Nyre, Andreas Lothe Opdahl, Bjørnar Tessem,
    Christoph Trattner, Csaba Veres. 2018. Towards a big data platform for
    news angles. The 4th Norwegian Big Data Symposium (NOBIDS).
[3] Marc Gallofré Ocaña, Andreas L. Opdahl and Duc-Tien Dang-Nguyen. 2021.
    Emerging News task: Detecting emerging events from social media and news
    feeds. MediaEval’21: Multimedia Evaluation Workshop.
[4] Arne Berven, Ole A. Christensen, Sindre Moldeklev, Andreas L. Opdahl, and
    Kjetil J. Villanger. 2018. News Hunter: building and mining knowledge
    graphs for newsroom systems. NOKOBIT.

[5] T. Joachims. 1996. A Probabilistic Analysis of the Rocchio Algorithm with
    TFIDF for Text Categorization. Carnegie Mellon University, Department of
    Computer Science, Pittsburgh, PA.
[6] C. G. Broyden. 1970. The convergence of single-rank quasi-Newton methods.
    Mathematics of Computation 24, 110, 365-382.
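
ILLUSTRATIVE SKETCH

    The labeling rule of Section 2 and the count vectorization of Section 2.2
can be sketched as follows. This is a minimal sketch under stated assumptions,
not the author's code: the function names, the toy cluster ids, and the exact
category strings are hypothetical, and the K-Means clustering and neural
network steps are omitted. Only the threshold of three and the Table 1
frequencies come from the paper.

```python
from collections import Counter

# Threshold from Section 2: "appearing three times or more".
EMERGING_THRESHOLD = 3


def label_by_cluster_size(cluster_ids, threshold=EMERGING_THRESHOLD):
    """Label each item 1 if its cluster has at least `threshold` members, else 0."""
    sizes = Counter(cluster_ids)
    return [1 if sizes[c] >= threshold else 0 for c in cluster_ids]


def count_vectorize(category_lists):
    """Map lists of category strings to count vectors over a shared vocabulary."""
    vocabulary = sorted({cat for cats in category_lists for cat in cats})
    index = {cat: i for i, cat in enumerate(vocabulary)}
    vectors = []
    for cats in category_lists:
        row = [0] * len(vocabulary)
        for cat, n in Counter(cats).items():
            row[index[cat]] = n
        vectors.append(row)
    return vocabulary, vectors


# Publication frequencies of the six rows in Table 1.
table1_frequencies = [26, 11, 14, 1, 1, 2]
table1_labels = [1 if f >= EMERGING_THRESHOLD else 0 for f in table1_frequencies]

# Hypothetical cluster ids: three near-duplicate titles, one singleton, one pair.
cluster_ids = ["a", "a", "a", "b", "c", "c"]
cluster_labels = label_by_cluster_size(cluster_ids)

# Hypothetical category strings modeled on the two examples in Section 2.
category_lists = [
    ["gang leader", "island", "religious group"],
    ["american", "wrestler", "world wrestling entertainment"],
]
vocabulary, vectors = count_vectorize(category_lists)
```

    In the paper's pipeline the cluster ids would come from K-Means over
TFIDF title vectors, and the count vectors would be the input to the
multilayer perceptron; both steps would need a machine learning library and
are left out here.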



</pre>